Add to_model() method to from_disk.{Content,Directory}, to convert to canonical model objects.
They will be used by loaders, so they can deal only with canonical model objects, instead of having to do the same conversion themselves.
Depends on !249 (closed).
Migrated from D2703 (view on Phabricator)
Merge request reports
Activity
Build is green See https://jenkins.softwareheritage.org/job/DMOD/job/tox/179/ for more details.
By default,
from_disk
will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with aContent
object that has an empty data attribute.No, in that case it's a SkippedContent.
We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a
DiskBackedContent
(DiskContent
?) model, inheriting fromContent
, with apath
attribute as well as a lazydata
attribute that will read data from disk the first time it's accessed.What's the difference?
! In !250 (closed), @olasd wrote: That'd be a lot more work downstream, but wouldn't it make sense to have
get_data
returnmodel
objects directly instead of dicts?They don't exactly return the same thing;
from_disk.Content.get_data()
also returns perms, name, and path.As most users are calling
collect()
on a directory object, that would remove a large chunk of indirection from loaders.I don't understand what
collect()
has to do with it! In !250 (closed), @vlorentz wrote: By default,
from_disk
will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with aContent
object that has an empty data attribute.No, in that case it's a SkippedContent.
Then you're generating SkippedContents for all files that are supposed to be lazy-loaded from disk. That's even more wrong than my original expectation.
We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a
DiskBackedContent
(DiskContent
?) model, inheriting fromContent
, with apath
attribute as well as a lazydata
attribute that will read data from disk the first time it's accessed.What's the difference?
The difference between what and what?
! In !250 (closed), @vlorentz wrote:
! In !250 (closed), @olasd wrote: That'd be a lot more work downstream, but wouldn't it make sense to have
get_data
returnmodel
objects directly instead of dicts?They don't exactly return the same thing;
from_disk.Content.get_data()
also returns perms, name, and path.That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned
Content
s andDirectory
s is sensible (see also the lazy loading comments).As most users are calling
collect()
on a directory object, that would remove a large chunk of indirection from loaders.I don't understand what
collect()
has to do with itcollect()
is the external API ofDirectory
that's used by loaders to get the lists of objects to load. That function callsget_data()
on every node in the tree, which currently converts the data structure to a dict. So you don't have anyContent
objects on which to call.to_model
by the time you'vecollect()
ed, which makesto_model
's usefulness on its own quite limited.! In !250 (closed), @olasd wrote:
We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a
DiskBackedContent
(DiskContent
?) model, inheriting fromContent
, with apath
attribute as well as a lazydata
attribute that will read data from disk the first time it's accessed.What's the difference?
The difference between what and what?
I mean, what's the point of lazy-loading, instead of using the
data
argument offrom_disk
?! In !250 (closed), @olasd wrote: That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned
Content
s andDirectory
s is sensible (see also the lazy loading comments).ack
collect()
is the external API ofDirectory
that's used by loaders to get the lists of objects to load. That function callsget_data()
on every node in the tree, which currently converts the data structure to a dict. So you don't have anyContent
objects on which to call.to_model
by the time you'vecollect()
ed, which makesto_model
's usefulness on its own quite limited.ack
! In !250 (closed), @vlorentz wrote:
! In !250 (closed), @olasd wrote:
We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a
DiskBackedContent
(DiskContent
?) model, inheriting fromContent
, with apath
attribute as well as a lazydata
attribute that will read data from disk the first time it's accessed.What's the difference?
The difference between what and what?
I mean, what's the point of lazy-loading, instead of using the
data
argument offrom_disk
?Ah, right.
Most loaders that are using
from_disk
will be re-processing files that they've already imported in one way or another: for instance, the svn loader calls runs it for every commit in turn; package manager loaders for every version in turn; lazy-loading allows us to only carry the full data in memory at the point we're actually sending it away to the archive, rather than all the time. In testing, the memory consumption savings were quite substantial.Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tox/186/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tox/186/console
Build is green See https://jenkins.softwareheritage.org/job/DMOD/job/tox/187/ for more details.
70 with tempfile.NamedTemporaryFile(mode='w+b') as fd: 71 fd.write(b'foo') 72 fd.seek(0) 73 content = DiskBackedContent( 74 length=42, status='visible', path=fd.name, 75 sha1=b'foo', sha1_git=b'bar', sha256=b'baz', blake2s256=b'qux') 76 fd.write(b'bar') 77 fd.seek(0) 78 content_with_data = content.with_data() 79 fd.write(b'baz') 80 fd.seek(0) 81 82 assert content_with_data.data == b'bar' 83 84 def test_with_data_cannot_read(self): 85 with tempfile.NamedTemporaryFile(mode='w+b') as fd: I don't think the expected content is that useful. The test would be easier to follow with just
assert content_with_data.data == b'bar'
.Using explicit contents such as
b'written before instantiation'
,b'written before with_data()'
,b'written after with_data()'
instead of metasyntactic stuff would make the test even clearer.
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tox/188/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tox/188/console