
Add to_model() method to from_disk.{Content,Directory}, to convert to canonical model objects.

These methods will be used by loaders, so that they can deal only with canonical model objects instead of each doing the same conversion themselves (see the usage sketch below).

Depends on !249 (closed).
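
For context, a minimal usage sketch. The checkout path is hypothetical; to_model() is what this MR adds, and the rest is the existing from_disk API.

```python
from swh.model import from_disk

# Scan a directory tree from disk (the path is hypothetical).
directory = from_disk.Directory.from_disk(path=b'/srv/checkout')

# With this MR, a loader converts to canonical swh.model.model objects
# instead of hand-rolling the dict-based conversion:
model_directory = directory.to_model()  # -> swh.model.model.Directory

for name, entry in directory.items():
    if isinstance(entry, from_disk.Content):
        # -> model.Content; per the discussion below, a content whose
        #    data was not pulled into memory becomes a SkippedContent.
        model_content = entry.to_model()
```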


Migrated from D2703 (view on Phabricator)

Merge request reports

Approved by Nicolas Dandrimont

Closed by Phabricator Migration user 5 years ago (Feb 24, 2020 3:06pm UTC)

Merge details

  • The changes were not merged into generated-differential-D2703-target.

Activity

  • That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

    As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

    What do you think?

  • Author Maintainer

    > By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

    No, in that case it's a SkippedContent.

    > We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

    What's the difference?

  • Author Maintainer

    > In !250 (closed), @olasd wrote: That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

    They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

    > As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

    I don't understand what collect() has to do with it.

  • > In !250 (closed), @vlorentz wrote:
    > > By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.
    >
    > No, in that case it's a SkippedContent.

    Then you're generating SkippedContents for all files that are supposed to be lazy-loaded from disk. That's even more wrong than my original expectation.

    > > We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.
    >
    > What's the difference?

    The difference between what and what?

  • > In !250 (closed), @vlorentz wrote:
    > > In !250 (closed), @olasd wrote: That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?
    >
    > They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

    That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

    > > As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.
    >
    > I don't understand what collect() has to do with it.

    collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited. (A sketch of this flow follows this comment.)
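
A minimal sketch of the flow described in the comment above (the path and loader body are hypothetical; from_disk(), collect() and get_data() are the existing APIs):

```python
from swh.model import from_disk

directory = from_disk.Directory.from_disk(path=b'/srv/checkout')

# collect() walks the tree and calls get_data() on every node, so what
# comes out is plain dicts grouped by object type; by this point there
# are no from_disk.Content objects left to call .to_model() on.
objects = directory.collect()
for object_type, objs in objects.items():
    ...  # e.g. batch the dicts off to storage; elided here
```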

  • Author Maintainer

    > In !250 (closed), @olasd wrote:
    > > > We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.
    > >
    > > What's the difference?
    >
    > The difference between what and what?

    I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

    > In !250 (closed), @olasd wrote: That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

    ack

    > collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited.

    ack

  • > In !250 (closed), @vlorentz wrote:
    > > The difference between what and what?
    >
    > I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

    Ah, right.

    Most loaders that use from_disk will be re-processing files that they've already imported in one way or another: for instance, the svn loader runs it for every commit in turn, and package manager loaders run it for every version in turn. Lazy-loading lets us carry the full data in memory only at the point where we're actually sending it to the archive, rather than all the time. In testing, the memory consumption savings were quite substantial. (A sketch of the lazy variant follows this comment.)
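
A minimal sketch of the lazy-attribute idea discussed in this thread. This is illustrative only: the names follow @olasd's proposal, and the code that was actually merged uses an explicit with_data() step instead, as the test excerpt below shows.

```python
from functools import cached_property


class DiskBackedContent:
    """Sketch: carry content metadata plus a path, and only read the
    file the first time .data is accessed."""

    def __init__(self, *, sha1, sha1_git, sha256, blake2s256,
                 length, status, path):
        self.sha1 = sha1
        self.sha1_git = sha1_git
        self.sha256 = sha256
        self.blake2s256 = blake2s256
        self.length = length
        self.status = status
        self.path = path

    @cached_property
    def data(self) -> bytes:
        # The first access reads from disk and caches the result, so the
        # full contents sit in memory only once something (for instance,
        # sending them to the archive) actually needs them.
        with open(self.path, 'rb') as f:
            return f.read()
```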

  • Author Maintainer

    Refactored the diff to defer loading content data until after the content object is created.

Excerpt from the diff under review (it starts mid-method and cuts off at the start of the next test):

```python
    with tempfile.NamedTemporaryFile(mode='w+b') as fd:
        fd.write(b'foo')
        fd.seek(0)
        content = DiskBackedContent(
            length=42, status='visible', path=fd.name,
            sha1=b'foo', sha1_git=b'bar', sha256=b'baz', blake2s256=b'qux')
        fd.write(b'bar')
        fd.seek(0)
        content_with_data = content.with_data()
        fd.write(b'baz')
        fd.seek(0)

        assert content_with_data.data == b'bar'

def test_with_data_cannot_read(self):
    with tempfile.NamedTemporaryFile(mode='w+b') as fd:
```
  • I don't think the expected content is that useful. The test would be easier to follow with just `assert content_with_data.data == b'bar'`.

    Using explicit contents such as b'written before instantiation', b'written before with_data()' and b'written after with_data()' instead of metasyntactic stuff would make the test even clearer (sketched below).
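
A sketch of the test rewritten with those explicit payloads (same imports and DiskBackedContent as the excerpt above; the method name is made up, and truncate() calls are added because the payloads differ in length):

```python
def test_with_data_reads_file_at_call_time(self):
    with tempfile.NamedTemporaryFile(mode='w+b') as fd:
        fd.write(b'written before instantiation')
        fd.seek(0)
        content = DiskBackedContent(
            length=42, status='visible', path=fd.name,
            sha1=b'foo', sha1_git=b'bar', sha256=b'baz',
            blake2s256=b'qux')

        fd.truncate(0)
        fd.write(b'written before with_data()')
        fd.seek(0)
        content_with_data = content.with_data()

        fd.truncate(0)
        fd.write(b'written after with_data()')
        fd.seek(0)

        # The snapshot was taken when with_data() was called.
        assert content_with_data.data == b'written before with_data()'
```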

  • Thanks!

  • Merge request was accepted

  • Nicolas Dandrimont approved this merge request

  • Author Maintainer

    apply comment

  • Author Maintainer

    Merge request was merged

  • closed
