Skip to content
Snippets Groups Projects

loader: Implement resolve_ext_ref callback of PackInflater

5 unresolved threads

It exist cases where a git server can send a pack file containing external references to objects stored in the local repository. In that case, data for such objects are not included in the pack file and must be resolved from the local git object store.

Such a pack file is typically generated when a git client asks for a thin pack to the server but can also be generated when a client is not requesting a thin pack.

It has been observed that GitHub can send such pack files to the loader for origins already visited. The loader does not ask for thin pack files so it might be an implementation issue of git-upload-pack from their side.

This is problematic as dulwich raises a KeyError exception when it finds a pack file has external references not resolved in it and thus the git loading fails.

In order to workaround that issue, implement the resolve_ext_ref callback that can be passed as parameter to the PackInflater class. When dulwich encounters an external reference in the pack file, it is calling that callback to resolve object data. As an external object has already been stored into the archive, we can reconstruct its git manifest from the archive content and thus resolve the external reference in the pack file.

Fixes #4745 (closed)

As a side note, theses changes now allow the loader to ask for thin packs but I am not sure if there is any benefits for us to do so.

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
179 179 for pack in list(self.packs):
180 180 try:
181 181 if sha in pack:
182 return pack[sha]
182 return pack[sha] # type: ignore
  • Jenkins job DLDG/gitlab-builds #4 succeeded .
    See Console Output and Coverage Report for more details.

  • FWIW LGTM from a Dulwich perspective.

    • As a side note, theses changes now allow the loader to ask for thin packs but I am not sure if there is any benefits for us to do so.

      Maybe this would help incremental fetches from repos with gigantic checkouts, e.g. azure-docs. Right now our incremental fetches are still pulling all the objects referenced from the new commits, I expect thin packs not to do that.

    • Author Maintainer

      So I tried the request of a thin pack by the loader, the number of full objects contained in the received pack file is indeed significantly lesser but we have much more external objects to resolve from the archive and this significantly impacts the loader performance.

      For instance when performing a new visit of https://github.com/cypress-io/cypress, this is the timing we obtain when not requesting a thin pack:

      14:01 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/cypress-io/cypress
      INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/cypress-io/cypress' with type 'git'
      Enumerating objects: 5812, done.
      Counting objects: 100% (4589/4589), done.
      Compressing objects: 100% (1620/1620), done.
      Total 5812 (delta 3541), reused 3495 (delta 2967), pack-reused 1223
      INFO:swh.loader.git.loader:Listed 8803 refs for repo https://github.com/cypress-io/cypress
      INFO:swh.loader.git.loader.GitLoader:Fetched 5813 objects; 5172 are new
      {'status': 'eventful'} for origin 'https://github.com/cypress-io/cypress'
      
      real    0m18,918s
      user    0m7,185s
      sys     0m0,735s

      While this is the timing we obtain when requesting a thin pack:

      14:02 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/cypress-io/cypress
      INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/cypress-io/cypress' with type 'git'
      Enumerating objects: 5812, done.
      Counting objects: 100% (4589/4589), done.
      Compressing objects: 100% (713/713), done.
      Total 5812 (delta 4018), reused 4410 (delta 3874), pack-reused 1223
      INFO:swh.loader.git.loader:Listed 8803 refs for repo https://github.com/cypress-io/cypress
      INFO:swh.loader.git.loader.GitLoader:Fetched 5813 objects; 5172 are new
      {'status': 'eventful'} for origin 'https://github.com/cypress-io/cypress'
      
      real    1m47,016s
      user    0m14,352s
      sys     0m1,265s

      So definitely not an option for us to request thin packs.

      Edited by Antoine Lambert
    • I agree that it may not be worth pursuing farther, but I think it's worth keeping this change on a branch somewhere so that we can investigate it further if needed.

    • Author Maintainer

      For the record, I tried to patch dulwich to not walk the external refs chain (by commenting these two lines) but obviously we miss a lot of deltified objects contained in the pack as the full chain of deltas is not walked.

      Edited by Antoine Lambert
    • Author Maintainer

      I forgot to detail how the loader can request a thin pack, by applying that simple diff:

      diff --git a/swh/loader/git/loader.py b/swh/loader/git/loader.py
      index e9fd4e9..e266336 100644
      --- a/swh/loader/git/loader.py
      +++ b/swh/loader/git/loader.py
      @@ -215,7 +215,7 @@ class GitLoader(BaseGitLoader):
               logger.debug("Transport url to communicate with server: %s", transport_url)
       
               client, path = dulwich.client.get_transport_and_path(
      -            transport_url, thin_packs=False
      +            transport_url, thin_packs=True
               )
       
               logger.debug("Client %s to fetch pack at %s", client, path)

      But as I said above, it significantly increases loading time for an already visited git repository due to the objects that need to be fetched from the archive. Fetching the objects by batch could improve overall performance when requesting a thin pack but it is currently not available for content and directory objects which are the real bottleneck here.

    • Please register or sign in to reply
  • 413 d = cnt.to_dict()
    414 d["data"] = storage.content_get_data(cnt.sha1)
    415 cnt = Content.from_dict(d)
    416 set_ext_ref(Blob.type_num, content_git_object(cnt))
    417 if sha1 not in ext_refs and not list(storage.directory_missing([sha1])):
    418 dir = directory_get(storage, sha1)
    419 assert dir is not None
    420 set_ext_ref(Tree.type_num, directory_git_object(dir))
    421 if sha1 not in ext_refs and not list(storage.revision_missing([sha1])):
    422 rev = storage.revision_get([sha1], ignore_displayname=True)[0]
    423 assert rev is not None
    424 set_ext_ref(Commit.type_num, revision_git_object(rev))
    425 if sha1 not in ext_refs and not list(storage.release_missing([sha1])):
    426 rel = storage.release_get([sha1], ignore_displayname=True)[0]
    427 assert rel is not None
    428 set_ext_ref(Tag.type_num, release_git_object(rel))
    • Comment on lines +408 to +428

      I think we could reduce the number of back and forth with storage by directly doing content_find/{directory,revision,release}_get and checking for None.

      This is worth adding statsd counters for, so we can tweak in which order we do these queries. Something like: swh_loader_git_external_reference_fetch_total{type="content/directory/revision/release",result="found/not_found"}

    • Author Maintainer

      I think we could reduce the number of back and forth with storage by directly doing content_find/{directory,revision,release}_get and checking for None.

      Of course, what was I thinking ?

      This is worth adding statsd counters for, so we can tweak in which order we do these queries. Something like: swh_loader_git_external_reference_fetch_total{type="content/directory/revision/release",result="found/not_found"}

      Ack but when not querying a thin pack, fetching an external reference from the archive is quite rare (currently we have a dozen of github loadings per day that require such operations) and only a few objects need to be reconstructed, so not sure it will be really useful.

    • I think statsd probes will be cheap to have so we can see how that behavior evolves over time. Fetching individual objects from the archive may be more expensive than fetching them again from the remote as well, when we introduce a smarter way of adding objects to the archive, so we may want to tweak that eventually.

      This would also give us a better sense of what order we should be fetching the objects in (i.e. what type of object is being needed most).

    • Author Maintainer

      Ack, I will add the probes then.

    • Antoine Lambert changed this line in version 2 of the diff

      changed this line in version 2 of the diff

    • Please register or sign in to reply
  • 252 255 self.loader.dumb = False
    253 256 assert self.loader.load() == {"status": "uneventful"}
    254 257
    258 def test_loader_with_ref_delta_in_pack(self, mocker):
    • Thanks, LGTM.

      What happens if the object we get from the storage is corrupt, though? (Could you add a test for that, if it's not too hard? you can get corrupt objects by attr.evolve()-ing their id after creating them)

    • Author Maintainer

      I guess we should raise an exception in that case, I will add a test for this case.

    • Please register or sign in to reply
  • Antoine Lambert added 3 commits

    added 3 commits

    • 6dbfa2e8 - loader: Implement resolve_ext_ref callback of PackInflater
    • 564d474d - loader: Check resolved git objects from the archive are not corrupted
    • 93fe9982 - loader: Add statsd metric to count number of resolved external refs

    Compare with previous version

  • Jenkins job DLDG/gitlab-builds #5 succeeded .
    See Console Output and Coverage Report for more details.

  • vlorentz approved this merge request

    approved this merge request

  • vlorentz mentioned in merge request !144 (merged)

    mentioned in merge request !144 (merged)

  • Antoine Lambert added 5 commits

    added 5 commits

    • 93fe9982...5d3926c9 - 2 commits from branch swh/devel:master
    • 5d93f40a - loader: Implement resolve_ext_ref callback of PackInflater
    • 4fe16a3e - loader: Check resolved git objects from the archive are not corrupted
    • bcaf4241 - loader: Add statsd metric to count number of resolved external refs

    Compare with previous version

  • Author Maintainer

    ^ Rebase

  • Jenkins job DLDG/gitlab-builds #15 succeeded .
    See Console Output and Coverage Report for more details.

  • Please register or sign in to reply
    Loading