loader: Implement resolve_ext_ref callback of PackInflater
It exist cases where a git server can send a pack file containing external references to objects stored in the local repository. In that case, data for such objects are not included in the pack file and must be resolved from the local git object store.
Such a pack file is typically generated when a git client asks for a thin pack to the server but can also be generated when a client is not requesting a thin pack.
It has been observed that GitHub can send such pack files to the loader for origins already visited. The loader does not ask for thin pack files so it might be an implementation issue of git-upload-pack from their side.
This is problematic as dulwich raises a KeyError exception when it finds a pack file has external references not resolved in it and thus the git loading fails.
In order to workaround that issue, implement the resolve_ext_ref callback that can be passed as parameter to the PackInflater class. When dulwich encounters an external reference in the pack file, it is calling that callback to resolve object data. As an external object has already been stored into the archive, we can reconstruct its git manifest from the archive content and thus resolve the external reference in the pack file.
Fixes #4745 (closed)
As a side note, theses changes now allow the loader to ask for thin packs but I am not sure if there is any benefits for us to do so.
Merge request reports
Activity
179 179 for pack in list(self.packs): 180 180 try: 181 181 if sha in pack: 182 return pack[sha] 182 return pack[sha] # type: ignore changed this line in version 3 of the diff
Jenkins job DLDG/gitlab-builds #4 succeeded .
See Console Output and Coverage Report for more details.As a side note, theses changes now allow the loader to ask for thin packs but I am not sure if there is any benefits for us to do so.
Maybe this would help incremental fetches from repos with gigantic checkouts, e.g. azure-docs. Right now our incremental fetches are still pulling all the objects referenced from the new commits, I expect thin packs not to do that.
So I tried the request of a thin pack by the loader, the number of full objects contained in the received pack file is indeed significantly lesser but we have much more external objects to resolve from the archive and this significantly impacts the loader performance.
For instance when performing a new visit of https://github.com/cypress-io/cypress, this is the timing we obtain when not requesting a thin pack:
14:01 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/cypress-io/cypress INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/cypress-io/cypress' with type 'git' Enumerating objects: 5812, done. Counting objects: 100% (4589/4589), done. Compressing objects: 100% (1620/1620), done. Total 5812 (delta 3541), reused 3495 (delta 2967), pack-reused 1223 INFO:swh.loader.git.loader:Listed 8803 refs for repo https://github.com/cypress-io/cypress INFO:swh.loader.git.loader.GitLoader:Fetched 5813 objects; 5172 are new {'status': 'eventful'} for origin 'https://github.com/cypress-io/cypress' real 0m18,918s user 0m7,185s sys 0m0,735s
While this is the timing we obtain when requesting a thin pack:
14:02 $ time swh loader -C ~/.config/swh/loader.yml run git https://github.com/cypress-io/cypress INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/cypress-io/cypress' with type 'git' Enumerating objects: 5812, done. Counting objects: 100% (4589/4589), done. Compressing objects: 100% (713/713), done. Total 5812 (delta 4018), reused 4410 (delta 3874), pack-reused 1223 INFO:swh.loader.git.loader:Listed 8803 refs for repo https://github.com/cypress-io/cypress INFO:swh.loader.git.loader.GitLoader:Fetched 5813 objects; 5172 are new {'status': 'eventful'} for origin 'https://github.com/cypress-io/cypress' real 1m47,016s user 0m14,352s sys 0m1,265s
So definitely not an option for us to request thin packs.
Edited by Antoine LambertFor the record, I tried to patch dulwich to not walk the external refs chain (by commenting these two lines) but obviously we miss a lot of deltified objects contained in the pack as the full chain of deltas is not walked.
Edited by Antoine LambertI forgot to detail how the loader can request a thin pack, by applying that simple diff:
diff --git a/swh/loader/git/loader.py b/swh/loader/git/loader.py index e9fd4e9..e266336 100644 --- a/swh/loader/git/loader.py +++ b/swh/loader/git/loader.py @@ -215,7 +215,7 @@ class GitLoader(BaseGitLoader): logger.debug("Transport url to communicate with server: %s", transport_url) client, path = dulwich.client.get_transport_and_path( - transport_url, thin_packs=False + transport_url, thin_packs=True ) logger.debug("Client %s to fetch pack at %s", client, path)
But as I said above, it significantly increases loading time for an already visited git repository due to the objects that need to be fetched from the archive. Fetching the objects by batch could improve overall performance when requesting a thin pack but it is currently not available for content and directory objects which are the real bottleneck here.
413 d = cnt.to_dict() 414 d["data"] = storage.content_get_data(cnt.sha1) 415 cnt = Content.from_dict(d) 416 set_ext_ref(Blob.type_num, content_git_object(cnt)) 417 if sha1 not in ext_refs and not list(storage.directory_missing([sha1])): 418 dir = directory_get(storage, sha1) 419 assert dir is not None 420 set_ext_ref(Tree.type_num, directory_git_object(dir)) 421 if sha1 not in ext_refs and not list(storage.revision_missing([sha1])): 422 rev = storage.revision_get([sha1], ignore_displayname=True)[0] 423 assert rev is not None 424 set_ext_ref(Commit.type_num, revision_git_object(rev)) 425 if sha1 not in ext_refs and not list(storage.release_missing([sha1])): 426 rel = storage.release_get([sha1], ignore_displayname=True)[0] 427 assert rel is not None 428 set_ext_ref(Tag.type_num, release_git_object(rel)) - Comment on lines +408 to +428
I think we could reduce the number of back and forth with storage by directly doing
content_find
/{directory,revision,release}_get
and checking for None.This is worth adding statsd counters for, so we can tweak in which order we do these queries. Something like:
swh_loader_git_external_reference_fetch_total{type="content/directory/revision/release",result="found/not_found"}
I think we could reduce the number of back and forth with storage by directly doing
content_find
/{directory,revision,release}_get
and checking for None.Of course, what was I thinking ?
This is worth adding statsd counters for, so we can tweak in which order we do these queries. Something like:
swh_loader_git_external_reference_fetch_total{type="content/directory/revision/release",result="found/not_found"}
Ack but when not querying a thin pack, fetching an external reference from the archive is quite rare (currently we have a dozen of github loadings per day that require such operations) and only a few objects need to be reconstructed, so not sure it will be really useful.
I think statsd probes will be cheap to have so we can see how that behavior evolves over time. Fetching individual objects from the archive may be more expensive than fetching them again from the remote as well, when we introduce a smarter way of adding objects to the archive, so we may want to tweak that eventually.
This would also give us a better sense of what order we should be fetching the objects in (i.e. what type of object is being needed most).
changed this line in version 2 of the diff
252 255 self.loader.dumb = False 253 256 assert self.loader.load() == {"status": "uneventful"} 254 257 258 def test_loader_with_ref_delta_in_pack(self, mocker): Shouldn't the name be "not in pack"?
No ref delta is a type of object contained in a pack.
And could you add a docstring to explain this? (you can copy-paste the commit description)
Sure.
changed this line in version 2 of the diff
Jenkins job DLDG/gitlab-builds #5 succeeded .
See Console Output and Coverage Report for more details.mentioned in merge request !144 (merged)
added 5 commits
-
93fe9982...5d3926c9 - 2 commits from branch
swh/devel:master
- 5d93f40a - loader: Implement resolve_ext_ref callback of PackInflater
- 4fe16a3e - loader: Check resolved git objects from the archive are not corrupted
- bcaf4241 - loader: Add statsd metric to count number of resolved external refs
Toggle commit list-
93fe9982...5d3926c9 - 2 commits from branch
Jenkins job DLDG/gitlab-builds #15 succeeded .
See Console Output and Coverage Report for more details.