I executed that script and dumped its output to the following file: big_git_origins. 11074 origins were extracted, 10179 of them coming from GitHub.
I was wondering if we could get the size of the pack file sent by GitHub for the full content of a repository without cloning it; it turns out this is possible by querying the GitHub API, see the example below.
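For illustration only (this is a minimal sketch, not the exact example from this comment), such a query can be done against the public GET /repos/{owner}/{repo} endpoint, whose size field is expressed in kilobytes; authentication is optional but avoids the low unauthenticated rate limit:

import requests


def github_repo_size_kib(owner, repo, token=None):
    # Return the repository size reported by the GitHub API, in KiB.
    # GitHub recomputes this value roughly every hour.
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}", headers=headers
    )
    response.raise_for_status()
    return response.json()["size"]


# Example usage:
# print(github_repo_size_kib("torvalds", "linux"))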
I checked with other repositories and the sizes provided by the API are consistent with those observed during git clone operations. According to its API documentation, GitHub recomputes a repository's size every hour.
I patched the swh.core.utils.GitHubSession class by adding a method to get the size of a repository and wrote the following script to process all extracted GitHub origins from the previously created file.
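The general shape of that processing looks like the sketch below. This is an illustration only: the actual script goes through the patched GitHubSession method (which handles authentication and rate limiting), and it assumes big_git_origins contains one origin URL per line:

import requests

with open("big_git_origins") as origins, open("big_github_repos", "w") as out:
    out.write("url,size\n")  # header line, skipped by the statistics script
    for line in origins:
        url = line.strip()
        if not url.startswith("https://github.com/"):
            continue
        owner, repo = url.rstrip("/").removesuffix(".git").split("/")[-2:]
        response = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
        if response.status_code != 200:
            # repository deleted, renamed or made private since the swh visit
            continue
        out.write(f"{url},{response.json()['size']}\n")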
I executed that script and dumped the results in that file: big_github_repos.
Then I computed a couple of statistics on the sizes of repositories using the following script:
import statistics

import humanize


def human_readable_size(size):
    return humanize.naturalsize(size * 1024, binary=True, format="%.2f")


repos = []
sizes = []

for i, line in enumerate(open("big_github_repos", "r")):
    if i == 0:
        continue
    url, size = line.split(",")
    size = int(size)
    # filter out repos whose size has been reduced since swh visits
    if size >= 4 * 1024 * 1024:
        repos.append({"url": url, "size": size})
        sizes.append(size)

print(f"{len(repos)} github repositories have a pack size greater than 4.00 GiB")

repo_min_size = min(repos, key=lambda d: d["size"])
repo_max_size = max(repos, key=lambda d: d["size"])

print(
    "Repository with minimum pack size:",
    repo_min_size["url"],
    human_readable_size(repo_min_size["size"]),
)
print(
    "Repository with maximum pack size:",
    repo_max_size["url"],
    human_readable_size(repo_max_size["size"]),
)

print(f"Repositories pack size mean: {human_readable_size(statistics.mean(sizes))}")
print(f"Repositories pack size median: {human_readable_size(statistics.median(sizes))}")

quartiles = [human_readable_size(q) for q in statistics.quantiles(sizes)]
print(f"Repositories pack size quartiles: {quartiles}")

deciles = [human_readable_size(q) for q in statistics.quantiles(sizes, n=10)]
print(f"Repositories pack size deciles: {deciles}")
After executing it, we have the following results:
So the average pack file size is around 10 GiB, 75% of the repositories have a pack size smaller than 10.36 GiB, and 90% have a pack size smaller than 14.55 GiB.
Currently the maximum pack size the git loader is authorized to download is 4 GiB. Based on my understanding, the main reason for that limitation was an implementation issue in dulwich that cached the downloaded pack file multiple times in memory. That issue has been resolved since dulwich v0.20.43, which we now use in production. So if we increased the maximum authorized pack size in the git loader by doubling or tripling it, we could archive more large repositories without significant negative impact.
As a bonus, we could fetch the repository size in the git loader for a GitHub origin that has never been visited or has no valid snapshot / known refs, and avoid downloading a pack file if its size is greater than the maximum authorized one. It would save some bandwidth and enable workers to quickly skip the processing of these big git origins, as sketched below.
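A rough sketch of that early check, with hypothetical names (this is not the actual loader code or configuration):

# Sketch of the proposed early-exit check at the start of a git loader visit;
# should_skip_fetch and MAX_PACK_SIZE_BYTES are hypothetical names.

MAX_PACK_SIZE_BYTES = 4 * 1024 * 1024 * 1024  # current 4 GiB limit


def should_skip_fetch(origin_url, has_valid_snapshot, api_size_kib):
    # api_size_kib is the size reported by the GitHub API (in KiB).
    # The check only applies to origins never visited or without a valid
    # snapshot / known refs: incremental fetches produce much smaller packs
    # than the full repository size reported by the API.
    if has_valid_snapshot:
        return False
    if not origin_url.startswith("https://github.com/"):
        return False
    return api_size_kib * 1024 > MAX_PACK_SIZE_BYTES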
14:27 <+anlambert> I did some analysis about the large repositories currently rejected by the git loader in https://gitlab.softwareheritage.org/swh/devel/swh-loader-git/-/issues/3652#note_136398 for those interested
14:46 <+ardumont> nice ^ so, might be the lister gh could fetch some stats size about the origins (lifting the api call mentioned) and the scheduler could use that to route it to a queue for large workers to consume?
I'd say that's worth an MR in swh.core to actually enhance the GitHubSession class, so we can reuse it in other repositories.
We should not be calling API endpoints for individual repositories at listing time. We want the listing operation to be reasonably fast (the full GitHub listing, with current parallelism settings and only hitting the repository list endpoints, already takes a few days to a week).
I think it's fine to use the metadata that would be fetched by the metadata loader earlier in the git loading process, and to eagerly skip loading the repository if there's no recorded parent snapshot and the repository size is above the current worker's threshold.