I executed that script and dumped its output to the following file: big_git_origins. 11074 origins were extracted, 10179 of them coming from GitHub.
I was wondering if we could get the size of the pack file sent by GitHub for the full content of a repository without cloning it; it turns out this is possible by querying the GitHub API, see the example below.
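For illustration only (this is a minimal sketch, not the exact example from this comment), such a query can be done against the public GET /repos/{owner}/{repo} endpoint, whose size field is expressed in kilobytes; authentication is optional but avoids the low unauthenticated rate limit:

import requests


def github_repo_size_kib(owner, repo, token=None):
    # Return the repository size reported by the GitHub API, in KiB.
    # GitHub recomputes this value roughly every hour.
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}", headers=headers
    )
    response.raise_for_status()
    return response.json()["size"]


# Example usage:
# print(github_repo_size_kib("torvalds", "linux"))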
I checked with other repositories and the sizes provided by the API are consistent with those observed during git clone operations. According to its API documentation, GitHub recomputes a repository's size every hour.
I patched the swh.core.utils.GitHubSession class by adding a method to get the size of a repository and wrote the following script to process all extracted GitHub origins from the previously created file.
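The general shape of that processing looks like the sketch below. This is an illustration only: the actual script goes through the patched GitHubSession method (which handles authentication and rate limiting), and it assumes big_git_origins contains one origin URL per line:

import requests

with open("big_git_origins") as origins, open("big_github_repos", "w") as out:
    out.write("url,size\n")  # header line, skipped by the statistics script
    for line in origins:
        url = line.strip()
        if not url.startswith("https://github.com/"):
            continue
        owner, repo = url.rstrip("/").removesuffix(".git").split("/")[-2:]
        response = requests.get(f"https://api.github.com/repos/{owner}/{repo}")
        if response.status_code != 200:
            # repository deleted, renamed or made private since the swh visit
            continue
        out.write(f"{url},{response.json()['size']}\n")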
I executed that script and dumped the results in that file: big_github_repos.
Then I computed a couple of statistics on the sizes of repositories using the following script:
import statistics

import humanize


def human_readable_size(size):
    return humanize.naturalsize(size * 1024, binary=True, format="%.2f")


repos = []
sizes = []

for i, line in enumerate(open("big_github_repos", "r")):
    if i == 0:
        continue
    url, size = line.split(",")
    size = int(size)
    # filter out repos whose size has been reduced since swh visits
    if size >= 4 * 1024 * 1024:
        repos.append({"url": url, "size": size})
        sizes.append(size)

print(f"{len(repos)} github repositories have a pack size greater than 4.00 GiB")

repo_min_size = min(repos, key=lambda d: d["size"])
repo_max_size = max(repos, key=lambda d: d["size"])

print(
    "Repository with minimum pack size:",
    repo_min_size["url"],
    human_readable_size(repo_min_size["size"]),
)
print(
    "Repository with maximum pack size:",
    repo_max_size["url"],
    human_readable_size(repo_max_size["size"]),
)

print(f"Repositories pack size mean: {human_readable_size(statistics.mean(sizes))}")
print(f"Repositories pack size median: {human_readable_size(statistics.median(sizes))}")

quartiles = [human_readable_size(q) for q in statistics.quantiles(sizes)]
print(f"Repositories pack size quartiles: {quartiles}")

deciles = [human_readable_size(q) for q in statistics.quantiles(sizes, n=10)]
print(f"Repositories pack size deciles: {deciles}")
After executing it, we have the following results:
So the average pack file size is around 10 GiB, 75% of the repositories have a pack size smaller than 10.36 GiB, and 90% have a pack size smaller than 14.55 GiB.
Currently the maximum pack size the git loader is authorized to download is 4 GiB. Based on my understanding, the main reason for that limitation was an implementation issue in dulwich that cached the downloaded pack file multiple times in memory. That issue has been resolved since dulwich v0.20.43, which we now use in production. So if we increased the maximum authorized pack size in the git loader by doubling or tripling it, we could archive more large repositories without significant negative impact.
As a bonus, we could fetch the repository size in the git loader for a GitHub origin that has never been visited or has no valid snapshot / known refs, and avoid downloading a pack file if its size is greater than the maximum authorized one. It would save some bandwidth and enable workers to quickly skip the processing of these big git origins, as sketched below.
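A rough sketch of that early check, with hypothetical names (this is not the actual loader code or configuration):

# Sketch of the proposed early-exit check at the start of a git loader visit;
# should_skip_fetch and MAX_PACK_SIZE_BYTES are hypothetical names.

MAX_PACK_SIZE_BYTES = 4 * 1024 * 1024 * 1024  # current 4 GiB limit


def should_skip_fetch(origin_url, has_valid_snapshot, api_size_kib):
    # api_size_kib is the size reported by the GitHub API (in KiB).
    # The check only applies to origins never visited or without a valid
    # snapshot / known refs: incremental fetches produce much smaller packs
    # than the full repository size reported by the API.
    if has_valid_snapshot:
        return False
    if not origin_url.startswith("https://github.com/"):
        return False
    return api_size_kib * 1024 > MAX_PACK_SIZE_BYTES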
14:27 <+anlambert> I did some analysis about the large repositories currently rejected by the git loader in https://gitlab.softwareheritage.org/swh/devel/swh-loader-git/-/issues/3652#note_136398 for those interested
14:46 <+ardumont> nice ^ so, might be the lister gh could fetch some stats size about the origins (lifting the api call mentioned) and the scheduler could use that to route it to a queue for large workers to consume?
I'd say that's worth an MR in swh.core to actually enhance the GitHubSession class, so we can reuse it in other repositories.
We should not be calling API endpoints for individual repositories at listing time. We want the listing operation to be reasonably fast (the full GitHub listing, with current parallelism settings and only hitting the repository list endpoints, already takes a few days to a week).
I think it's fine to use the metadata that would be fetched by the metadata loader earlier in the git loading process, and to eagerly skip loading the repository if there's no recorded parent snapshot and the repository size is above the current worker's threshold.