Platform / Development / swh-loader-git / Merge requests / !136

git: Load git repository through multiple packfiles fetch operations

Closed. Antoine R. Dumont requested to merge generated-differential-D6386-source into master on Oct 01, 2021.

This introduces a way to configure the packfile fetching policy. The default, as before, is to fetch a single packfile and ingest everything unknown out of it. When fetch_multiple_packfiles is True (and the ingestion goes through the 'smart' protocol), the ingestion fetches multiple packfiles, each covering a given number of heads (number_of_heads_per_packfile). After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in gathering the refs seen so far) snapshot is created.
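As a rough illustration of the policy described above, here is a minimal sketch of how remote heads could be split into groups, one group per packfile fetch. Only the name number_of_heads_per_packfile comes from this merge request; the batch_heads helper and the shape of remote_refs (ref name to target id) are assumptions made for the example, not the loader's actual code.

```python
from typing import Dict, Iterator, List


def batch_heads(
    remote_refs: Dict[bytes, bytes], number_of_heads_per_packfile: int
) -> Iterator[List[bytes]]:
    """Yield groups of unique target ids, one group per packfile fetch.

    Hypothetical helper: remote_refs maps a ref name to its target id.
    """
    if number_of_heads_per_packfile <= 0:
        raise ValueError("number_of_heads_per_packfile must be positive")
    # Several refs may point to the same target; keep the first occurrence only.
    heads = list(dict.fromkeys(remote_refs.values()))
    for start in range(0, len(heads), number_of_heads_per_packfile):
        yield heads[start : start + number_of_heads_per_packfile]
```

With the default policy (a single packfile), the equivalent would be one group containing every head.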

Even if the new fetching policy were activated, this should not impact how small to medium repositories are ingested.

The end goal is to reduce the risk of failure while loading large repositories (with large packfiles) and to allow the next loading to pick up where the last failed loading stopped.

It's not perfect yet because it also depends on the connectivity of the repository's git graph (for example, if the first 200 references happen to be fully connected, then we will retrieve everything in one round anyway).

Implementation-wise, this adapts the current graph walker (the one resolving the missing local references from the remote references) so it won't walk over already fetched references when multiple iterations are needed.
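The idea of not walking over already fetched references can be sketched as a small tracker that filters each round's wanted heads. This is a simplified stand-in, not the graph walker from this merge request: the FetchedRefsTracker class and its method names are hypothetical.

```python
from typing import Iterable, List, Set


class FetchedRefsTracker:
    """Hypothetical helper remembering which heads need no further fetching."""

    def __init__(self, known_locally: Iterable[bytes] = ()) -> None:
        # Heads already present in the archive before this visit.
        self._known: Set[bytes] = set(known_locally)
        # Heads retrieved by earlier packfile fetches of this visit.
        self._fetched: Set[bytes] = set()

    def wants(self, batch: Iterable[bytes]) -> List[bytes]:
        """Return the heads of this batch that still have to be requested."""
        skip = self._known | self._fetched
        return [head for head in dict.fromkeys(batch) if head not in skip]

    def mark_fetched(self, heads: Iterable[bytes]) -> None:
        """Record heads once their packfile has been loaded successfully."""
        self._fetched.update(heads)
```

Calling mark_fetched only after a packfile is loaded keeps a failed round from hiding heads that were never actually retrieved.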

This also makes the git loader explicitly create partial visits when fetching packfiles. That is, the loader now creates a partial visit with a snapshot after each packfile consumed. The end goal is to decrease the work the loader would have to redo if the initial visit did not complete for some reason.
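To make the partial-visit behaviour concrete, the following sketch records a growing snapshot after each packfile round. Everything here is illustrative: load_in_rounds and the fetch_packfile callback are hypothetical, and marking the last round as "full" is an assumption about how a completed visit would be reported.

```python
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[bytes, bytes]  # ref name -> target id


def load_in_rounds(
    remote_refs: Dict[bytes, bytes],
    heads_per_packfile: int,
    fetch_packfile: Callable[[List[bytes]], None],
) -> List[Tuple[str, Snapshot]]:
    """Return the (status, snapshot) pairs recorded during one visit."""
    if heads_per_packfile <= 0:
        raise ValueError("heads_per_packfile must be positive")
    ref_names = list(remote_refs)
    visits: List[Tuple[str, Snapshot]] = []
    seen: Snapshot = {}
    for start in range(0, len(ref_names), heads_per_packfile):
        batch = ref_names[start : start + heads_per_packfile]
        try:
            fetch_packfile([remote_refs[name] for name in batch])
        except Exception:
            # The visit stops here; snapshots recorded so far are kept.
            break
        # Incremental: the snapshot gathers all refs seen so far.
        seen.update({name: remote_refs[name] for name in batch})
        is_last = start + heads_per_packfile >= len(ref_names)
        # Assumption: only the last successful round yields a "full" visit.
        visits.append(("full" if is_last else "partial", dict(seen)))
    return visits
```

If a round fails, the last recorded partial snapshot still lists the refs loaded so far, which is what would let a later visit skip that work.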

Related to #3625 (closed)

Test Plan

  • tox: fails until swh.loader.core is released with swh-loader-core!417 (merged)

  • pytest (happy)

  • docker-compose (happy): checks that we ingest with the same snapshot. The memory usage is consistently smaller than with the existing master code.

Ingestion of large repositories is ongoing.


Migrated from D6386 (view on Phabricator)
