Issue created Sep 09, 2022 by David Douard (@douardda, Maintainer)

New model for the origin layer

The origin layer is the index in swh.provenance that answers the question "in which origins can the revision R be found?" and, ultimately, "what is the most probable origin O in which the revision R was created?"

Rationale

The current implementation of the origin layer is simple, but it performs poorly in terms of both ingestion throughput and storage usage.

It consists of two tables:

  • Revision in Origin (RiO): a table with 2 columns (revision, origin).
  • Revision before Revision (RbR): a table with 2 columns (prev, next).

The RiO table stores the head revisions found in an origin, i.e. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.

The RbR table stores a "synthetic" view of the revisions' topological graph: it stores (non-recursively) pairs of revision ids in which next is the id of a head revision in an origin and prev is the id of any revision that precedes this head in the history of that origin.

For example, a repo like:

[Figure: git_origin_layer.svg]

with only one visit producing a single snapshot with h1, h2 and h3 heads (branches/tags) would fill the tables described above as follows.

For the RiO table:

| Orig | Rev |
|------|-----|
| O1   | r5  |
| O1   | r4  |
| O1   | r2  |

And for the RbR table:

| Rev (next) | Rev (prev) |
|------------|------------|
| r2         | r1         |
| r4         | r2         |
| r4         | r1         |
| r5         | r4         |
| r5         | r3         |
| r5         | r2         |
| r5         | r1         |
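To make the current model concrete, here is a minimal Python sketch that fills both tables from a small history. The parent links in `PARENTS` are an assumption: the figure is not reproduced here, so the graph was chosen only to be compatible with the rows listed above. The helper names (`ancestors`, `build_current_model`) are hypothetical and are not part of swh.provenance.

```python
# Assumed history (hypothetical, chosen to match the example tables):
# each revision maps to the list of its parents.
PARENTS = {
    "r1": [],
    "r2": ["r1"],
    "r3": ["r2"],
    "r4": ["r2"],
    "r5": ["r3", "r4"],
}
# Heads found in the single snapshot of origin O1.
HEADS = {"O1": ["r5", "r4", "r2"]}


def ancestors(rev, parents):
    """Return every revision reachable from `rev` through parent links."""
    seen = set()
    stack = list(parents[rev])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def build_current_model(heads_by_origin, parents):
    """Fill RiO as (origin, head) rows and RbR as (next, prev) rows."""
    rio, rbr = set(), set()
    for origin, heads in heads_by_origin.items():
        for head in heads:
            rio.add((origin, head))
            # The full history is walked again for every head.
            for prev in ancestors(head, parents):
                rbr.add((head, prev))
    return rio, rbr


rio, rbr = build_current_model(HEADS, PARENTS)
print(sorted(rio))
print(sorted(rbr))
```

Under this assumed history, the history shared by r5, r4 and r2 is traversed once per head, and r1 ends up in three RbR rows.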

This data model lacks storage efficiency (some information about the topological structure of the git history is duplicated) and ingestion efficiency (each head is ingested one at a time, so the whole git history is recomputed for each head).

Proposal

We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.

Use 3 tables:

  • Head in Origin (HiO): same as RiO above, but making it explicit that the revisions stored in this table are heads.
  • Head before Head (HbH): similar to RbR above, but only for head revisions.
  • Rev before Head (RbH): stores the location of all other (non-head) revisions relative to heads.

Using such a model, the example above would become:

HiO:

| Orig | Head |
|------|------|
| O1   | r5   |
| O1   | r4   |
| O1   | r2   |

HbH:

| Head (next) | Head (prev) |
|-------------|-------------|
| r5          | r4          |
| r5          | r2          |
| r4          | r2          |

RbH:

| Head | Rev |
|------|-----|
| r5   | r3  |
| r2   | r1  |
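The following Python sketch shows one way the three tables could be filled and then queried. It is an illustration under stated assumptions, not the swh.provenance implementation: the parent links in `PARENTS` are hypothetical (the figure is not reproduced here) and the function names are invented for this example. The issue does not say whether HbH should hold only the nearest heads or every head ancestor; this sketch stops each walk at the first head it meets, which gives the rows of the example for the assumed history.

```python
# Assumed history (hypothetical, chosen to match the example tables):
# each revision maps to the list of its parents.
PARENTS = {
    "r1": [],
    "r2": ["r1"],
    "r3": ["r2"],
    "r4": ["r2"],
    "r5": ["r3", "r4"],
}
HEADS = {"O1": ["r5", "r4", "r2"]}


def build_proposed_model(heads_by_origin, parents):
    """Fill HiO (origin, head), HbH (next, prev) and RbH (head, rev)."""
    hio, hbh, rbh = set(), set(), set()
    all_heads = {h for heads in heads_by_origin.values() for h in heads}
    for origin, heads in heads_by_origin.items():
        for head in heads:
            hio.add((origin, head))
    for head in all_heads:
        seen = set()
        stack = list(parents[head])
        while stack:
            rev = stack.pop()
            if rev in seen:
                continue
            seen.add(rev)
            if rev in all_heads:
                # Another head: record it and do not walk past it.
                hbh.add((head, rev))
            else:
                rbh.add((head, rev))
                stack.extend(parents[rev])
    return hio, hbh, rbh


def origins_of(rev, hio, hbh, rbh):
    """Answer "in which origins can the revision `rev` be found?"."""
    known_heads = {head for _, head in hio}
    # Start from the revision itself if it is a head, else from the heads after it.
    heads = {rev} if rev in known_heads else {h for h, r in rbh if r == rev}
    # Follow HbH upwards to collect every head that comes after those.
    frontier = set(heads)
    while frontier:
        frontier = {nxt for nxt, prev in hbh if prev in frontier} - heads
        heads |= frontier
    return {origin for origin, head in hio if head in heads}


hio, hbh, rbh = build_proposed_model(HEADS, PARENTS)
print(sorted(hbh), sorted(rbh))
print(origins_of("r1", hio, hbh, rbh))
```

With this assumed history, HbH and RbH together hold 5 rows, against 7 rows in RbR, and each non-head revision is visited from a single head.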


Migrated from T4526 (view on Phabricator)

Edited Jan 07, 2023 by Phabricator Migration user