New model for the origin layer
The origin layer is the index in swh.provenance allowing to answer the question "in which origin can the revision R be found?" and eventually "what is the most probable origin O in which the revision R has been created?"
Rationale
The current implementation of the origin layer is pretty basic and simple, but it performs poorly, both in terms of ingestion bandwidth and storage usage.
It consists in 2 tables:
- Revision in Origin (
RiO
) : a table with 2 columns (revision, origin). - Revision before Revision (
RbR
): a table with 2 columns (prev, next)
The table RiO
stores the head revisions found in a origin, ie. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.
The table RbR
stores a "synthetic" vision of revision's topological graph; i.e. it stores (non recursively) couples of revision ids where the second column is the id of any revision in an origin, and the first column is the id of a head revision in that origin.
For example, a repo like:
with only one visit producing a single snapshot with h1
, h2
and h3
heads (branch/tags) would end up filling the tables described above like:
for the RiO
table :
| Orig | Rev | | O1 | r5 | | O1 | r4 | | O1 | r2 |
and for the RbR
table
| Rev (next) | Rev (prev) | | r2 | r1 | | r4 | r2 | | r4 | r1 | | r5 | r4 | | r5 | r3 | | r5 | r2 | | r5 | r1 |
This data model lacks storage efficiency (some informations on the topological structure of the git history are duplicated) and ingestion efficiency (each head is ingested one at a time, thus the whole git history is recomputed for each head).
Proposal
We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.
Use 3 tables:
- Head in Origin (
HiO
), same as theRiO
above, but we make it explicit that revisions stored in this table are actually heads, - Head before Head (
HbH
), similar to theRbR
above, but only for head revisions - Rev before Head (
RbH
) to store the location of all other (non head) revisions with regard to heads.
Using such a model, the example above would become:
HiO
:
| Orig | Head |
| O1 | r5 |
| O1 | r4 |
| O1 | r2 |
HbH
:
| Head (next) | Head (prev) |
| r5 | r4 |
| r5 | r2 |
| r4 | r2 |
RbH
:
| Head | Rev |
| r5 | r3 |
| r2 | r1 |
Migrated from T4526 (view on Phabricator)