# New model for the origin layer

The origin layer is the index in swh.provenance allowing to answer the question "in which origin can the revision R be found?" and eventually "what is the most probable origin O in which the revision R has been created?"

# Rationale

The current implementation of the origin layer is pretty basic and simple, but it performs poorly, both in terms of ingestion bandwidth and storage usage.

It consists in 2 tables:

- Revision in Origin (
`RiO`

) : a table with 2 columns (revision, origin). - Revision before Revision (
`RbR`

): a table with 2 columns (prev, next)

The table `RiO`

stores the head revisions found in a origin, ie. the revisions that have been identified as a branch in a snapshot of a visit of the given origin.

The table `RbR`

stores a "synthetic" vision of revision's topological graph; i.e. it stores (non recursively) couples of revision ids where the second column is the id of any revision in an origin, and the first column is the id of a head revision in that origin.

For example, a repo like:

with only one visit producing a single snapshot with `h1`

, `h2`

and `h3`

heads (branch/tags) would end up filling the tables described above like:

for the `RiO`

table :

| Orig | Rev | | O1 | r5 | | O1 | r4 | | O1 | r2 |

and for the `RbR`

table

| Rev (next) | Rev (prev) | | r2 | r1 | | r4 | r2 | | r4 | r1 | | r5 | r4 | | r5 | r3 | | r5 | r2 | | r5 | r1 |

This data model lacks storage efficiency (some informations on the topological structure of the git history are duplicated) and ingestion efficiency (each head is ingested one at a time, thus the whole git history is recomputed for each head).

## Proposal

We can make the required storage a bit more efficient and improve the efficiency of the ingestion algorithm using the following data model.

Use 3 tables:

- Head in Origin (
`HiO`

), same as the`RiO`

above, but we make it explicit that revisions stored in this table are actually heads, - Head before Head (
`HbH`

), similar to the`RbR`

above, but only for head revisions - Rev before Head (
`RbH`

) to store the location of all other (non head) revisions with regard to heads.

Using such a model, the example above would become:

`HiO`

:
| Orig | Head |
| O1 | r5 |
| O1 | r4 |
| O1 | r2 |

`HbH`

:
| Head (next) | Head (prev) |
| r5 | r4 |
| r5 | r2 |
| r4 | r2 |

`RbH`

:
| Head | Rev |
| r5 | r3 |
| r2 | r1 |

*Migrated from T4526 (view on Phabricator)*