Azure prototype: Content provenance information API

Ack on all the above. Just one clarification on the revisions→origin mapping.

At the first visit of a (new) origin, all the revisions we find will be marked as having been seen at the time of that first visit. At subsequent visits (and in 99.99% of cases) we will see new revisions as well as revisions that we have already seen in the past. There are at least two ways to go about what we store in the revisions→origin mapping for subsequent visits:

1. transitive closure: at each visit, we store in the mapping //all// revisions that are reachable starting from repository roots at the time of visit

2. new revisions only: at each visit, we store in the mapping only the new revisions that we haven't seen in the past. Two variants of this are possible:

   A. global cache: we consider "revisions we haven't seen" to be revisions not seen anywhere in the Software Heritage archive

   B. local cache: we consider "revisions we haven't seen" to be revisions not seen in the past //for a specific origin// (the one being visited)
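
To make the difference between the three options concrete, here is a minimal sketch on a toy model. It is not the actual loader or storage code: the `Repo` type, the function names and the strategy labels are all made up for illustration, and "repository roots" is assumed to mean the starting points of the traversal (for example branch tips).

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy model: maps each revision id to the ids of its parents.
Repo = Dict[str, List[str]]


def reachable(repo: Repo, roots: Iterable[str]) -> Set[str]:
    """Return every revision reachable from `roots` by following parent links."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(repo.get(rev, []))
    return seen


def revisions_to_store(
    strategy: str,
    visit_revs: Set[str],
    seen_in_archive: Set[str],
    seen_for_origin: Set[str],
) -> Set[str]:
    """Pick what goes into the revisions→origin mapping for one visit."""
    if strategy == "transitive_closure":  # option 1
        return set(visit_revs)
    if strategy == "new_global":  # option 2.A
        return visit_revs - seen_in_archive
    if strategy == "new_local":  # option 2.B
        return visit_revs - seen_for_origin
    raise ValueError(f"unknown strategy: {strategy}")


# Linear history r1 <- r2 <- r3 (illustrative ids, not real hashes).
repo: Repo = {"r1": [], "r2": ["r1"], "r3": ["r2"]}

# The first visit of the origin saw r1 and r2; r3 was pushed afterwards.
seen_for_origin = reachable(repo, ["r2"])
# Assume r3 is already in the archive because some other origin (a fork) had it.
seen_in_archive = seen_for_origin | {"r3"}

second_visit = reachable(repo, ["r3"])
closure = revisions_to_store("transitive_closure", second_visit, seen_in_archive, seen_for_origin)
new_global = revisions_to_store("new_global", second_visit, seen_in_archive, seen_for_origin)
new_local = revisions_to_store("new_local", second_visit, seen_in_archive, seen_for_origin)
```

In this toy scenario the transitive closure stores all three revisions again, the global cache stores nothing for this origin (r3 is already known elsewhere), and the local cache stores only r3.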

As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).

The rationale is twofold:

- there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)
- at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only //up to what the VCS guarantees//, e.g., up to SHA1 collisions for git and up to repository tampering for svn

> In #547 (closed), @zack wrote:
>
> B. local cache: we consider "revisions we haven't seen" to be revisions not seen in the past //for a specific origin// (the one being visited)
>
> As per yesterday's F2F discussion, we are going to experiment (first) with 2.B (new revisions only with local cache).
>
> The rationale is twofold:
>
> - there is no loss of information with it (if we want, we can always further "unroll" transitive revisions later)

In a second step, we could also look into storing ranges of visits, which would reduce duplication a lot.

> - at subsequent visits our loaders (both git and svn) only process new revisions anyhow, so we can't really promise anything about past revisions. Those revisions will be exactly the same as before only //up to what the VCS guarantees//, e.g., up to SHA1 collisions for git and up to repository tampering for svn

... although this would still hold, so the first visit would be the only one where we have really seen a new revision...

marked this issue as related to swh/devel/swh-storage#550 (closed)

marked this issue as related to #551 (closed)

marked this issue as related to swh/devel/swh-web#553 (closed)

marked this issue as related to swh/devel/swh-storage#554 (closed)

marked this issue as related to swh/devel/swh-storage#598 (closed)

changed title from Prototype: Content provenance information API to Azure prototype: Content provenance information API

added priority:Low label and removed priority:High label

we're taking a different route for this now, based on @grouss WIP

assigned to @zack

added state:wontfix label

closed
