git loader: enable "partial" global deduplication of revisions via the extid mapping table
We discussed with @olasd the possibility of leveraging the extid mapping table to make the git loader do even less work. That would definitely help us compute fewer ids, especially when ingesting forks.
Currently, without this ^, we only reuse the references of the origin's previous snapshot to avoid ingesting known references again. So when we ingest another fork of something already known, we actually do the work again.
All in all, our deduplication exists at the storage level (db or objstorage) but not completely at the computation level.
For this to happen, we need to start using the extid table to map the snapshot branch references to their corresponding sha1_git ids.
So the git loader adaptations would be:
- at the end of the loading, after the snapshot creation of the visit, record the mappings. At this point, we know we don't have dangling references, so it's fine to rely on them to skip some work later:
  - retrieve the branch references of that snapshot (filtering out aliases) and, for each reference, push the mapping (version 0, sha1-git of the commit/tag, revision/release id) into the extid table
- at the beginning of the loading, adapt the reading of unknown references (for the origin) to filter out references that are already known through the extid table. If they are present, we know them and can skip the work. Nonetheless, those references should still end up in the final snapshot.
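To make the two steps above concrete, here is a minimal sketch of the intended flow. It is not the loader's code: `ExtIDTable`, `Branch`, `record_snapshot_refs` and `split_refs` are hypothetical names, and the table is a plain in-memory dict standing in for the real extid table. The only things taken from the description are the version 0 mapping, the alias filtering, and the fact that known references must still be kept for the final snapshot.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

EXTID_VERSION = 0  # "version 0" from the description above


@dataclass(frozen=True)
class Branch:
    """Hypothetical, simplified snapshot branch."""
    target_type: str  # "revision", "release" or "alias"
    target: bytes     # object id, or the aliased branch name for aliases


class ExtIDTable:
    """Hypothetical in-memory stand-in for the extid mapping table."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[int, bytes], Tuple[str, bytes]] = {}

    def add(self, version: int, sha1_git: bytes,
            target_type: str, target: bytes) -> None:
        self._rows[(version, sha1_git)] = (target_type, target)

    def get(self, version: int, sha1_git: bytes) -> Optional[Tuple[str, bytes]]:
        return self._rows.get((version, sha1_git))


def record_snapshot_refs(table: ExtIDTable,
                         remote_refs: Dict[bytes, bytes],
                         branches: Dict[bytes, Branch]) -> None:
    """End of loading: push one mapping per non-alias snapshot branch."""
    for name, branch in branches.items():
        if branch.target_type == "alias":
            continue  # aliases point at other branches, not at objects
        sha1_git = remote_refs.get(name)
        if sha1_git is not None:
            table.add(EXTID_VERSION, sha1_git, branch.target_type, branch.target)


def split_refs(table: ExtIDTable, remote_refs: Dict[bytes, bytes]):
    """Start of loading: separate already-known refs from those left to ingest.

    Known refs are returned with their recorded target so that they can
    still be written into the final snapshot.
    """
    known: Dict[bytes, Tuple[str, bytes]] = {}
    unknown: Dict[bytes, bytes] = {}
    for name, sha1_git in remote_refs.items():
        hit = table.get(EXTID_VERSION, sha1_git)
        if hit is None:
            unknown[name] = sha1_git
        else:
            known[name] = hit
    return known, unknown
```

With this shape, loading a fork would call `split_refs` on the refs advertised by the remote, ingest only the `unknown` ones, and build the snapshot from both `known` and the newly loaded references.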
That's actually what's been done recently with the mercurial loader.
Migrated from T3635 (view on Phabricator)