generalize usage of SWHID for referencing SWH archive objects

mentioned in merge request !258 (closed)

mentioned in merge request !259 (closed)

mentioned in merge request !260 (closed)

changed title from Generalise usage of SWHID for storing edges (relations) of the SWH archive graph to generalize usage of SWHID for referencing SWH archive objects

So, there are still a few separate issues in this task, which I'll try to spell out (at least for my own sake):

extending the current SWHID v1 spec for origins

In that regard, the cat is out of the bag already, and even if we try not to leak these to the public, in practice //swh.graph// and its interface with //swh.web// are enforcing a definition for "v1 SWHID of an Origin" already (using a fixed-size 20-byte sha1 of the origin url/iri encoded as UTF-8), so we should document them and make them official. Any change to this definition is, AFAICT from @zack's objections, a non-starter, as //swh.graph// absolutely needs a fixed-size identifier for all the nodes it has to process.

blessing these as SWHIDs opens up the somewhat tangential question of defining a normalization process for the origin URL/IRIs;
by that point we should probably also bless the RawExtrinsicMetadata intrinsic id as a "v1 SWHID" too, even if they're meant to be internal to SWH.
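To make the definition above concrete, here is a minimal sketch of the hashed origin identifier: the sha1 of the origin URL encoded as UTF-8, rendered in the swh:1:ori: form. The function name `origin_swhid` is hypothetical, and no normalization is applied, since that process is precisely what remains to be defined.

```python
import hashlib


def origin_swhid(url: str) -> str:
    """Illustrative only: hash an origin URL into a 'swh:1:ori:<hex>' string."""
    # sha1 always yields a fixed-size 20-byte digest, whatever the URL length.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return f"swh:1:ori:{digest.hex()}"


example = origin_swhid("https://example.org/repo.git")
```

Without a normalization step, two spellings of the same origin (say, with and without a trailing slash) hash to different identifiers, which is why the normalization question comes up as soon as these are blessed.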

Within the Python data model of a node of the graph, consistent typing of references to other nodes

The status quo of having
- Union[SWHID, OriginUrl] attributes on some objects, where we actually disable some of the features of the union members (SWHID qualifiers)
- (target_type: Enum, target: bytes) attribute pairs on some objects

is unfortunate.

We have a few proposals to move forward:

introducing a new explicit SwhRef type
- overlapping with the current SWHID type, without qualifiers, for objects with a SWHID
- supporting explicit origin url references (rather than using a hashed swh:1:ori: SWHID)
or using
- the (normalized) Origin url, inlined, if that's the only possible type of node referenced (e.g. in origin visits / origin visit statuses)
- core SWHIDs directly everywhere else, and fully blessing references using hashed swh:1:ori: SWHIDs

This second option is growing on me, because of the uniformity, and because we're avoiding the introduction of a new, very similar type. We may want to introduce a CoreSWHID type which disables qualifiers, but it would be a plain subset of SWHID, which doesn't have the funky smell of the current Union + attribute validator combo.
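As a rough illustration of what a qualifier-free type could look like, here is a minimal standalone sketch. It is not the //swh.model// implementation; the attribute names and the inclusion of "ori" among the accepted types are assumptions made for the sake of the example.

```python
import re
from dataclasses import dataclass

# Assumption: the five SWHID v1 object types, plus "ori" for hashed origins.
_CORE_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>cnt|dir|rev|rel|snp|ori):(?P<hex>[0-9a-f]{40})"
)


@dataclass(frozen=True)
class CoreSWHID:
    """Hypothetical core identifier: version, object type and 20-byte id only."""

    object_type: str
    object_id: bytes
    version: int = 1

    @classmethod
    def from_string(cls, s: str) -> "CoreSWHID":
        if ";" in s:
            raise ValueError("qualifiers are not allowed in a core SWHID")
        m = _CORE_RE.fullmatch(s)
        if m is None:
            raise ValueError(f"not a core SWHID: {s!r}")
        return cls(
            object_type=m["type"],
            object_id=bytes.fromhex(m["hex"]),
            version=int(m["version"]),
        )

    def __str__(self) -> str:
        return f"swh:{self.version}:{self.object_type}:{self.object_id.hex()}"
```

The point of the sketch is that rejecting qualifiers happens once, at construction time, instead of in a per-attribute validator on a Union.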

Within the storage backends (SQL / Cassandra), consistent storage of references to other nodes of the graph, origins included

For reference, the status quo is:
- some tables have a (target bytes, target_type enum) pair of columns, equivalent to the binary storage for a core SWHID v1
- some tables have a (target str, target_type enum) pair of columns, where the target is either the hexadecimal string representation or a SWHID v1, or an origin url
- some tables (directory entries) are split across the target type and only use the (target bytes) column to reference the other node

We would like any future storage of these columns to be:

consistent (i.e. use the same column type/set of columns to store references in all tables)
compact (i.e. use bytes for hashes)
future-proof (i.e. not having to redesign the database for the migration to SWHIDv2)

It would also make sense to gradually migrate the current storage to the new schema.

If we decide to only use core SWHIDs in the model layer, we can store them in a consistent composite (version short, type enum, id bytes) column

If we decide to introduce a new composite type in the model layer, we can probably use the same composite (version short, type enum, id bytes) column type, fudging the version/type items to unambiguously store references to origins using their full IRI, encoded as bytes
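To compare the two options, here is a minimal in-memory sketch of the (version, type, id) triple, written in Python rather than as a database schema. All names are hypothetical, and the use of version 0 as a sentinel for "id holds the raw IRI bytes" is purely an assumption illustrating the "fudging" mentioned above.

```python
from typing import NamedTuple


class RefColumn(NamedTuple):
    """Hypothetical mirror of a (version short, type enum, id bytes) column."""

    version: int
    type: str
    id: bytes


def from_core_swhid(swhid: str) -> RefColumn:
    """Split a qualifier-free SWHID string into the three column items."""
    scheme, version, object_type, hex_id = swhid.split(":")
    if scheme != "swh":
        raise ValueError(f"not a SWHID: {swhid!r}")
    return RefColumn(int(version), object_type, bytes.fromhex(hex_id))


def to_core_swhid(row: RefColumn) -> str:
    """Rebuild the SWHID string from a row that holds a hashed identifier."""
    return f"swh:{row.version}:{row.type}:{row.id.hex()}"


def from_origin_iri(iri: str) -> RefColumn:
    """Second option: store the full IRI as bytes.

    Assumption: version 0 is a made-up marker meaning "not a hashed id".
    """
    return RefColumn(0, "ori", iri.encode("utf-8"))
```

In the first option every row has a 20-byte id; in the second, rows flagged with the sentinel carry a variable-length id, which is the trade-off to weigh against compactness.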

Bonus track : serialization of references to other nodes in the Journal

I guess that's what @douardda has in mind when he says that resolving the current discussion is necessary for the archival of SWH to Vitam.

As far as I can tell, this issue is more intricately linked to the decisions we're taking in the Python data model, than to the storage backend considerations.

Once we have the updated attributes in //swh.model//, we will want to rewrite the contents of the journal to use the new schema, but we will likely have written a conversion layer to support the deserialization of the old entries stored in //swh.journal//.

In either case, we'll probably want to use the consistency-improved version of the swh.model objects before serializing them for long term archival in Vitam.

Personal conclusion

After rewriting all of this, I have a much better sense of how all these decisions are tangled with one another.

My preference would be:

bless the hashed-url version of SWHID v1 of origins (and of the RawExtrinsicMetadata object id while we're at it)
implement a CoreSWHID type in //swh.model//, which would be the current SWHID, sans qualifiers
define the //swh.journal// serialization of CoreSWHID
gradually migrate all //swh.model// (target, target type) attribute pairs to CoreSWHIDs.

Now:
- RawExtrinsicMetadata is the obvious first target
- ExtId too
Soon:
- Release.(target, target_type)
- SnapshotBranch.(target, target_type) (plausibly introducing a new type for branch aliases?)
Eventually:
- Revision.directory
- Directory.entries

Gradually migrate storage of CoreSWHID attributes to a new composite column type

We already have a storage -> //swh.model// mapping layer for all backends, so the migrations in //swh.storage// and //swh.model// don't have to be entangled. We can convert all existing data stored in //swh.storage// to generate CoreSWHID objects from the current target/target_type columns and vice versa.
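The conversion in both directions is mechanical. A minimal sketch, assuming the legacy target_type values are the usual object type names and that targets are 20-byte hashes; the function names are hypothetical and not part of //swh.storage//:

```python
from typing import Tuple

# Legacy target_type names mapped to SWHID v1 object type codes.
_TYPE_CODES = {
    "content": "cnt",
    "directory": "dir",
    "revision": "rev",
    "release": "rel",
    "snapshot": "snp",
}
_TYPE_NAMES = {code: name for name, code in _TYPE_CODES.items()}


def pair_to_swhid(target: bytes, target_type: str) -> str:
    """Build a core SWHID string from a legacy (target, target_type) pair."""
    if target_type not in _TYPE_CODES:
        raise ValueError(f"no SWHID type for target_type {target_type!r}")
    if len(target) != 20:
        raise ValueError("expected a 20-byte hash")
    return f"swh:1:{_TYPE_CODES[target_type]}:{target.hex()}"


def swhid_to_pair(swhid: str) -> Tuple[bytes, str]:
    """Recover the legacy (target, target_type) pair from a core SWHID string."""
    scheme, version, code, hex_id = swhid.split(":")
    if scheme != "swh" or version != "1" or code not in _TYPE_NAMES:
        raise ValueError(f"unsupported identifier: {swhid!r}")
    return bytes.fromhex(hex_id), _TYPE_NAMES[code]
```

Note that a snapshot branch alias has no place in this mapping (its target is a branch name, not a hash), so the sketch rejects it; that is the case a separate type for aliases would have to cover.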

changed the description

(I've finally caught up with the backlog in this task, sorry I'm late to the party.)

In short:

==== Personal conclusion ====

After rewriting all of this, I have a much better sense of how all these decisions are tangled with one another.

My preference would be:

bless the hashed-url version of SWHID v1 of origins (and of the RawExtrinsicMetadata object id while we're at it)

implement a CoreSWHID type in //swh.model//, which would be the current SWHID, sans qualifiers

define the //swh.journal// serialization of CoreSWHID

gradually migrate all //swh.model// (target, target type) attribute pairs to CoreSWHIDs.

Gradually migrate storage of CoreSWHID attributes to a new composite column type

This plan is also my preference, and the course of action also LGTM.

Two minor caveats:

I'm not so sure about the idea of pseudo-SWHIDs for RawExtrinsicMetadata (REM), but maybe it's because I'm missing/not understanding a few details. In particular, it is one thing to need a stable serialization format and an intrinsic ID on them, another to need a SWHID to reference them. Who needs to reference REMs in the first place?
for the storage migration, this seems profound enough that, once ready, it will warrant a "stop the world" → "migrate everything" → restart approach? But maybe that's what you have in mind and the gradual part only applies to the code (which of course will be gradual anyway)

! In #3034, @zack wrote: Who needs to reference REMs in the first place?

I'll let @vlorentz give a better answer, but I think this is needed at least for the metadata-only deposit feature.

! In #3034, @zack wrote: Who needs to reference REMs in the first place?

Option 3 in swh-deposit#2779 (closed)

mentioned in commit 690b7f82

mentioned in commit eba8d84d

mentioned in commit d4b20dcd

marked this issue as related to #3074 (closed)

mentioned in merge request !330 (merged)
