generalize usage of SWHID for referencing SWH archive objects
//Note: This is a partial copy/summary of a discussion on the devel mailing-list.//
TL;DR: we may want to generalize the idea of internally referencing any kind of object in the SWH archive in a uniform and consistent way, especially Origin objects.
Current situation
We currently have a SWHID object type defined in the SWH data model. It defines a SWHID as an entity with the following attributes:
class SWHID:
    namespace
    scheme_version
    object_type
    object_id
    metadata
(metadata being a dict-like structure to store the qualifiers part of a SWHID).
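To make the mapping between these attributes and the textual form of a SWHID concrete, here is a minimal sketch. It is not the actual swh.model implementation: the `SketchSWHID` class and the `parse` helper are hypothetical names, and no validation is performed. Only the attribute names come from the description above; the `swh:1:<type>:<hex id>[;qualifier=value]` layout is assumed from the SWHID scheme.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SketchSWHID:
    """Hypothetical stand-in mirroring the attributes listed above."""

    namespace: str
    scheme_version: int
    object_type: str  # e.g. "rev", "dir", "cnt"
    object_id: str  # hex-encoded hash
    metadata: Dict[str, str] = field(default_factory=dict)  # qualifiers

    def __str__(self) -> str:
        core = ":".join(
            [self.namespace, str(self.scheme_version), self.object_type, self.object_id]
        )
        qualifiers = "".join(f";{key}={value}" for key, value in self.metadata.items())
        return core + qualifiers


def parse(text: str) -> SketchSWHID:
    """Split a SWHID string into its core part and its qualifiers (no validation)."""
    core, *raw_qualifiers = text.split(";")
    namespace, version, object_type, object_id = core.split(":")
    metadata = dict(qualifier.split("=", 1) for qualifier in raw_qualifiers)
    return SketchSWHID(namespace, int(version), object_type, object_id, metadata)
```

Parsing a string and converting the result back with `str()` should give the original string, which is a quick way to check that the qualifiers end up in `metadata`.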
This SWHID entity type is currently used in the data model only by the RawExtrinsicMetadata object:
class RawExtrinsicMetadata(BaseModel):
    # target object
    type = Enum
    target = Union[str, SWHID]
    """URL if type=MetadataTargetType.ORIGIN, else core SWHID"""
    [...]
    # context
    origin = Optional[str]
    visit = Optional[int]
    snapshot = Optional[SWHID]
    release = Optional[SWHID]
    revision = Optional[SWHID]
    path = Optional[bytes]
    directory = Optional[SWHID]
References to other "core" SWH entity types are also found in the Release object (in the target_type/target pair of attributes) and, to some extent, in the Snapshot object via the target_type/target attributes of the SnapshotBranch object. The latter, however, extends this notion because of the ALIAS target type for branches.
This notion of reference is also intrinsically present in all other relations of the Merkle DAG, but since the target type is fixed there, there is no need to store a target_type.
The problem
Storing relations to SWH objects
The lack of a SWHID for some internal objects, especially Origin, makes it necessary to use the kind of workaround used for RawExtrinsicMetadata.target (aka "URL if type=MetadataTargetType.ORIGIN, else core SWHID"), which is not very satisfying for several reasons:
- it comes with many if/else snippets in the code as soon as one needs to deal with a RawExtrinsicMetadata object,
- it requires storing SWHIDs as strings, which is not very efficient (double the space because the hash is represented as hex, and filtering on the target's type is impractical),
- it forces us to have a column that discriminates Origin targets but is meaningless for all other target types,
- (opinionated argument) it's overall quite inelegant.
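The first two points can be illustrated with a short sketch. The `describe_target` helper and the sample values are made up for illustration; the 20-byte figure assumes a SHA-1 identifier, as used by core SWHIDs v1.

```python
def describe_target(target_type: str, target: str) -> str:
    """Hypothetical consumer: every caller has to branch on the target type."""
    if target_type == "origin":
        # target is a bare URL
        return f"origin at {target}"
    else:
        # target is a core SWHID string such as swh:1:rev:<hex>
        _, _, object_type, hex_id = target.split(":")
        return f"{object_type} object {hex_id[:8]}"


hex_id = "94a9ed024d3859793618152ea559a168bbcbb5e2"  # arbitrary sample hash
as_text = f"swh:1:rev:{hex_id}"  # string representation of the SWHID
as_bytes = bytes.fromhex(hex_id)  # the raw hash

print(len(as_text))   # 50 characters
print(len(as_bytes))  # 20 bytes
```

The hash alone takes 40 characters as hex against 20 bytes in binary form, which is where the "double the space" remark comes from; the textual prefix adds to that.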
This situation currently concerns only the RawExtrinsicMetadata object model, but it might arise in more cases in the future (see for example the support for ExtID under development as !148, where it has been suggested to use a SWHID instead of a (target_type/target) pair).
Also note that the whole "context" part of RawExtrinsicMetadata (the origin, visit, snapshot, release, revision, path and directory attributes) can be seen as an extended way of storing the target SWHID.
Storing SWHID
As already stated above, a related topic is how we store references to other SWH objects, and especially SWHIDs, in the backend database. The current solution (storing the SWHID string representation for SWHIDs, ad hoc multiple columns otherwise) is not ideal, for the reasons listed above.
Possible solutions
The idea is then to normalize (in the sense of "improve the consistency of") the SWH data model as much as possible by using SWHIDs everywhere it makes sense from a modeling standpoint, and possibly to improve the way we store SWHID objects.
To do so, possible tasks would be:
- Extend the current SWHID to Origin, using one of:
  - the hash of the origin URL as identifier: this keeps the fixed-size ids property and is already used in some parts of swh; it would require storing origin hashes in a (computed) column of the origin table (in pg) and getting rid of the "index-on-sha1-of-column" hack.
  - the "hexlified" URL as identifier: this keeps the "resolvable origins" property. It is probably unacceptable since it would break existing swh:1:ori SWHIDs, even if these are not to be used outside swh-graph, and breaking the fixed-size id of the SWHID itself would require a SWHIDv2 spec.
- or reify the notion of a relation to a SWH object in the archive using a new dedicated object type (e.g. SWHRef) that consists mainly of a triplet (version, type, id); for origins, the same choice has to be made for the computation of the id. But since there is no backward-compatibility issue there, it is perfectly acceptable to use the URL itself as the identifier part of a SWHRef object.
As @olasd wrote:
[using a new SWHRef object] we can have:
- origins stored as (0, , <origin_url encoded as utf-8>)
- core SWHIDs v1 stored as (1, , <byte array for swhid "id">)
This also allows us to store an edge to a "SWHID v1 for the hash of an origin", without any ambiguity. And this allows us to decode the origin urls without going through another table.
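The two ways of identifying an origin can be contrasted in a short sketch. Everything here is an assumption made for illustration: `SWHRef` is not an existing swh.model class, the helper names are hypothetical, the string used for the type element is arbitrary (its actual encoding is not settled in this discussion), and the origin hash is assumed to be the SHA-1 of the UTF-8 encoded URL.

```python
import hashlib
from typing import NamedTuple


class SWHRef(NamedTuple):
    """Hypothetical (version, type, id) triplet."""

    version: int  # 0: origin URL, 1: core SWHID v1 (as in the quote above)
    object_type: str  # illustrative encoding only
    object_id: bytes


def origin_swhid(url: str) -> str:
    """Hash-based option: fixed-size id, but the URL cannot be read back from it."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"swh:1:ori:{digest}"


def ref_from_origin(url: str) -> SWHRef:
    """URL-based option: the id is the URL itself, encoded as UTF-8."""
    return SWHRef(0, "ori", url.encode("utf-8"))


def ref_from_core_swhid(swhid: str) -> SWHRef:
    namespace, version, object_type, hex_id = swhid.split(":")
    if namespace != "swh" or version != "1":
        raise ValueError(f"not a core SWHID v1: {swhid}")
    return SWHRef(1, object_type, bytes.fromhex(hex_id))


def origin_url(ref: SWHRef) -> str:
    """Decode the URL directly from the reference, without another table."""
    if ref.version != 0:
        raise ValueError("not an origin reference")
    return ref.object_id.decode("utf-8")
```

With this layout, `ref_from_origin(url)` and `ref_from_core_swhid(origin_swhid(url))` are distinct values (version 0 versus version 1), so an edge to the origin URL and an edge to the hash-based origin SWHID can coexist without ambiguity.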
Below, SWHRef refers to whichever solution is chosen above (either a new SWHRef object or an "extended" SWHID one).
Then:
- Refactor the RawExtrinsicMetadata to use only a SWHRef target attribute (get rid of the type)
- Refactor the RawExtrinsicMetadata to use a SWHRef origin attribute (for consistency)
- Use a composite custom type to store SWHRef in PostgreSQL and the equivalent for Cassandra
- Refactor the Release object to use a SWHRef as target
- Refactor the SnapshotBranch to use a SWHRef as target; this would require having the alias in a dedicated attribute of the SnapshotBranch object, or introducing a dedicated SnapshotAliasBranch (or similar) object type for aliases, depending on the choice made for the SWHRef model object.
- Refactor all the Merkle-DAG-related objects to use SWHRef as "references".
//Note: Obviously the last 3 points above are rather radical and imply, if applied as is, a major migration of the database; they are listed here mostly for the sake of completeness.//
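As a rough idea of what the first two refactoring points could look like, here is a sketch under stated assumptions. `RawExtrinsicMetadataSketch` is not the swh.model class and carries only an illustrative subset of fields; the `SWHRef` tuple alias and the convention that version 0 marks an origin are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical (version, type, id) triplet standing in for SWHRef.
SWHRef = Tuple[int, str, bytes]


@dataclass(frozen=True)
class RawExtrinsicMetadataSketch:
    """Illustrative subset of fields only."""

    target: SWHRef  # replaces the former (type, target) pair
    origin: Optional[SWHRef] = None  # context, now the same kind of value


def targets_origin(metadata: RawExtrinsicMetadataSketch) -> bool:
    # Assumption: version 0 marks an origin reference.
    version, _object_type, _object_id = metadata.target
    return version == 0
```

The separate type attribute disappears because the reference itself carries enough information to tell an origin apart from a core object.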
Migrated from T3034 (view on Phabricator)