Define and specify extrinsic origin metadata

marked this issue as related to #1344 (closed)

marked this issue as related to #1737 (closed)

added Metadata workflow priority:Normal labels

changed title from Define and specify origin extrinsic metadata to Define and specify extrinsic origin metadata

After the discussion on extrinsic_metadata this morning with @zack, @vlorentz, @olasd and @douardda, here is a quick recap of the discussion, feel free to add, comment and rewrite...

the notions of attached extrinsic metadata and independent extrinsic metadata were introduced to aid understanding; they do not need to be implemented
the term metadata provider was discussed and shouldn't be used; we will continue with authority
the distinction between the authority providing the metadata and the tool making it possible should be documented and reflected in the specs and implementation
two options on how extrinsic metadata can be fetched and kept:

crawling a source (authority) and keeping all metadata with its source

| metadata_url | metadata_loader |
| github.com/foo/bar | github_metadata_loader |
| gitlab.inria.org/foo/bar | gitlab_metadata_loader |

fetching metadata that describes a code repository (potentially in our archive)

| origin_url | authority | tool | time-stamp | raw_metadata |
| github.com/foo/bar | github.com | github_loader | ts | rm |
| github.com/foo/bar | wikidata | wikidata_gatherer | ts | rm |
| github.com/foo/bar | fsf.org | fsf_gatherer | ts | rm |
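To make the second option concrete, here is a minimal Python sketch of one row of the five-column table. The class name `OriginMetadataRecord`, the helper `authorities_for`, and the placeholder values are assumptions made for illustration; this is not the swh-storage schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OriginMetadataRecord:
    """Hypothetical row of the five-column table (illustration only)."""

    origin_url: str        # the code repository being described
    authority: str         # who states the metadata (e.g. a code host)
    tool: str              # what fetched it on our side
    timestamp: datetime    # when it was fetched
    raw_metadata: bytes    # kept untouched, before any translation


def authorities_for(records, origin_url):
    """Return the sorted, de-duplicated authorities describing an origin."""
    return sorted({r.authority for r in records if r.origin_url == origin_url})


# Placeholder values mirroring the three example rows of the table.
ts = datetime(2020, 1, 1, tzinfo=timezone.utc)
records = [
    OriginMetadataRecord("github.com/foo/bar", "github.com", "github_loader", ts, b"rm"),
    OriginMetadataRecord("github.com/foo/bar", "wikidata", "wikidata_gatherer", ts, b"rm"),
    OriginMetadataRecord("github.com/foo/bar", "fsf.org", "fsf_gatherer", ts, b"rm"),
]
```

The point of the shape is that one origin can carry several records, each from a different authority and tool, without any of them overwriting the others.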

types of authorities:

code hosts
deposit clients
registries

points to clarify:

metadata found on a mirror that kept the data from a different authority (Antelink scenario)
do we want to keep metadata found without the associated origin (a.k.a. code repository)?
do we want to document extrinsic metadata about other granularity levels (content, directory, revision, snapshot)? does this type of metadata exist?
should we create new tools for lister/loader type named metadata_loader/lister or refactor existing tools?
this was not discussed, but there is a table for the providers, now called authority, where metadata about the providers should be kept in a known metadata schema (swh/devel/swh-storage!905 (closed)#inline-9677)

Actions on swh/devel/swh-storage!905 (closed): I propose to abandon this diff and open a new one that takes the comments from the discussion into account, in particular:
- specify authority instead of provider and add a description of tools
- specify that raw metadata will be kept anyway, before syntactic and semantic translation

Here is the current implementation: swh:1:cnt:ea4b149cd76c67c304425771caa67ec5641a1b64;lines=381-428

Thanks a lot for this recap, Morane!

Regarding the two options you mention, I'm pretty sure we decided to go for the second (the 5-column table, at least conceptually). I'm not sure I understand the first option, nor whether it is an alternative to the second or an addition to it.

As an additional note: we discussed that the tool needs a version and/or configuration, similar to what we have for intrinsic metadata indexers (although maybe the actual representation can be improved). This is unlike what we do when loading content into the archive, but we agreed that, in theory, code loaders should also have a version/configuration; the fact that they don't is a bug that we do not want to replicate here.
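As a rough illustration of the version/configuration point, a tool could be identified by more than its name. The sketch below is an assumption for discussion only: `MetadataTool`, its fields, and the `key` method are hypothetical names, not the representation used by the indexers.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataTool:
    """Hypothetical tool identity: name plus version and configuration."""

    name: str
    version: str
    configuration: dict = field(default_factory=dict)

    def key(self):
        # Serialize the configuration with sorted keys so that two equal
        # configurations always yield the same identity.
        return (self.name, self.version, json.dumps(self.configuration, sort_keys=True))


# Same tool and configuration, written in a different key order.
a = MetadataTool("github_loader", "0.0.1", {"b": 1, "a": 2})
b = MetadataTool("github_loader", "0.0.1", {"a": 2, "b": 1})
# Same tool name, newer version: should count as a distinct tool.
c = MetadataTool("github_loader", "0.0.2", {"a": 2, "b": 1})
```

With this identity, metadata fetched by two versions of the same tool stays distinguishable, which is the property the code loaders currently lack.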

As discussed F2F, I concur we can restart from scratch with swh/devel/swh-storage!905 (closed), and I'll be happy to review its reincarnation when ready.

@vlorentz I think we can resolve this due to swh/devel/swh-storage!221 (closed)?

Indeed

closed
