Add helper functions to easily retrieve metadata associated with one origin URL

The discussion around swh-core!349 (merged) makes me believe that there's a gap in our tooling that swh.loader.metadata could fill in a more generic way: fetching (and parsing) the extrinsic metadata associated with one origin URL, so that other components can make use of it.

To do so, we need to implement the following helpers:

  • Map an origin URL to metadata fetchers
    • either via a heuristic based on the URL (e.g. github.com, gitlab.*)
    • or via access to the scheduler database, seeded with information fetched from the lister
  • For each metadata fetcher, do lightweight parsing of the fetched raw metadata (e.g. translate JSON to a Python dict)
  • Optionally run the Codemeta / MeSoCoRe mapping on the fetched metadata, if one exists
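
To make the first two helpers more concrete, here is a minimal sketch of the URL-based heuristic and the lightweight parsing step. Everything in it is an assumption for illustration: the `FETCHERS_BY_HOST` registry, the fetcher names, and the function names `fetchers_for_origin` and `parse_raw_metadata` are hypothetical and do not correspond to existing swh.loader.metadata APIs. The scheduler-database variant and the Codemeta mapping are left out.

```python
import json
from fnmatch import fnmatchcase
from typing import Any, Dict, List
from urllib.parse import urlparse

# Hypothetical registry: hostname glob pattern -> fetcher name.
# The names are placeholders, not actual swh.loader.metadata classes.
FETCHERS_BY_HOST: Dict[str, str] = {
    "github.com": "github",
    "gitlab.*": "gitlab",
}


def fetchers_for_origin(origin_url: str) -> List[str]:
    """Return the names of the fetchers whose host pattern matches the URL."""
    # urlparse lowercases the hostname; it is None for URLs without a host.
    host = urlparse(origin_url).hostname or ""
    return [
        name
        for pattern, name in FETCHERS_BY_HOST.items()
        if fnmatchcase(host, pattern)
    ]


def parse_raw_metadata(raw: bytes, fmt: str) -> Any:
    """Lightweight parsing of raw metadata; only JSON is handled here."""
    if fmt == "json":
        return json.loads(raw)
    raise ValueError(f"unsupported metadata format: {fmt}")
```

With this sketch, `fetchers_for_origin("https://gitlab.gnome.org/GNOME/gtk")` would select the hypothetical `gitlab` fetcher, while an origin on an unknown host yields an empty list, which is where the scheduler-database lookup could take over.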

Would this make sense? Should this set of helpers live in swh-loader-metadata?