From 22d85b22d6b210d6fb1507be320c717170d22d4c Mon Sep 17 00:00:00 2001
From: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Wed, 23 Jan 2019 14:34:22 +0100
Subject: [PATCH 1/2] Document the metadata translation process.

---
 docs/metadata-workflow.rst | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/docs/metadata-workflow.rst b/docs/metadata-workflow.rst
index 7b443494..895ab52d 100644
--- a/docs/metadata-workflow.rst
+++ b/docs/metadata-workflow.rst
@@ -11,6 +11,9 @@ In order to deduplicate the work between origins, we split this work between
 multiple indexers, which coordinate with each other and save their results at
 each step in the indexer storage.
 
+Indexer architecture
+--------------------
+
 .. thumbnail:: images/tasks-metadata-indexers.svg
 
 
@@ -42,6 +45,7 @@ To do so, it lists files in that directory, and looks for known names, such as
 `codemeta.json`, `package.json`, or `pom.xml`. If there are any, it runs the
 Content Metadata Indexer on them, which in turn fetches their contents and
 runs them through extraction dictionaries/mappings.
+See below for details.
 
 Their results are saved in a database (the indexer storage), associated with
 the content and revision hashes.
@@ -62,3 +66,33 @@ The reason for this is to be able to perform searches on metadata, and
 efficiently find out which origins matched the pattern.
 Running that search on the `revision_metadata` table would require either
 a reverse lookup from revisions to origins, which is costly.
+
+
+Translation from language-specific metadata to CodeMeta
+-------------------------------------------------------
+
+Intrinsic metadata are extracted from files provided with a project's source
+code, and translated using `CodeMeta`_'s `crosswalk table`_.
+
+All input formats supported so far are straightforward dictionaries (e.g. JSON)
+or can be accessed as such (e.g. XML); and the first part of the translation is
+to map their keys to a term in the CodeMeta vocabulary.
+This is done by parsing the crosswalk table's `CSV file`_ and using it as a
+map between these two vocabularies; and this does not require any
+format-specific code in the indexers.
+
+The second part is to normalize values. As language-specific metadata files
+each have their way(s) of formatting these values, we need to turn them into
+the data type required by CodeMeta.
+This normalization makes up most of the code of
+:py:mod:`swh.indexer.metadata_dictionary`.
+
+
+Supported intrinsic metadata
+----------------------------
+
+
+
+.. _CodeMeta: https://codemeta.github.io/
+.. _crosswalk table: https://codemeta.github.io/crosswalk/
+.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
--
GitLab

From 80db120043cb0e0cb2017c6a21511efc7ea8586b Mon Sep 17 00:00:00 2001
From: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu, 24 Jan 2019 11:37:20 +0100
Subject: [PATCH 2/2] List supported metadata sources.

---
 docs/metadata-workflow.rst | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/docs/metadata-workflow.rst b/docs/metadata-workflow.rst
index 895ab52d..8260deff 100644
--- a/docs/metadata-workflow.rst
+++ b/docs/metadata-workflow.rst
@@ -87,12 +87,24 @@ the data type required by CodeMeta.
 This normalization makes up most of the code of
 :py:mod:`swh.indexer.metadata_dictionary`.
 
+.. _CodeMeta: https://codemeta.github.io/
+.. _crosswalk table: https://codemeta.github.io/crosswalk/
+.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
+
 
 Supported intrinsic metadata
 ----------------------------
 
+The following sources of intrinsic metadata are supported:
 
+* CodeMeta's `codemeta.json`_,
+* Maven's `pom.xml`_,
+* NPM's `package.json`_,
+* Python's `PKG-INFO`_,
+* Ruby's `.gemspec`_
 
-.. _CodeMeta: https://codemeta.github.io/
-.. _crosswalk table: https://codemeta.github.io/crosswalk/
-.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
+.. _codemeta.json: https://codemeta.github.io/terms/
+.. _pom.xml: https://maven.apache.org/pom.html
+.. _package.json: https://docs.npmjs.com/files/package.json
+.. _PKG-INFO: https://www.python.org/dev/peps/pep-0314/
+.. _.gemspec: https://guides.rubygems.org/specification-reference/
--
GitLab
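
Note on the two-step translation documented in patch 1/2: the sketch below illustrates key mapping through a crosswalk-style CSV followed by value normalization. It is not the code of `swh.indexer.metadata_dictionary`. The sample CSV rows, the column names `Property` and `NodeJS`, and the helpers `load_mapping`, `normalize_repository`, and `translate` are all assumptions made for illustration only.

```python
import csv
import io

# Hypothetical excerpt shaped like a crosswalk CSV: one column of CodeMeta
# terms, one column of keys used by a source format (here, package.json).
SAMPLE_CROSSWALK = """\
Property,NodeJS
name,name
description,description
codeRepository,repository
issueTracker,bugs
url,homepage
"""


def load_mapping(csv_text, source_column):
    """Build a {source key: CodeMeta term} map from the CSV text."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[source_column].strip()
        if key:  # skip terms with no equivalent in this source format
            mapping[key] = row["Property"]
    return mapping


def normalize_repository(value):
    """package.json allows a plain string or an object with a "url" key."""
    if isinstance(value, dict):
        return value.get("url")
    return value


# Hypothetical per-term normalizers; terms not listed pass through unchanged.
NORMALIZERS = {"codeRepository": normalize_repository}


def translate(metadata, mapping):
    """Step 1: rename keys through the mapping. Step 2: normalize values."""
    result = {}
    for key, value in metadata.items():
        term = mapping.get(key)
        if term is None:
            continue  # key is absent from the table, so it is dropped
        result[term] = NORMALIZERS.get(term, lambda v: v)(value)
    return result


mapping = load_mapping(SAMPLE_CROSSWALK, "NodeJS")
package_json = {
    "name": "left-pad",
    "repository": {"type": "git", "url": "https://example.org/left-pad.git"},
    "private": True,
}
translated = translate(package_json, mapping)
```

In this sketch only the normalizers are format-specific, which matches the documentation's point that key mapping needs no format-specific code while normalization makes up most of the module.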