docs: Explain the indexation workflow for extrinsic metadata
Closed
requested to merge generated-differential-D8087-source into generated-differential-D8087-target
3 unresolved threads
Build is green
Patch application report for D8087 (id=29192)
Could not rebase; Attempt merge onto c2742b5b...
Updating c2742b5..847bc29
Fast-forward
 docs/images/tasks-metadata-indexers.uml | 57 +++++-----
 docs/metadata-workflow.rst | 118 ++++++++++++++++-----
 swh/indexer/codemeta.py | 57 ++++++----
 swh/indexer/metadata.py | 10 +-
 swh/indexer/metadata_detector.py | 4 +-
 swh/indexer/metadata_dictionary/__init__.py | 12 ++-
 swh/indexer/metadata_dictionary/base.py | 63 +++++++----
 swh/indexer/metadata_dictionary/cff.py | 4 +-
 swh/indexer/metadata_dictionary/codemeta.py | 4 +-
 swh/indexer/metadata_dictionary/composer.py | 4 +-
 swh/indexer/metadata_dictionary/github.py | 68 ++++++++++--
 swh/indexer/metadata_dictionary/maven.py | 4 +-
 swh/indexer/metadata_dictionary/npm.py | 4 +-
 swh/indexer/metadata_dictionary/python.py | 4 +-
 swh/indexer/metadata_dictionary/ruby.py | 11 +-
 .../tests/metadata_dictionary/test_github.py | 26 ++++-
 16 files changed, 307 insertions(+), 143 deletions(-)
Changes applied before test
commit 847bc29eb5e77bc70380336c8dce480be5019251
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jul 6 16:17:22 2022 +0200

    docs: Explain the indexation workflow for extrinsic metadata

commit 724034de625f3a388a261e1eed3e6a2c9620c539
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 5 17:46:33 2022 +0200

    docs: Update description of the metadata workflow

    1. indexers call themselves directly instead of going through the scheduler
    2. metadata is attached to directories instead of revisions

commit 3458892274226aabe490a795abe5d6fce990be99
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Jul 5 16:43:02 2022 +0200

    github: Translate stargazers_count and watchers_count

commit e177c77baf48b69e420a3eed6b9125a7f209947f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jul 4 18:26:24 2022 +0200

    Simplify codemeta.make_absolute_uri()

commit dd9adebeca15c697cc27011693c8d84f6ec1544e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jul 4 18:25:25 2022 +0200

    Document codemeta.make_absolute_uri()

commit 358ee08416dd847d7ebbddd0c721d7a287149175
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jul 4 13:55:19 2022 +0200

    Use compact URIs for ForgeFed and ActivityStreams

    It makes resulting documents (usually) shorter, and tests more readable.

commit d41f26eef0561fd41932eb688bc6908f2253ef4c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jul 4 13:37:58 2022 +0200

    Use separate base classes for intrinsic and extrinsic mappings

    detect_metadata_files and extrinsic_metadata_formats (respectively) are
    somewhat mutually exclusive, so it does not make much sense to have them
    in the same class and MAPPINGS dict
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/343/ for more details.
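For illustration only, the idea behind two of the commits listed above ("Use compact URIs for ForgeFed and ActivityStreams" and "github: Translate stargazers_count and watchers_count") can be sketched roughly as follows. This is not the code from the patch: the names `compact_uri`, `translate_counts` and `COUNT_PROPERTIES`, the choice of `as:likes` and `as:followers` as target properties, and the shape of the resulting document are all assumptions made for this sketch. The two namespace URIs are the ones published by ActivityStreams and ForgeFed.

```python
import json

# Namespaces published by ActivityStreams and ForgeFed.
AS_NS = "https://www.w3.org/ns/activitystreams#"
FORGE_NS = "https://forgefed.org/ns#"

# Prefix table used to shorten absolute URIs (hypothetical helper data).
PREFIXES = {"as": AS_NS, "forge": FORGE_NS}


def compact_uri(uri: str) -> str:
    """Return 'prefix:term' when the URI falls in a known namespace,
    otherwise return the URI unchanged."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


# Assumed mapping from GitHub API fields to vocabulary properties;
# the real mapping in swh-indexer may differ.
COUNT_PROPERTIES = {
    "stargazers_count": AS_NS + "likes",
    "watchers_count": AS_NS + "followers",
}


def translate_counts(raw: dict) -> dict:
    """Translate integer counters of a raw API document into a
    JSON-LD-like dict keyed by compact URIs. Other fields are ignored."""
    doc = {}
    for key, prop in COUNT_PROPERTIES.items():
        value = raw.get(key)
        # Skip missing or non-integer values (bool is a subclass of int).
        if isinstance(value, int) and not isinstance(value, bool):
            doc[compact_uri(prop)] = {
                "type": compact_uri(AS_NS + "Collection"),
                compact_uri(AS_NS + "totalItems"): value,
            }
    return doc


if __name__ == "__main__":
    raw = {"name": "example", "stargazers_count": 42, "watchers_count": 7}
    print(json.dumps(translate_counts(raw), indent=2))
```

Keeping the prefix table separate from the mapping means a test can compare against short keys such as `as:likes` instead of full URIs, which is the readability benefit the commit message mentions.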
 .. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv


+Extrinsic metadata
+------------------
+
+The :term:`extrinsic metadata` indexer works very differently from
+the :term:`intrinsic metadata` indexers we saw above.
+While the latter extract metadata from software artefacts (files and directories)
+which are already a core part of the archive, the former extracts such data from
+API calls pulled from forges and package managers, or pushed via the
[...]
+but in the "raw extrinsic metadata" form, which needs to be translated to a common
+vocabulary to be useful, as with intrinsic metadata.
+
+The common vocabulary we chose is JSON-LD, with both CodeMeta and
+`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_)
+
+.. _ForgeFed's vocabulary: https://forgefed.org/vocabulary.html
+.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/
+
+Instead of the four-step architecture above, the extrinsic-metadata indexer
+is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
+and produces new indexed entries in the database as they come.
+
+The caveat is that, while intrinsic metadata are always unambiguously authoritative
+(they are contained by their own origin repository, therefore they were added by
+the origin's "owners"), extrinsic metadata can be authored by third-parties.
[...]
 :nostderr:


-Adding support for additional ecosystem-specific metadata
----------------------------------------------------------
+
+
+Tutorials
+---------
+
+The rest of this page is made of two tutorials: one to index
+:term:`intrinsic metadata` (ie. from a file in a VCS or in a tarball),
+and one to index :term:`extrinsic metadata` (ie. obtained via external means,
+such as GitHub's or GitLab's APIs).

Build is green
Patch application report for D8087 (id=29289)
Rebasing onto 724034de...
Current branch diff-target is up to date.
Changes applied before test
commit 2dd2be9c1b17cdd8ac5520acc3935a2f34be98c8
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jul 6 16:17:22 2022 +0200

    docs: Explain the indexation workflow for extrinsic metadata
See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/356/ for more details.