
docs: Explain the indexation workflow for extrinsic metadata

3 unresolved threads

Depends on !347 (closed).


Migrated from D8087 (view on Phabricator)


Closed by Phabricator Migration user 2 years ago (Jul 11, 2022 3:30pm UTC)


Activity

  • Build is green

    Patch application report for D8087 (id=29192)

    Could not rebase; Attempt merge onto c2742b5b...

    Updating c2742b5..847bc29
    Fast-forward
     docs/images/tasks-metadata-indexers.uml            |  57 +++++-----
     docs/metadata-workflow.rst                         | 118 ++++++++++++++++-----
     swh/indexer/codemeta.py                            |  57 ++++++----
     swh/indexer/metadata.py                            |  10 +-
     swh/indexer/metadata_detector.py                   |   4 +-
     swh/indexer/metadata_dictionary/__init__.py        |  12 ++-
     swh/indexer/metadata_dictionary/base.py            |  63 +++++++----
     swh/indexer/metadata_dictionary/cff.py             |   4 +-
     swh/indexer/metadata_dictionary/codemeta.py        |   4 +-
     swh/indexer/metadata_dictionary/composer.py        |   4 +-
     swh/indexer/metadata_dictionary/github.py          |  68 ++++++++++--
     swh/indexer/metadata_dictionary/maven.py           |   4 +-
     swh/indexer/metadata_dictionary/npm.py             |   4 +-
     swh/indexer/metadata_dictionary/python.py          |   4 +-
     swh/indexer/metadata_dictionary/ruby.py            |  11 +-
     .../tests/metadata_dictionary/test_github.py       |  26 ++++-
     16 files changed, 307 insertions(+), 143 deletions(-)
    Changes applied before test
    commit 847bc29eb5e77bc70380336c8dce480be5019251
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Wed Jul 6 16:17:22 2022 +0200
    
        docs: Explain the indexation workflow for extrinsic metadata
    
    commit 724034de625f3a388a261e1eed3e6a2c9620c539
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Tue Jul 5 17:46:33 2022 +0200
    
        docs: Update description of the metadata workflow
        
        1. indexers call themselves directly instead of going through the scheduler
        2. metadata is attached to directories instead of revisions
    
    commit 3458892274226aabe490a795abe5d6fce990be99
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Tue Jul 5 16:43:02 2022 +0200
    
        github: Translate stargazers_count and watchers_count
    
    commit e177c77baf48b69e420a3eed6b9125a7f209947f
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Jul 4 18:26:24 2022 +0200
    
        Simplify codemeta.make_absolute_uri()
    
    commit dd9adebeca15c697cc27011693c8d84f6ec1544e
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Jul 4 18:25:25 2022 +0200
    
        Document codemeta.make_absolute_uri()
    
    commit 358ee08416dd847d7ebbddd0c721d7a287149175
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Jul 4 13:55:19 2022 +0200
    
        Use compact URIs for ForgeFed and ActivityStreams
        
        It makes resulting documents (usually) shorter, and tests more readable.
    
    commit d41f26eef0561fd41932eb688bc6908f2253ef4c
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Jul 4 13:37:58 2022 +0200
    
        Use separate base classes for intrinsic and extrinsic mappings
        
        detect_metadata_files and extrinsic_metadata_formats (respectively) are somewhat
        mutually exclusive, so it does not make much sense to have them in the same
        class and MAPPINGS dict

    See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/343/ for more details.

 .. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
 
 
+Extrinsic metadata
+------------------
+
+The :term:`extrinsic metadata` indexer works very differently from
+the :term:`intrinsic metadata` indexers we saw above.
+While the latter extract metadata from software artefacts (files and directories)
+which are already a core part of the archive, the former extracts such data from
+API calls pulled from forges and package managers, or pushed via the
  • +but in the "raw extrinsic metadata" form, which needs to be translated to a common
    +vocabulary to be useful, as with intrinsic metadata.
    +
    +The common vocabulary we chose is JSON-LD, with both CodeMeta and
    +`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_)
    +
    +.. _ForgeFed's vocabulary: https://forgefed.org/vocabulary.html
    +.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/
    +
    +Instead of the four-step architecture above, the extrinsic-metadata indexer
    +is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
    +and produces new indexed entries in the database as they come.
    +
    +The caveat is that, while intrinsic metadata are always unambiguously authoritative
    +(they are contained by their own origin repository, therefore they were added by
    +the origin's "owners"), extrinsic metadata can be authored by third-parties.
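
For readers of this hunk, the translation step it describes (turning "raw extrinsic metadata" into a JSON-LD document that uses compact URIs) can be pictured with the minimal Python sketch below. It is not the swh-indexer implementation: the names `compact` and `translate_github`, the namespace URIs in `PREFIXES`, and the choice of `as:likes` and `as:followers` as targets for `stargazers_count` and `watchers_count` are all assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical prefix table. The namespace URIs are assumptions for this
# sketch, not values taken from the merge request.
PREFIXES: Dict[str, str] = {
    "schema": "http://schema.org/",
    "forge": "https://forgefed.org/ns#",
    "as": "https://www.w3.org/ns/activitystreams#",
}


def compact(uri: str) -> str:
    """Return 'prefix:suffix' when the URI starts with a known namespace,
    otherwise return the URI unchanged."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


def _collection(count: int) -> Dict[str, Any]:
    """Wrap a bare counter in a collection-like JSON-LD node."""
    return {"@type": "as:OrderedCollection", "as:totalItems": count}


def translate_github(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a GitHub-API-like repository payload (hypothetical shape)
    into a JSON-LD document using compact URIs."""
    doc: Dict[str, Any] = {
        "@context": dict(PREFIXES),
        "@type": compact("https://forgefed.org/ns#Repository"),
    }
    if raw.get("description"):
        doc[compact("http://schema.org/description")] = raw["description"]
    # Assumed term choices: stars as "likes", watchers as "followers".
    if "stargazers_count" in raw:
        doc["as:likes"] = _collection(raw["stargazers_count"])
    if "watchers_count" in raw:
        doc["as:followers"] = _collection(raw["watchers_count"])
    return doc


if __name__ == "__main__":
    sample = {"description": "An example repository",
              "stargazers_count": 3, "watchers_count": 2}
    print(json.dumps(translate_github(sample), indent=2))
```

Keeping the prefix table in one place means the same mapping both populates `@context` and shortens the keys, which is one way to get the shorter documents and more readable tests mentioned in the "Use compact URIs for ForgeFed and ActivityStreams" commit message.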
  •  :nostderr:
     
     
    -Adding support for additional ecosystem-specific metadata
    ----------------------------------------------------------
    +
    +
    +Tutorials
    +---------
    +
    +The rest of this page is made of two tutorials: one to index
    +:term:`intrinsic metadata` (ie. from a file in a VCS or in a tarball),
    +and one to index :term:`extrinsic metadata` (ie. obtained via external means,
    +such as GitHub's or GitLab's APIs).
  • Merge request was accepted

  • Antoine Lambert approved this merge request


  • Author Maintainer

    Fix typos

  • Author Maintainer

    Merge request was merged

  • closed

  • Build is green

    Patch application report for D8087 (id=29289)

    Rebasing onto 724034de...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 2dd2be9c1b17cdd8ac5520acc3935a2f34be98c8
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Wed Jul 6 16:17:22 2022 +0200
    
        docs: Explain the indexation workflow for extrinsic metadata

    See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/356/ for more details.
