Commits on Source: 239 (139 additional commits have been omitted to prevent performance issues).
Showing 419 additions and 351 deletions.
# Changes here will be overwritten by Copier
_commit: v0.3.5
_src_path: https://gitlab.softwareheritage.org/swh/devel/swh-py-template.git
description: Software Heritage indexer
distribution_name: swh-indexer
have_cli: true
have_workers: true
package_root: swh/indexer
project_name: swh.indexer
python_minimal_version: '3.9'
readme_format: rst
# python: Reformat code with black
5aa97ccd6ce29d6f66eb093c5d06e9030d7449fd
0f847f6119195649fe4108b776b9244940ebdb46
2e9f1d3e896062ae6b3cd99dc1a5d4148beebbf7
*.egg-info/
*.pyc
*.sw?
*~
/.coverage
/.coverage.*
.coverage
.eggs/
.hypothesis
.mypy_cache
.tox
__pycache__
*.egg-info/
build/
dist/
version.txt
/sql/createdb-stamp
/sql/filldb-stamp
.tox/
.hypothesis/
.mypy_cache/
.vscode/
\ No newline at end of file
# these are symlinks created by a hook in swh-docs' main sphinx conf.py
docs/README.rst
docs/README.md
# this should be a symlink for people who want to build the sphinx doc
# without using tox, generally created by the swh-env/bin/update script
docs/Makefile.sphinx
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.4.0
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: check-json
- id: check-yaml
- repo: https://gitlab.com/pycqa/flake8
rev: 3.8.3
- repo: https://github.com/python/black
rev: 25.9.0
hooks:
- id: black
- repo: https://github.com/PyCQA/isort
rev: 6.1.0
hooks:
- id: isort
- repo: https://github.com/pycqa/flake8
rev: 7.3.0
hooks:
- id: flake8
additional_dependencies: [flake8-bugbear==24.12.12, flake8-pyproject]
- repo: https://github.com/codespell-project/codespell
rev: v1.16.0
rev: v2.4.1
hooks:
- id: codespell
exclude: ^(swh/indexer/data/codemeta/crosswalk.csv)$
name: Check source code spelling
args: [-L assertIn]
exclude: ^(swh/indexer/data/)
stages: [pre-commit]
- id: codespell
name: Check commit message spelling
stages: [commit-msg]
- repo: local
hooks:
- id: mypy
@@ -25,25 +43,13 @@ repos:
pass_filenames: false
language: system
types: [python]
- repo: https://github.com/PyCQA/isort
rev: 5.5.2
hooks:
- id: isort
- repo: https://github.com/python/black
rev: 19.10b0
hooks:
- id: black
# unfortunately, we are far from being able to enable this...
# - repo: https://github.com/PyCQA/pydocstyle.git
# rev: 4.0.0
# hooks:
# - id: pydocstyle
# name: pydocstyle
# description: pydocstyle is a static analysis tool for checking compliance with Python docstring conventions.
# entry: pydocstyle --convention=google
# language: python
# types: [python]
- id: twine-check
name: twine check
description: call twine check when pushing an annotated release tag
entry: bash -c "ref=$(git describe) &&
[[ $ref =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] &&
(python3 -m build --sdist && twine check $(ls -t dist/* | head -1)) || true"
pass_filenames: false
stages: [pre-push]
language: python
additional_dependencies: [twine, build]
@@ -6,7 +6,7 @@ In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socio-economic status,
and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
......
Kumar Shivendu
Siddharth Ravikumar
Thibault Allançon
Satvik Vemuganti
include README.md
include Makefile
include requirements*.txt
include version.txt
include conftest.py
recursive-include sql *
recursive-include swh/indexer/sql *.sql
recursive-include swh/indexer/data *
recursive-include swh py.typed
TESTFLAGS=--hypothesis-profile=fast
TESTFLAGS += --hypothesis-profile=fast
swh-indexer
============
Software Heritage - Indexer
===========================
Tools to compute multiple indexes on SWH's raw contents:
- content:
- mimetype
- ctags
- language
- fossology-license
- metadata
- revision:
- metadata
- origin:
- metadata (intrinsic, using the content indexer; and extrinsic)
An indexer is in charge of:
- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db
There are multiple indexers working on different object types:
- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers
Indexation procedure:
- receive a batch of ids
- retrieve the associated data, depending on the object type
- compute an index for each object
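That procedure can be sketched as a simple batch loop; the callables below (`fetch_data`, `compute_index`, `store_results`) are hypothetical stand-ins for the storage lookups and per-indexer logic, not the actual swh-indexer API:

```python
# Sketch of the batch indexing procedure described above.

def run_indexer(ids, fetch_data, compute_index, store_results):
    """Receive a batch of ids, retrieve data, compute and store indexes."""
    results = []
    for id_ in ids:
        data = fetch_data(id_)  # retrieve the associated data
        if data is None:        # unknown object: skip it
            continue
        results.append(compute_index(id_, data))  # compute the index
    store_results(results)      # persist in the indexer storage
    return results
```

Real indexers plug their own retrieval (object storage for contents, graph storage for directories and origins) and computation into this kind of loop.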
@@ -32,18 +37,13 @@ Current content indexers:
- mimetype (queue swh_indexer_content_mimetype): detect the encoding
and mimetype
- language (queue swh_indexer_content_language): detect the
programming language
- ctags (queue swh_indexer_content_ctags): compute tags information
- fossology-license (queue swh_indexer_fossology_license): compute the
license
- metadata: translate file into translated_metadata dict
- metadata: translate files from ecosystem-specific formats to JSON-LD
(using schema.org/CodeMeta vocabulary)
Current revision indexers:
Current origin indexers:
- metadata: detects files containing metadata and retrieves translated_metadata
in content_metadata table in storage or run content indexer to translate
files.
- metadata: translate files from ecosystem-specific formats to JSON-LD
(using schema.org/CodeMeta and ForgeFed vocabularies)
# Copyright (C) 2020 The Software Heritage developers
# Copyright (C) 2020-2025 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from hypothesis import settings
import pytest
# define tests profile. Full documentation is at:
# https://hypothesis.readthedocs.io/en/latest/settings.html#settings-profiles
@@ -18,14 +17,6 @@ collect_ignore = ["swh/indexer/storage/api/wsgi.py"]
# we use the various swh fixtures
pytest_plugins = [
"swh.scheduler.pytest_plugin",
"swh.journal.pytest_plugin",
"swh.storage.pytest_plugin",
"swh.core.db.pytest_plugin",
]
@pytest.fixture(scope="session")
def swh_scheduler_celery_includes(swh_scheduler_celery_includes):
return swh_scheduler_celery_includes + [
"swh.indexer.tasks",
]
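The hypothesis settings profiles mentioned in the conftest comment above are typically registered along these lines (a sketch; the profile name and example count here are assumptions, not the actual values used by swh):

```python
from hypothesis import settings

# Register a "fast" profile with fewer examples for quick local runs,
# then load it; tests can also select it with --hypothesis-profile=fast.
settings.register_profile("fast", max_examples=10)
settings.load_profile("fast")
```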
include ../../swh-docs/Makefile.sphinx
-include Makefile.local
include Makefile.sphinx
../README.md
\ No newline at end of file
Hacking on swh-indexer
======================
This tutorial will guide you through hacking on swh-indexer.
If you do not have a local copy of the Software Heritage archive, first follow
the `getting started tutorial
<https://docs.softwareheritage.org/devel/getting-started.html>`_.
Configuration files
-------------------
You will need the following YAML configuration files to run the swh-indexer
commands:
- Orchestrator at
``~/.config/swh/indexer/orchestrator.yml``
.. code-block:: yaml
indexers:
mimetype:
check_presence: false
batch_size: 100
- Orchestrator-text at
``~/.config/swh/indexer/orchestrator-text.yml``
.. code-block:: yaml
indexers:
# language:
# batch_size: 10
# check_presence: false
fossology_license:
batch_size: 10
check_presence: false
# ctags:
# batch_size: 2
# check_presence: false
- Mimetype indexer at
``~/.config/swh/indexer/mimetype.yml``
.. code-block:: yaml
# storage to read sha1's metadata (path)
# storage:
# cls: local
# db: "service=swh-dev"
# objstorage:
# cls: pathslicing
# root: /home/storage/swh-storage/
# slicing: 0:1/1:5
storage:
cls: remote
url: http://localhost:5002/
indexer_storage:
cls: remote
args:
url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your need
# locally: this needs to match your storage's setup
objstorage:
cls: pathslicing
slicing: 0:1/1:5
root: /home/storage/swh-storage/
destination_task: swh.indexer.tasks.SWHOrchestratorTextContentsTask
rescheduling_task: swh.indexer.tasks.SWHContentMimetypeTask
- Fossology indexer at
``~/.config/swh/indexer/fossology_license.yml``
.. code-block:: yaml
# storage to read sha1's metadata (path)
# storage:
# cls: local
# db: "service=swh-dev"
# objstorage:
# cls: pathslicing
# root: /home/storage/swh-storage/
# slicing: 0:1/1:5
storage:
cls: remote
url: http://localhost:5002/
indexer_storage:
cls: remote
args:
url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your need
# locally: this needs to match your storage's setup
objstorage:
cls: pathslicing
slicing: 0:1/1:5
root: /home/storage/swh-storage/
workdir: /tmp/swh/worker.indexer/license/
tools:
name: 'nomos'
version: '3.1.0rc2-31-ga2cbb8c'
configuration:
command_line: 'nomossa <filepath>'
- Worker at
``~/.config/swh/worker.yml``
.. code-block:: yaml
task_broker: amqp://guest@localhost//
task_modules:
- swh.loader.svn.tasks
- swh.loader.tar.tasks
- swh.loader.git.tasks
- swh.storage.archiver.tasks
- swh.indexer.tasks
- swh.indexer.orchestrator
task_queues:
- swh_loader_svn
- swh_loader_tar
- swh_reader_git_to_azure_archive
- swh_storage_archive_worker_to_backend
- swh_indexer_orchestrator_content_all
- swh_indexer_orchestrator_content_text
- swh_indexer_content_mimetype
- swh_indexer_content_language
- swh_indexer_content_ctags
- swh_indexer_content_fossology_license
- swh_loader_svn_mount_and_load
- swh_loader_git_express
- swh_loader_git_archive
- swh_loader_svn_archive
task_soft_time_limit: 0
Database
--------
swh-indexer uses a database to store the indexed content. The default
database is expected to be called swh-indexer-dev.
Add ``swh-dev`` and ``swh-indexer-dev`` entries to
the ``~/.pg_service.conf`` and ``~/.pgpass`` files, which are PostgreSQL's
client configuration files.
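For example, the ``~/.pg_service.conf`` entries could look like this (values are illustrative; adapt host, port, and user to your local setup, and add matching ``host:port:dbname:user:password`` lines to ``~/.pgpass``):

```ini
[swh-dev]
host=localhost
port=5432
dbname=swh-dev
user=postgres

[swh-indexer-dev]
host=localhost
port=5432
dbname=swh-indexer-dev
user=postgres
```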
Add data to local DB
--------------------
From within the ``swh-environment``, run the following command::
make rebuild-testdata
and fetch some real data to work with, using::
python3 -m swh.loader.git.updater --origin-url <github url>
Then you can list all content files using this script::
#!/usr/bin/env bash
psql service=swh-dev -c "copy (select sha1 from content) to stdout" | sed -e 's/^\\x//g'
Run the indexers
-----------------
Use the list of contents to feed the indexers with the
following command::
./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
Activate the workers
--------------------
To have workers consume messages from the different RabbitMQ queues
(RabbitMQ should already be installed through the dependencies),
run the following command in a dedicated terminal::
python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
--pool=prefork \
--concurrency=1 \
-Ofair \
--loglevel=info \
--without-gossip \
--without-mingle \
--without-heartbeat 2>&1
With this command, the worker will consume messages from RabbitMQ using the
worker configuration file.
Note: the fossology_license indexer requires the fossology-nomossa package,
which is in our `public debian repository
<https://wiki.softwareheritage.org/index.php?title=Debian_packaging#Package_repository>`_.
tasks-metadata-indexers.svg
*.svg
@@ -2,10 +2,16 @@
UML_DIAGS_SRC = $(wildcard *.uml)
UML_DIAGS = $(patsubst %.uml,%.svg,$(UML_DIAGS_SRC))
all: $(UML_DIAGS)
DOT_DIAGS_SRC = $(wildcard *.dot)
DOT_DIAGS = $(patsubst %.dot,%.svg,$(DOT_DIAGS_SRC))
all: $(UML_DIAGS) $(DOT_DIAGS)
%.svg: %.uml
DISPLAY="" plantuml -tsvg $<
%.svg: %.dot
dot $< -T svg -o $@
clean:
-rm -f $(DEP_GRAPHS) $(UML_DIAGS)
-rm -f $(DEP_GRAPHS) $(UML_DIAGS) $(DOT_DIAGS)
digraph metadata_flow {
subgraph cluster_forges {
style=invis;
origin_vcs [label="Version Control Systems\n(Git, SVN, ...)"];
origin_pm [label="Package Managers\n(NPM, PyPI, Debian, ...)"];
}
subgraph internet {
rank=same;
deposit_client [label="Deposit Clients\n(HAL, IPOL, eLife, Intel, ...)"];
registries [label="Registries\n(Wikidata, ...)"];
}
subgraph cluster_SWH {
label="Software Heritage";
labeljust="r";
labelloc="b";
loader_vcs [label="VCS loader", shape="box"];
loader_pm [label="PM loader", shape="box"];
deposit_server [label="Deposit server", shape="box"];
indexer_extr [label="extrinsic metadata indexer\n(translate to Codemeta)", shape="box"];
indexer_intr [label="intrinsic metadata indexer\n(translate to Codemeta)", shape="box"];
registry_fetcher[label="?", style="dashed", shape="box"];
storage [label="\nMain Storage\n(swh-storage and\nswh-objstorage)", shape=cylinder];
remd_storage [label="\nRaw Extrinsic\nMetadata Storage", shape=cylinder];
indexed_storage [label="\nIndexed\nMetadata Storage\n(search, idx-storage)", shape=cylinder];
webapp [label="Web Interface", shape="box"];
}
subgraph users {
browser [label="Web Browser", shape="box"]
}
origin_vcs -> loader_vcs [label="pull"];
loader_vcs -> storage;
origin_pm -> loader_pm [label="pull"]
loader_pm -> {storage, remd_storage};
deposit_client -> deposit_server [label="push\n(SWORD + Codemeta)"];
deposit_server -> {storage, remd_storage};
registries -> registry_fetcher -> remd_storage [style="dashed"];
storage -> indexer_intr [label="all kinds of\nmetadata formats"];
indexer_intr -> indexed_storage [label="only Codemeta"];
remd_storage -> indexer_extr [label="all kinds of\nmetadata formats"];
indexer_extr-> indexed_storage;
{storage, remd_storage, indexed_storage} -> webapp;
webapp -> browser [label="search, display,\nBibTeX export,\ndownload, ..."];
}
@startuml
participant LOADERS as "Metadata Loaders"
participant STORAGE as "Graph Storage"
participant JOURNAL as "Journal"
participant IDX_REM_META as "REM Indexer"
participant IDX_STORAGE as "Indexer Storage"
activate IDX_STORAGE
activate STORAGE
activate JOURNAL
activate LOADERS
LOADERS->>STORAGE: new REM (Raw Extrinsic Metadata) object\n for Origin http://example.org/repo.git\nor object swh:1:dir:...
STORAGE->>JOURNAL: new REM object
deactivate LOADERS
JOURNAL->>IDX_REM_META: run indexers on REM object
activate IDX_REM_META
IDX_REM_META->>IDX_REM_META: recognize REM object (gitea/github/deposit/...)
IDX_REM_META->>IDX_REM_META: parse REM object
alt If the REM object describes an origin
IDX_REM_META->>IDX_STORAGE: origin_extrinsic_metadata_add(id="http://example.org/repo.git", {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REM_META: ok
end
alt If the REM object describes a directory
IDX_REM_META->>IDX_STORAGE: directory_extrinsic_metadata_add(id="swh:1:dir:...", {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REM_META: ok
end
deactivate IDX_REM_META
@enduml
@startuml
participant LOADERS as "Loaders"
participant STORAGE as "Graph Storage"
participant JOURNAL as "Journal"
participant SCHEDULER as "Scheduler"
participant IDX_ORIG_META as "Origin Metadata Indexer"
participant IDX_ORIG_HEAD as "Origin-Head Indexer"
participant IDX_REV_META as "Revision Metadata Indexer"
participant IDX_DIR_META as "Directory Metadata Indexer"
participant IDX_CONT_META as "Content Metadata Indexer"
participant IDX_ORIG_META as "Origin Metadata Indexer"
participant IDX_STORAGE as "Indexer Storage"
participant STORAGE as "Graph Storage"
participant OBJ_STORAGE as "Object Storage"
activate OBJ_STORAGE
activate IDX_STORAGE
activate STORAGE
activate JOURNAL
activate SCHEDULER
activate IDX_ORIG_META
activate LOADERS
LOADERS->>JOURNAL: Origin 42 was added/revisited
LOADERS->>STORAGE: Repository content
LOADERS->>STORAGE: Origin http://example.org/repo.git\nwas added/revisited
STORAGE->>JOURNAL: Origin http://example.org/repo.git\nwas added/revisited
deactivate LOADERS
JOURNAL->>SCHEDULER: run indexers on origin 42
JOURNAL->>IDX_ORIG_META: run indexers on origin\nhttp://example.org/repo.git
SCHEDULER->>IDX_ORIG_HEAD: Find HEAD revision of 42
IDX_ORIG_META->>IDX_ORIG_HEAD: Find HEAD revision of\nhttp://example.org/repo.git
activate IDX_ORIG_HEAD
IDX_ORIG_HEAD->>STORAGE: snapshot_get_latest(origin=42)
IDX_ORIG_HEAD->>STORAGE: snapshot_get_latest(origin="http://example.org/repo.git")
STORAGE->>IDX_ORIG_HEAD: branches
IDX_ORIG_HEAD->>SCHEDULER: run Revision Metadata Indexer\non revision 42abcdef\n(head of origin 42)
IDX_ORIG_HEAD->>IDX_ORIG_META: run Revision Metadata Indexer\non revision 42abcdef (head of origin\nhttp://example.org/repo.git)
deactivate IDX_ORIG_HEAD
SCHEDULER->>IDX_REV_META: Index revision 42abcdef\n(head of origin 42)
activate IDX_REV_META
IDX_ORIG_META->>STORAGE: revision_get(sha1=42abcdef)
STORAGE->>IDX_ORIG_META: {id: 42abcdef, message: "Commit message", directory: 456789ab, ...}
IDX_REV_META->>STORAGE: revision_get(sha1=42abcdef)
STORAGE->>IDX_REV_META: {id: 42abcdef, message: "Commit message", directory: 456789ab, ...}
IDX_ORIG_META->>IDX_DIR_META: Index directory 456789ab\n(head of origin http://example.org/repo.git)
activate IDX_DIR_META
IDX_REV_META->>STORAGE: directory_ls(sha1=456789ab)
STORAGE->>IDX_REV_META: [{id: 1234cafe, name: "package.json", type: file, ...}, {id: cafe4321, name: "README", type: file, ...}, ...]
IDX_DIR_META->>STORAGE: directory_ls(sha1=456789ab)
STORAGE->>IDX_DIR_META: [{id: 1234cafe, name: "package.json", type: file, ...}, {id: cafe4321, name: "README", type: file, ...}, ...]
IDX_REV_META->>IDX_REV_META: package.json is a metadata file
IDX_DIR_META->>IDX_DIR_META: package.json is a metadata file
IDX_REV_META->>IDX_STORAGE: content_metadata_get(sha1=1234cafe)
IDX_STORAGE->>IDX_REV_META: none / {author: "Jane Doe", ...}
IDX_DIR_META->>IDX_STORAGE: content_metadata_get(sha1=1234cafe)
IDX_STORAGE->>IDX_DIR_META: none / {author: "Jane Doe", ...}
alt If the storage answered "none"
IDX_REV_META->>IDX_CONT_META: Index file 1234cafe as an NPM metadata file
IDX_DIR_META->>IDX_CONT_META: Index file 1234cafe as an NPM metadata file
activate IDX_CONT_META
IDX_CONT_META->>OBJ_STORAGE: content_get 1234cafe
@@ -60,23 +61,17 @@
IDX_CONT_META->>IDX_STORAGE: content_metadata_add(sha1=1234cafe, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_CONT_META: ok
IDX_CONT_META->>IDX_REV_META: extracted: {author: "Jane Doe", ...}
IDX_CONT_META->>IDX_DIR_META: extracted: {author: "Jane Doe", ...}
deactivate IDX_CONT_META
end
IDX_REV_META->>IDX_STORAGE: revision_metadata_add(sha1=42abcdef, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REV_META: ok
IDX_REV_META->>SCHEDULER: run Origin Metadata Indexer\non origin 42; the head is 42abcdef
deactivate IDX_REV_META
SCHEDULER->>IDX_ORIG_META: Index origin 42; the head is 42abcdef
activate IDX_ORIG_META
IDX_DIR_META->>IDX_STORAGE: directory_metadata_add(sha1=456789ab, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_DIR_META: ok
end
IDX_ORIG_META->>IDX_STORAGE: revision_metadata_get(sha1=42abcdef)
IDX_STORAGE->>IDX_ORIG_META: {author: "Jane Doe", ...}
IDX_DIR_META->>IDX_ORIG_META: extracted: {author: "Jane Doe", ...}
deactivate IDX_DIR_META
IDX_ORIG_META->>IDX_STORAGE: origin_metadata_add(id=42, {author: "Jane Doe", ...})
IDX_ORIG_META->>IDX_STORAGE: origin_metadata_add(id="http://example.org/repo.git", {author: "Jane Doe", ...}, from_directory=456789ab)
IDX_STORAGE->>IDX_ORIG_META: ok
deactivate IDX_ORIG_META
......
.. _swh-indexer:
Software Heritage - Indexer
===========================
Tools and workers used to mine the content of the archive and extract derived
information from archive source code artifacts.
.. include:: README.rst
.. toctree::
:maxdepth: 1
:caption: Contents:
README.md
dev-info.rst
metadata-workflow.rst
swhpkg.rst
Reference Documentation
@@ -23,4 +18,12 @@ Reference Documentation
:maxdepth: 2
cli
/apidoc/swh.indexer
.. only:: standalone_package_doc
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@@ -12,13 +12,22 @@ multiple indexers, which coordinate with each other and save their results
at each step in the indexer storage.
Indexer architecture
--------------------
^^^^^^^^^^^^^^^^^^^^
.. thumbnail:: images/tasks-metadata-indexers.svg
Sequence diagram
""""""""""""""""
.. thumbnail:: images/tasks-intrinsic-metadata-indexers.svg
Data flow and storage
"""""""""""""""""""""
.. thumbnail:: images/metadata-flow.svg
Origin-Head Indexer
___________________
^^^^^^^^^^^^^^^^^^^
First, the Origin-Head indexer gets called externally, with an origin as
argument (or multiple origins, that are handled sequentially).
@@ -30,49 +39,50 @@ branch of origin is (the "Head branch") and what revision it points to
(the "Head").
Intrinsic metadata for that origin will be extracted from that revision.
It schedules a Revision Metadata Indexer task for that revision, with a
hint that the revision is the Head of that particular origin.
It schedules a Directory Metadata Indexer task for the root directory of
that revision.
Revision and Content Metadata Indexers
______________________________________
Directory and Content Metadata Indexers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These two indexers do the hard part of the work. The Revision Metadata
These two indexers do the hard part of the work. The Directory Metadata
Indexer fetches the root directory associated with a revision, then extracts
the metadata from that directory.
To do so, it lists files in that directory, and looks for known names, such
as `codemeta.json`, `package.json`, or `pom.xml`. If there are any, it
as :file:`codemeta.json`, :file:`package.json`, or :file:`pom.xml`. If there are any, it
runs the Content Metadata Indexer on them, which in turn fetches their
contents and runs them through extraction dictionaries/mappings.
See below for details.
Their results are saved in a database (the indexer storage), associated with
the content and revision hashes.
If it received a hint that this revision is the head of an origin, the
Revision Metadata Indexer then schedules the Origin Metadata Indexer
to run on that origin.
the content and directory hashes.
Origin Metadata Indexer
_______________________
^^^^^^^^^^^^^^^^^^^^^^^
The job of this indexer is very simple: it takes an origin identifier and
a revision hash, and copies the metadata of the former to a new table, to
associate it with the latter.
uses the Origin-Head and Directory indexers to get metadata from the head
directory of an origin, and copies that directory metadata to a new table,
to associate it with the origin.
The reason for this is to be able to perform searches on metadata, and
efficiently find out which origins match the pattern.
Running that search on the `revision_metadata` table would require
a reverse lookup from revisions to origins, which is costly.
Running that search on the ``directory_metadata`` table would require
a reverse lookup from directories to origins, which is costly.
Translation from language-specific metadata to CodeMeta
-------------------------------------------------------
Translation from ecosystem-specific metadata to CodeMeta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Intrinsic metadata are extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_.
Intrinsic metadata is extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_, which is vendored
in :file:`swh/indexer/data/codemeta/codemeta.csv`.
Ecosystems not yet included in Codemeta's crosswalk have their own
:file:`swh/indexer/data/*.csv` file, with one row for each CodeMeta property,
even when not supported by the ecosystem.
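In essence, each crosswalk row maps an ecosystem-specific key to a CodeMeta property, so the core of the translation is a key-renaming pass. A simplified sketch follows (the toy crosswalk and function below are illustrative only; the real code in ``swh/indexer/metadata_dictionary/`` loads the CSV files above and also handles nested values and normalization):

```python
# Toy crosswalk mapping a few ecosystem-specific keys to CodeMeta
# properties; real crosswalks are loaded from CSV files.
CROSSWALK = {
    "name": "name",
    "description": "description",
    "license": "license",
}

def translate(doc: dict) -> dict:
    """Rename known keys to CodeMeta properties, dropping unknown ones."""
    out = {"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}
    for key, value in doc.items():
        if key in CROSSWALK:
            out[CROSSWALK[key]] = value
    return out
```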
All input formats supported so far are straightforward dictionaries (e.g. JSON)
or can be accessed as such (e.g. XML); and the first part of the translation is
@@ -92,52 +102,125 @@ This normalization makes up for most of the code of
.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
Extrinsic metadata
------------------
Indexer architecture
^^^^^^^^^^^^^^^^^^^^
.. thumbnail:: images/tasks-extrinsic-metadata-indexers.svg
The :term:`extrinsic metadata` indexer works very differently from
the :term:`intrinsic metadata` indexers we saw above.
While the latter extract metadata from software artefacts (files and directories)
which are already a core part of the archive, the former extracts such data from
API calls pulled from forges and package managers, or pushed via the
:ref:`SWORD deposit <swh-deposit>`.
In order to preserve the original information verbatim, Software Heritage
itself stores the result of these calls, independently of indexers, in a
dedicated archive, as described in the :ref:`extrinsic-metadata-specification`.
In this section, we assume this information is already present in the archive,
but in the "raw extrinsic metadata" form, which needs to be translated to a common
vocabulary to be useful, as with intrinsic metadata.
The common vocabulary we chose is JSON-LD, with both CodeMeta and
`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_).
.. _ForgeFed's vocabulary: https://forgefed.org/spec/#vocab
.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/
Instead of the four-step architecture above, the extrinsic-metadata indexer
is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
and produces new indexed entries in the database as they come.
The caveat is that, while intrinsic metadata is always unambiguously
authoritative (it is contained in its own origin repository, therefore it was
added by the origin's "owners"), extrinsic metadata can be authored by third
parties.
Support for third-party authorities is currently not implemented for this reason;
so extrinsic metadata is only indexed when provided by the same
forge/package-repository as the origin the metadata is about.
Metadata on non-origin objects (typically directories) is also ignored for
this reason, for now.
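The authority filter described above boils down to comparing the metadata authority with the forge hosting the origin; a minimal sketch, with hypothetical function and argument names rather than the swh-indexer API:

```python
from urllib.parse import urlparse

def same_authority(origin_url: str, authority_url: str) -> bool:
    """True when the metadata authority is the forge hosting the origin,
    i.e. the only case currently indexed."""
    return urlparse(origin_url).netloc == urlparse(authority_url).netloc
```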
Assuming the metadata was provided by such an authority, it is then passed
to metadata mappings, which are identified by a MIME type (or custom format
name) they declare, rather than by filename.
Implementation status
---------------------
Supported intrinsic metadata
----------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following sources of intrinsic metadata are supported:
* `CITATION.cff <https://citation-file-format.github.io/>`_
* CodeMeta's `codemeta.json`_,
* PHP's `composer.json <https://getcomposer.org/doc/04-schema.md>`_
* Maven's `pom.xml`_,
* NPM's `package.json`_,
* NuGet's `.nuspec <https://learn.microsoft.com/en-us/nuget/reference/nuspec>`_
* Pub.Dev's `pubspec.yaml <https://dart.dev/tools/pub/pubspec>`_
* Python's `PKG-INFO`_,
* Ruby's `.gemspec`_
.. _codemeta.json: https://codemeta.github.io/terms/
.. _pom.xml: https://maven.apache.org/pom.html
.. _package.json: https://docs.npmjs.com/files/package.json
.. _package.json: https://docs.npmjs.com/cli/v11/configuring-npm/package-json
.. _PKG-INFO: https://www.python.org/dev/peps/pep-0314/
.. _.gemspec: https://guides.rubygems.org/specification-reference/
Supported extrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following sources of extrinsic metadata are supported:
* CodeMeta documents sent by clients of :ref:`swh-deposit <swh-deposit>` (`HAL <https://hal.science/>`_, `eLife <https://elifesciences.org/>`_, `IPOL <https://www.ipol.im/>`_, ...)
* Gitea's `"repoGet" API <https://docs.gitea.com/api/1.23/#tag/repository/operation/repoGet>`__
* GitHub's `"repo" API <https://docs.github.com/en/rest/repos/repos#get-a-repository>`__
Supported JSON-LD properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following properties may be found in the output of the metadata translation
(other than the ``codemeta`` mapping, which is the identity function, and
therefore supports all properties):
.. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta --exclude-mapping json-sword-codemeta --exclude-mapping sword-codemeta
   :nostderr:
Tutorials
---------
The rest of this page is made of two tutorials: one to index
:term:`intrinsic metadata` (i.e. from a file in a VCS or in a tarball),
and one to index :term:`extrinsic metadata` (i.e. obtained via external means,
such as GitHub's or GitLab's APIs).
Adding support for additional ecosystem-specific intrinsic metadata
-------------------------------------------------------------------
This section will guide you through adding code to the metadata indexer to
detect and translate new metadata formats.
First, you should start by picking one of the `CodeMeta crosswalks`_.
Then create a new file in :file:`swh-indexer/swh/indexer/metadata_dictionary/`
that will contain your code, and create a new class that inherits from helper
classes, with some documentation about your indexer:
.. code-block:: python

   from .base import DictMapping, SingleFileIntrinsicMapping
   from swh.indexer.codemeta import CROSSWALK_TABLE


   class MyMapping(DictMapping, SingleFileIntrinsicMapping):
       """Dedicated class for ..."""

       name = 'my-mapping'
       filename = b'the-filename'
.. _CodeMeta crosswalks: https://github.com/codemeta/codemeta/tree/master/crosswalks
And reference it from :const:`swh.indexer.metadata_dictionary.INTRINSIC_MAPPINGS`.

Then, add a ``string_fields`` attribute: the list of all keys whose
values are simple text values. For instance, to
`translate Python PKG-INFO`_, it is:
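The actual field list lives in :file:`python.py` (linked below); as a
self-contained sketch of the idea, with field names and crosswalk entries that
are illustrative rather than the real ones:

```python
# Illustrative sketch only: a declarative list of simple text fields,
# plus a crosswalk mapping them to CodeMeta property names.
CROSSWALK = {"Name": "name", "Summary": "description", "Home-page": "url"}
string_fields = ["Name", "Summary", "Home-page"]


def translate_dict(content_dict):
    # Copy each known simple field under its CodeMeta property name,
    # skipping fields absent from the input document.
    return {
        CROSSWALK[k]: content_dict[k] for k in string_fields if k in content_dict
    }


print(translate_dict({"Name": "example", "Summary": "An example.", "License": "MIT"}))
```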
These values will be automatically added to the above list of
supported terms.
.. _translate Python PKG-INFO: https://gitlab.softwareheritage.org/swh/devel/swh-indexer/-/blob/master/swh/indexer/metadata_dictionary/python.py
Last step to get your code working: add a ``translate`` method that will
take a single byte string as argument, turn it into a Python dictionary
whose keys are the ones of the input document, and pass it to
``_translate_dict``.
For instance, if the input document is in JSON, it can be as simple as:
.. code-block:: python

   def translate(self, raw_content):
       content_dict = json.loads(raw_content)  # str to dict
       return self._translate_dict(content_dict)  # convert to CodeMeta
``_translate_dict`` will do the heavy work: for each key in ``string_fields``,
it reads the corresponding value in ``content_dict`` and adds it to a CodeMeta
dictionary under the property name given by the crosswalk table.
One last thing to run your code: add it to the list in
:file:`swh-indexer/swh/indexer/metadata_dictionary/__init__.py`, so the rest of
the code is aware of it.
Now, you can run it:
If it works, well done!
You can now improve your translation code further, by adding methods that
will do more advanced conversion. For example, if there is a field named
``license`` containing an SPDX identifier, you must convert it to a URI,
like this:
.. code-block:: python

   def normalize_license(self, s):
       if isinstance(s, str):
           return rdflib.URIRef("https://spdx.org/licenses/" + s)
This method will automatically get called by ``_translate_dict`` when it
finds a ``license`` field in ``content_dict``.
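That dispatch can be sketched as a ``getattr`` lookup. The snippet below is a
dependency-free approximation, returning a plain ``{"@id": ...}`` dictionary
instead of an ``rdflib.URIRef``; it mimics the behaviour described above but is
not the actual swh implementation.

```python
# Hypothetical sketch: _translate_dict-style dispatch that calls a
# normalize_<field> hook when the mapping class defines one.
class LicenseNormalizer:
    def normalize_license(self, s):
        if isinstance(s, str):
            return {"@id": "https://spdx.org/licenses/" + s}

    def translate_field(self, key, value):
        # Call normalize_<key> if such a method exists, else keep the value.
        hook = getattr(self, "normalize_" + key, None)
        return hook(value) if hook else value


n = LicenseNormalizer()
print(n.translate_field("license", "GPL-3.0-or-later"))
print(n.translate_field("name", "example"))
```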
Adding support for additional ecosystem-specific extrinsic metadata
-------------------------------------------------------------------
[this section is a work in progress]