Commits on Source: 239 (139 additional commits have been omitted to prevent performance issues).
Showing 419 additions and 351 deletions.
# Changes here will be overwritten by Copier
_commit: v0.3.5
_src_path: https://gitlab.softwareheritage.org/swh/devel/swh-py-template.git
description: Software Heritage indexer
distribution_name: swh-indexer
have_cli: true
have_workers: true
package_root: swh/indexer
project_name: swh.indexer
python_minimal_version: '3.9'
readme_format: rst
# python: Reformat code with black
5aa97ccd6ce29d6f66eb093c5d06e9030d7449fd
0f847f6119195649fe4108b776b9244940ebdb46
2e9f1d3e896062ae6b3cd99dc1a5d4148beebbf7
*.egg-info/
*.pyc
*.sw?
*~
/.coverage
/.coverage.*
.coverage
.eggs/
.hypothesis
.mypy_cache
.tox
__pycache__
*.egg-info/
build/
dist/
version.txt
/sql/createdb-stamp
/sql/filldb-stamp
.tox/
.hypothesis/
.mypy_cache/
.vscode/
\ No newline at end of file
# these are symlinks created by a hook in swh-docs' main sphinx conf.py
docs/README.rst
docs/README.md
# this should be a symlink for people who want to build the sphinx doc
# without using tox, generally created by the swh-env/bin/update script
docs/Makefile.sphinx
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.4.0
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: check-json
- id: check-yaml
- repo: https://gitlab.com/pycqa/flake8
rev: 3.8.3
- repo: https://github.com/python/black
rev: 25.9.0
hooks:
- id: black
- repo: https://github.com/PyCQA/isort
rev: 6.1.0
hooks:
- id: isort
- repo: https://github.com/pycqa/flake8
rev: 7.3.0
hooks:
- id: flake8
additional_dependencies: [flake8-bugbear==24.12.12, flake8-pyproject]
- repo: https://github.com/codespell-project/codespell
rev: v1.16.0
rev: v2.4.1
hooks:
- id: codespell
exclude: ^(swh/indexer/data/codemeta/crosswalk.csv)$
name: Check source code spelling
args: [-L assertIn]
exclude: ^(swh/indexer/data/)
stages: [pre-commit]
- id: codespell
name: Check commit message spelling
stages: [commit-msg]
- repo: local
hooks:
- id: mypy
@@ -25,25 +43,13 @@ repos:
pass_filenames: false
language: system
types: [python]
- repo: https://github.com/PyCQA/isort
rev: 5.5.2
hooks:
- id: isort
- repo: https://github.com/python/black
rev: 19.10b0
hooks:
- id: black
# unfortunately, we are far from being able to enable this...
# - repo: https://github.com/PyCQA/pydocstyle.git
# rev: 4.0.0
# hooks:
# - id: pydocstyle
# name: pydocstyle
# description: pydocstyle is a static analysis tool for checking compliance with Python docstring conventions.
# entry: pydocstyle --convention=google
# language: python
# types: [python]
- id: twine-check
name: twine check
description: call twine check when pushing an annotated release tag
entry: bash -c "ref=$(git describe) &&
[[ $ref =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] &&
(python3 -m build --sdist && twine check $(ls -t dist/* | head -1)) || true"
pass_filenames: false
stages: [pre-push]
language: python
additional_dependencies: [twine, build]
@@ -6,7 +6,7 @@ In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socio-economic status,
and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
......
Kumar Shivendu
Siddharth Ravikumar
Thibault Allançon
Satvik Vemuganti
include README.md
include Makefile
include requirements*.txt
include version.txt
include conftest.py
recursive-include sql *
recursive-include swh/indexer/sql *.sql
recursive-include swh/indexer/data *
recursive-include swh py.typed
TESTFLAGS=--hypothesis-profile=fast
TESTFLAGS += --hypothesis-profile=fast
swh-indexer
============
Software Heritage - Indexer
===========================
Tools to compute multiple indexes on SWH's raw contents:
- content:
- mimetype
- ctags
- language
- fossology-license
- metadata
- revision:
- metadata
- origin:
- metadata (intrinsic, using the content indexer; and extrinsic)
An indexer is in charge of:
- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db
There are multiple indexers working on different object types:
- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers
Indexation procedure:
- receive a batch of ids
- retrieve the associated data, depending on the object type
- compute an index for each object
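That procedure can be sketched as a simple batch loop; the callables below (`fetch_data`, `compute_index`, `store_results`) are hypothetical stand-ins for the storage lookups and per-indexer logic, not the actual swh-indexer API:

```python
# Sketch of the batch indexing procedure described above.

def run_indexer(ids, fetch_data, compute_index, store_results):
    """Receive a batch of ids, retrieve data, compute and store indexes."""
    results = []
    for id_ in ids:
        data = fetch_data(id_)  # retrieve the associated data
        if data is None:        # unknown object: skip it
            continue
        results.append(compute_index(id_, data))  # compute the index
    store_results(results)      # persist in the indexer storage
    return results
```

Real indexers plug their own retrieval (object storage for contents, graph storage for directories and origins) and computation into this kind of loop.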
@@ -32,18 +37,13 @@ Current content indexers:
- mimetype (queue swh_indexer_content_mimetype): detect the encoding
and mimetype
- language (queue swh_indexer_content_language): detect the
programming language
- ctags (queue swh_indexer_content_ctags): compute tags information
- fossology-license (queue swh_indexer_fossology_license): compute the
license
- metadata: translate file into translated_metadata dict
- metadata: translate files from ecosystem-specific formats to JSON-LD
(using schema.org/CodeMeta vocabulary)
Current revision indexers:
Current origin indexers:
- metadata: detects files containing metadata and retrieves translated_metadata
in content_metadata table in storage or run content indexer to translate
files.
- metadata: translate files from ecosystem-specific formats to JSON-LD
(using schema.org/CodeMeta and ForgeFed vocabularies)
# Copyright (C) 2020 The Software Heritage developers
# Copyright (C) 2020-2025 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from hypothesis import settings
import pytest
# define tests profile. Full documentation is at:
# https://hypothesis.readthedocs.io/en/latest/settings.html#settings-profiles
@@ -18,14 +17,6 @@ collect_ignore = ["swh/indexer/storage/api/wsgi.py"]
# we use the various swh fixtures
pytest_plugins = [
"swh.scheduler.pytest_plugin",
"swh.journal.pytest_plugin",
"swh.storage.pytest_plugin",
"swh.core.db.pytest_plugin",
]
@pytest.fixture(scope="session")
def swh_scheduler_celery_includes(swh_scheduler_celery_includes):
return swh_scheduler_celery_includes + [
"swh.indexer.tasks",
]
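The hypothesis settings profiles mentioned in the conftest comment above are typically registered along these lines (a sketch; the profile name and example count here are assumptions, not the actual values used by swh):

```python
from hypothesis import settings

# Register a "fast" profile with fewer examples for quick local runs,
# then load it; tests can also select it with --hypothesis-profile=fast.
settings.register_profile("fast", max_examples=10)
settings.load_profile("fast")
```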
include ../../swh-docs/Makefile.sphinx
-include Makefile.local
include Makefile.sphinx
../README.md
\ No newline at end of file
Hacking on swh-indexer
======================
This tutorial will guide you through hacking on swh-indexer.
If you do not have a local copy of the Software Heritage archive, first follow
the `getting started tutorial
<https://docs.softwareheritage.org/devel/getting-started.html>`_.
Configuration files
-------------------
You will need the following YAML configuration files to run the swh-indexer
commands:
- Orchestrator at
``~/.config/swh/indexer/orchestrator.yml``
.. code-block:: yaml
indexers:
mimetype:
check_presence: false
batch_size: 100
- Orchestrator-text at
``~/.config/swh/indexer/orchestrator-text.yml``
.. code-block:: yaml
indexers:
# language:
# batch_size: 10
# check_presence: false
fossology_license:
batch_size: 10
check_presence: false
# ctags:
# batch_size: 2
# check_presence: false
- Mimetype indexer at
``~/.config/swh/indexer/mimetype.yml``
.. code-block:: yaml
# storage to read sha1's metadata (path)
# storage:
# cls: local
# db: "service=swh-dev"
# objstorage:
# cls: pathslicing
# root: /home/storage/swh-storage/
# slicing: 0:1/1:5
storage:
cls: remote
url: http://localhost:5002/
indexer_storage:
cls: remote
args:
url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your need
# locally: this needs to match your storage's setup
objstorage:
cls: pathslicing
slicing: 0:1/1:5
root: /home/storage/swh-storage/
destination_task: swh.indexer.tasks.SWHOrchestratorTextContentsTask
rescheduling_task: swh.indexer.tasks.SWHContentMimetypeTask
- Fossology indexer at
``~/.config/swh/indexer/fossology_license.yml``
.. code-block:: yaml
# storage to read sha1's metadata (path)
# storage:
# cls: local
# db: "service=swh-dev"
# objstorage:
# cls: pathslicing
# root: /home/storage/swh-storage/
# slicing: 0:1/1:5
storage:
cls: remote
url: http://localhost:5002/
indexer_storage:
cls: remote
args:
url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your need
# locally: this needs to match your storage's setup
objstorage:
cls: pathslicing
slicing: 0:1/1:5
root: /home/storage/swh-storage/
workdir: /tmp/swh/worker.indexer/license/
tools:
name: 'nomos'
version: '3.1.0rc2-31-ga2cbb8c'
configuration:
command_line: 'nomossa <filepath>'
- Worker at
``~/.config/swh/worker.yml``
.. code-block:: yaml
task_broker: amqp://guest@localhost//
task_modules:
- swh.loader.svn.tasks
- swh.loader.tar.tasks
- swh.loader.git.tasks
- swh.storage.archiver.tasks
- swh.indexer.tasks
- swh.indexer.orchestrator
task_queues:
- swh_loader_svn
- swh_loader_tar
- swh_reader_git_to_azure_archive
- swh_storage_archive_worker_to_backend
- swh_indexer_orchestrator_content_all
- swh_indexer_orchestrator_content_text
- swh_indexer_content_mimetype
- swh_indexer_content_language
- swh_indexer_content_ctags
- swh_indexer_content_fossology_license
- swh_loader_svn_mount_and_load
- swh_loader_git_express
- swh_loader_git_archive
- swh_loader_svn_archive
task_soft_time_limit: 0
Database
--------
swh-indexer uses a database to store the indexed content. The default
database is expected to be called swh-indexer-dev.
Add ``swh-dev`` and ``swh-indexer-dev`` entries to
the ``~/.pg_service.conf`` and ``~/.pgpass`` files, which are PostgreSQL's
client configuration files.
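For example, the ``~/.pg_service.conf`` entries could look like this (values are illustrative; adapt host, port, and user to your local setup, and add matching ``host:port:dbname:user:password`` lines to ``~/.pgpass``):

```ini
[swh-dev]
host=localhost
port=5432
dbname=swh-dev
user=postgres

[swh-indexer-dev]
host=localhost
port=5432
dbname=swh-indexer-dev
user=postgres
```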
Add data to local DB
--------------------
From within the ``swh-environment``, run the following command::
make rebuild-testdata
and fetch some real data to work with, using::
python3 -m swh.loader.git.updater --origin-url <github url>
Then you can list all content files using this script::
#!/usr/bin/env bash
psql service=swh-dev -c "copy (select sha1 from content) to stdout" | sed -e 's/^\\x//g'
Run the indexers
-----------------
Use the list of contents to feed the indexers with the
following command::
./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
Activate the workers
--------------------
To have workers consume messages from the different RabbitMQ queues
(RabbitMQ should already be installed through the dependencies),
run the following command in a dedicated terminal::
python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
--pool=prefork \
--concurrency=1 \
-Ofair \
--loglevel=info \
--without-gossip \
--without-mingle \
--without-heartbeat 2>&1
With this command, the worker will consume messages from RabbitMQ using the
worker configuration file.
Note: the fossology_license indexer requires the fossology-nomossa package,
which is in our `public debian repository
<https://wiki.softwareheritage.org/index.php?title=Debian_packaging#Package_repository>`_.
tasks-metadata-indexers.svg
*.svg
@@ -2,10 +2,16 @@
UML_DIAGS_SRC = $(wildcard *.uml)
UML_DIAGS = $(patsubst %.uml,%.svg,$(UML_DIAGS_SRC))
all: $(UML_DIAGS)
DOT_DIAGS_SRC = $(wildcard *.dot)
DOT_DIAGS = $(patsubst %.dot,%.svg,$(DOT_DIAGS_SRC))
all: $(UML_DIAGS) $(DOT_DIAGS)
%.svg: %.uml
DISPLAY="" plantuml -tsvg $<
%.svg: %.dot
dot $< -T svg -o $@
clean:
-rm -f $(DEP_GRAPHS) $(UML_DIAGS)
-rm -f $(DEP_GRAPHS) $(UML_DIAGS) $(DOT_DIAGS)
digraph metadata_flow {
subgraph cluster_forges {
style=invis;
origin_vcs [label="Version Control Systems\n(Git, SVN, ...)"];
origin_pm [label="Package Managers\n(NPM, PyPI, Debian, ...)"];
}
subgraph internet {
rank=same;
deposit_client [label="Deposit Clients\n(HAL, IPOL, eLife, Intel, ...)"];
registries [label="Registries\n(Wikidata, ...)"];
}
subgraph cluster_SWH {
label="Software Heritage";
labeljust="r";
labelloc="b";
loader_vcs [label="VCS loader", shape="box"];
loader_pm [label="PM loader", shape="box"];
deposit_server [label="Deposit server", shape="box"];
indexer_extr [label="extrinsic metadata indexer\n(translate to Codemeta)", shape="box"];
indexer_intr [label="intrinsic metadata indexer\n(translate to Codemeta)", shape="box"];
registry_fetcher[label="?", style="dashed", shape="box"];
storage [label="\nMain Storage\n(swh-storage and\nswh-objstorage)", shape=cylinder];
remd_storage [label="\nRaw Extrinsic\nMetadata Storage", shape=cylinder];
indexed_storage [label="\nIndexed\nMetadata Storage\n(search, idx-storage)", shape=cylinder];
webapp [label="Web Interface", shape="box"];
}
subgraph users {
browser [label="Web Browser", shape="box"]
}
origin_vcs -> loader_vcs [label="pull"];
loader_vcs -> storage;
origin_pm -> loader_pm [label="pull"]
loader_pm -> {storage, remd_storage};
deposit_client -> deposit_server [label="push\n(SWORD + Codemeta)"];
deposit_server -> {storage, remd_storage};
registries -> registry_fetcher -> remd_storage [style="dashed"];
storage -> indexer_intr [label="all kinds of\nmetadata formats"];
indexer_intr -> indexed_storage [label="only Codemeta"];
remd_storage -> indexer_extr [label="all kinds of\nmetadata formats"];
indexer_extr-> indexed_storage;
{storage, remd_storage, indexed_storage} -> webapp;
webapp -> browser [label="search, display,\nBibTeX export,\ndownload, ..."];
}
@startuml
participant LOADERS as "Metadata Loaders"
participant STORAGE as "Graph Storage"
participant JOURNAL as "Journal"
participant IDX_REM_META as "REM Indexer"
participant IDX_STORAGE as "Indexer Storage"
activate IDX_STORAGE
activate STORAGE
activate JOURNAL
activate LOADERS
LOADERS->>STORAGE: new REM (Raw Extrinsic Metadata) object\n for Origin http://example.org/repo.git\nor object swh:1:dir:...
STORAGE->>JOURNAL: new REM object
deactivate LOADERS
JOURNAL->>IDX_REM_META: run indexers on REM object
activate IDX_REM_META
IDX_REM_META->>IDX_REM_META: recognize REM object (gitea/github/deposit/...)
IDX_REM_META->>IDX_REM_META: parse REM object
alt If the REM object describes an origin
IDX_REM_META->>IDX_STORAGE: origin_extrinsic_metadata_add(id="http://example.org/repo.git", {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REM_META: ok
end
alt If the REM object describes a directory
IDX_REM_META->>IDX_STORAGE: directory_extrinsic_metadata_add(id="swh:1:dir:...", {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REM_META: ok
end
deactivate IDX_REM_META
@enduml
@startuml
participant LOADERS as "Loaders"
participant STORAGE as "Graph Storage"
participant JOURNAL as "Journal"
participant SCHEDULER as "Scheduler"
participant IDX_ORIG_META as "Origin Metadata Indexer"
participant IDX_ORIG_HEAD as "Origin-Head Indexer"
participant IDX_REV_META as "Revision Metadata Indexer"
participant IDX_DIR_META as "Directory Metadata Indexer"
participant IDX_CONT_META as "Content Metadata Indexer"
participant IDX_ORIG_META as "Origin Metadata Indexer"
participant IDX_STORAGE as "Indexer Storage"
participant STORAGE as "Graph Storage"
participant OBJ_STORAGE as "Object Storage"
activate OBJ_STORAGE
activate IDX_STORAGE
activate STORAGE
activate JOURNAL
activate SCHEDULER
activate IDX_ORIG_META
activate LOADERS
LOADERS->>JOURNAL: Origin 42 was added/revisited
LOADERS->>STORAGE: Repository content
LOADERS->>STORAGE: Origin http://example.org/repo.git\nwas added/revisited
STORAGE->>JOURNAL: Origin http://example.org/repo.git\nwas added/revisited
deactivate LOADERS
JOURNAL->>SCHEDULER: run indexers on origin 42
JOURNAL->>IDX_ORIG_META: run indexers on origin\nhttp://example.org/repo.git
SCHEDULER->>IDX_ORIG_HEAD: Find HEAD revision of 42
IDX_ORIG_META->>IDX_ORIG_HEAD: Find HEAD revision of\nhttp://example.org/repo.git
activate IDX_ORIG_HEAD
IDX_ORIG_HEAD->>STORAGE: snapshot_get_latest(origin=42)
IDX_ORIG_HEAD->>STORAGE: snapshot_get_latest(origin="http://example.org/repo.git")
STORAGE->>IDX_ORIG_HEAD: branches
IDX_ORIG_HEAD->>SCHEDULER: run Revision Metadata Indexer\non revision 42abcdef\n(head of origin 42)
IDX_ORIG_HEAD->>IDX_ORIG_META: run Revision Metadata Indexer\non revision 42abcdef (head of origin\nhttp://example.org/repo.git)
deactivate IDX_ORIG_HEAD
SCHEDULER->>IDX_REV_META: Index revision 42abcdef\n(head of origin 42)
activate IDX_REV_META
IDX_ORIG_META->>STORAGE: revision_get(sha1=42abcdef)
STORAGE->>IDX_ORIG_META: {id: 42abcdef, message: "Commit message", directory: 456789ab, ...}
IDX_REV_META->>STORAGE: revision_get(sha1=42abcdef)
STORAGE->>IDX_REV_META: {id: 42abcdef, message: "Commit message", directory: 456789ab, ...}
IDX_ORIG_META->>IDX_DIR_META: Index directory 456789ab\n(head of origin http://example.org/repo.git)
activate IDX_DIR_META
IDX_REV_META->>STORAGE: directory_ls(sha1=456789ab)
STORAGE->>IDX_REV_META: [{id: 1234cafe, name: "package.json", type: file, ...}, {id: cafe4321, name: "README", type: file, ...}, ...]
IDX_DIR_META->>STORAGE: directory_ls(sha1=456789ab)
STORAGE->>IDX_DIR_META: [{id: 1234cafe, name: "package.json", type: file, ...}, {id: cafe4321, name: "README", type: file, ...}, ...]
IDX_REV_META->>IDX_REV_META: package.json is a metadata file
IDX_DIR_META->>IDX_DIR_META: package.json is a metadata file
IDX_REV_META->>IDX_STORAGE: content_metadata_get(sha1=1234cafe)
IDX_STORAGE->>IDX_REV_META: none / {author: "Jane Doe", ...}
IDX_DIR_META->>IDX_STORAGE: content_metadata_get(sha1=1234cafe)
IDX_STORAGE->>IDX_DIR_META: none / {author: "Jane Doe", ...}
alt If the storage answered "none"
IDX_REV_META->>IDX_CONT_META: Index file 1234cafe as an NPM metadata file
IDX_DIR_META->>IDX_CONT_META: Index file 1234cafe as an NPM metadata file
activate IDX_CONT_META
IDX_CONT_META->>OBJ_STORAGE: content_get 1234cafe
@@ -60,23 +61,17 @@
IDX_CONT_META->>IDX_STORAGE: content_metadata_add(sha1=1234cafe, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_CONT_META: ok
IDX_CONT_META->>IDX_REV_META: extracted: {author: "Jane Doe", ...}
IDX_CONT_META->>IDX_DIR_META: extracted: {author: "Jane Doe", ...}
deactivate IDX_CONT_META
end
IDX_REV_META->>IDX_STORAGE: revision_metadata_add(sha1=42abcdef, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_REV_META: ok
IDX_REV_META->>SCHEDULER: run Origin Metadata Indexer\non origin 42; the head is 42abcdef
deactivate IDX_REV_META
SCHEDULER->>IDX_ORIG_META: Index origin 42; the head is 42abcdef
activate IDX_ORIG_META
IDX_DIR_META->>IDX_STORAGE: directory_metadata_add(sha1=456789ab, {author: "Jane Doe", ...})
IDX_STORAGE->>IDX_DIR_META: ok
end
IDX_ORIG_META->>IDX_STORAGE: revision_metadata_get(sha1=42abcdef)
IDX_STORAGE->>IDX_ORIG_META: {author: "Jane Doe", ...}
IDX_DIR_META->>IDX_ORIG_META: extracted: {author: "Jane Doe", ...}
deactivate IDX_DIR_META
IDX_ORIG_META->>IDX_STORAGE: origin_metadata_add(id=42, {author: "Jane Doe", ...})
IDX_ORIG_META->>IDX_STORAGE: origin_metadata_add(id="http://example.org/repo.git", {author: "Jane Doe", ...}, from_directory=456789ab)
IDX_STORAGE->>IDX_ORIG_META: ok
deactivate IDX_ORIG_META
......
.. _swh-indexer:
Software Heritage - Indexer
===========================
Tools and workers used to mine the content of the archive and extract derived
information from archive source code artifacts.
.. include:: README.rst
.. toctree::
:maxdepth: 1
:caption: Contents:
README.md
dev-info.rst
metadata-workflow.rst
swhpkg.rst
Reference Documentation
@@ -23,4 +18,12 @@ Reference Documentation
:maxdepth: 2
cli
/apidoc/swh.indexer
.. only:: standalone_package_doc
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@@ -12,13 +12,22 @@ multiple indexers, which coordinate with each other and save their results
at each step in the indexer storage.
Indexer architecture
--------------------
^^^^^^^^^^^^^^^^^^^^
.. thumbnail:: images/tasks-metadata-indexers.svg
Sequence diagram
""""""""""""""""
.. thumbnail:: images/tasks-intrinsic-metadata-indexers.svg
Data flow and storage
"""""""""""""""""""""
.. thumbnail:: images/metadata-flow.svg
Origin-Head Indexer
___________________
^^^^^^^^^^^^^^^^^^^
First, the Origin-Head indexer gets called externally, with an origin as
argument (or multiple origins, that are handled sequentially).
@@ -30,49 +39,50 @@ branch of origin is (the "Head branch") and what revision it points to
(the "Head").
Intrinsic metadata for that origin will be extracted from that revision.
It schedules a Revision Metadata Indexer task for that revision, with a
hint that the revision is the Head of that particular origin.
It schedules a Directory Metadata Indexer task for the root directory of
that revision.
Revision and Content Metadata Indexers
______________________________________
Directory and Content Metadata Indexers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These two indexers do the hard part of the work. The Revision Metadata
These two indexers do the hard part of the work. The Directory Metadata
Indexer fetches the root directory associated with a revision, then extracts
the metadata from that directory.
To do so, it lists files in that directory, and looks for known names, such
as `codemeta.json`, `package.json`, or `pom.xml`. If there are any, it
as :file:`codemeta.json`, :file:`package.json`, or :file:`pom.xml`. If there are any, it
runs the Content Metadata Indexer on them, which in turn fetches their
contents and runs them through extraction dictionaries/mappings.
See below for details.
Their results are saved in a database (the indexer storage), associated with
the content and revision hashes.
If it received a hint that this revision is the head of an origin, the
Revision Metadata Indexer then schedules the Origin Metadata Indexer
to run on that origin.
the content and directory hashes.
Origin Metadata Indexer
_______________________
^^^^^^^^^^^^^^^^^^^^^^^
The job of this indexer is very simple: it takes an origin identifier and
a revision hash, and copies the metadata of the former to a new table, to
associate it with the latter.
uses the Origin-Head and Directory indexers to get metadata from the head
directory of an origin, and copies that directory metadata to a new table,
to associate it with the origin.
The reason for this is to be able to perform searches on metadata, and
efficiently find out which origins match the pattern.
Running that search on the `revision_metadata` table would require
a reverse lookup from revisions to origins, which is costly.
Running that search on the ``directory_metadata`` table would require
a reverse lookup from directories to origins, which is costly.
Translation from language-specific metadata to CodeMeta
-------------------------------------------------------
Translation from ecosystem-specific metadata to CodeMeta
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Intrinsic metadata are extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_.
Intrinsic metadata is extracted from files provided with a project's source
code, and translated using `CodeMeta`_'s `crosswalk table`_, which is vendored
in :file:`swh/indexer/data/codemeta/codemeta.csv`.
Ecosystems not yet included in Codemeta's crosswalk have their own
:file:`swh/indexer/data/*.csv` file, with one row for each CodeMeta property,
even when not supported by the ecosystem.
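In essence, each crosswalk row maps an ecosystem-specific key to a CodeMeta property, so the core of the translation is a key-renaming pass. A simplified sketch follows (the toy crosswalk and function below are illustrative only; the real code in ``swh/indexer/metadata_dictionary/`` loads the CSV files above and also handles nested values and normalization):

```python
# Toy crosswalk mapping a few ecosystem-specific keys to CodeMeta
# properties; real crosswalks are loaded from CSV files.
CROSSWALK = {
    "name": "name",
    "description": "description",
    "license": "license",
}

def translate(doc: dict) -> dict:
    """Rename known keys to CodeMeta properties, dropping unknown ones."""
    out = {"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}
    for key, value in doc.items():
        if key in CROSSWALK:
            out[CROSSWALK[key]] = value
    return out
```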
All input formats supported so far are straightforward dictionaries (e.g. JSON)
or can be accessed as such (e.g. XML); and the first part of the translation is
@@ -92,52 +102,125 @@ This normalization makes up for most of the code of
.. _CSV file: https://github.com/codemeta/codemeta/blob/master/crosswalk.csv
Extrinsic metadata
------------------
Indexer architecture
^^^^^^^^^^^^^^^^^^^^
.. thumbnail:: images/tasks-extrinsic-metadata-indexers.svg
The :term:`extrinsic metadata` indexer works very differently from
the :term:`intrinsic metadata` indexers we saw above.
While the latter extract metadata from software artefacts (files and directories)
which are already a core part of the archive, the former extracts such data from
API calls pulled from forges and package managers, or pushed via the
:ref:`SWORD deposit <swh-deposit>`.
In order to preserve the original information verbatim, Software Heritage
itself stores the result of these calls, independently of indexers, in a
dedicated archive, as described in the :ref:`extrinsic-metadata-specification`.
In this section, we assume this information is already present in the archive,
but in the "raw extrinsic metadata" form, which needs to be translated to a common
vocabulary to be useful, as with intrinsic metadata.
The common vocabulary we chose is JSON-LD, with both CodeMeta and
`ForgeFed's vocabulary`_ (including `ActivityStream's vocabulary`_).
.. _ForgeFed's vocabulary: https://forgefed.org/spec/#vocab
.. _ActivityStream's vocabulary: https://www.w3.org/TR/activitystreams-vocabulary/
Instead of the four-step architecture above, the extrinsic-metadata indexer
is standalone: it reads "raw extrinsic metadata" from the :ref:`swh-journal`,
and produces new indexed entries in the database as they come.
The caveat is that, while intrinsic metadata is always unambiguously
authoritative (it is contained in its own origin repository, therefore it was
added by the origin's "owners"), extrinsic metadata can be authored by third
parties.
Support for third-party authorities is currently not implemented for this reason;
so extrinsic metadata is only indexed when provided by the same
forge/package-repository as the origin the metadata is about.
Metadata on non-origin objects (typically directories) is also ignored for
this reason, for now.
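The authority filter described above boils down to comparing the metadata authority with the forge hosting the origin; a minimal sketch, with hypothetical function and argument names rather than the swh-indexer API:

```python
from urllib.parse import urlparse

def same_authority(origin_url: str, authority_url: str) -> bool:
    """True when the metadata authority is the forge hosting the origin,
    i.e. the only case currently indexed."""
    return urlparse(origin_url).netloc == urlparse(authority_url).netloc
```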
Assuming the metadata was provided by such an authority, it is then passed
to metadata mappings, which are identified by a MIME type (or custom format
name) they declare, rather than by filename.
Implementation status
---------------------
Supported intrinsic metadata
----------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following sources of intrinsic metadata are supported:
* `CITATION.cff <https://citation-file-format.github.io/>`_
* CodeMeta's `codemeta.json`_,
* PHP's `composer.json <https://getcomposer.org/doc/04-schema.md>`_
* Maven's `pom.xml`_,
* NPM's `package.json`_,
* NuGet's `.nuspec <https://learn.microsoft.com/en-us/nuget/reference/nuspec>`_
* Pub.Dev's `pubspec.yaml <https://dart.dev/tools/pub/pubspec>`_
* Python's `PKG-INFO`_,
* Ruby's `.gemspec`_
.. _codemeta.json: https://codemeta.github.io/terms/
.. _pom.xml: https://maven.apache.org/pom.html
.. _package.json: https://docs.npmjs.com/files/package.json
.. _package.json: https://docs.npmjs.com/cli/v11/configuring-npm/package-json
.. _PKG-INFO: https://www.python.org/dev/peps/pep-0314/
.. _.gemspec: https://guides.rubygems.org/specification-reference/
Supported extrinsic metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following sources of extrinsic metadata are supported:
* CodeMeta documents sent by clients of :ref:`swh-deposit <swh-deposit>` (`HAL <https://hal.science/>`_, `eLife <https://elifesciences.org/>`_, `IPOL <https://www.ipol.im/>`_, ...)
* Gitea's `"repoGet" API <https://docs.gitea.com/api/1.23/#tag/repository/operation/repoGet>`__
* GitHub's `"repo" API <https://docs.github.com/en/rest/repos/repos#get-a-repository>`__
Supported JSON-LD properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following properties may be found in the output of the metadata translation
(other than the ``codemeta`` mapping, which is the identity function, and
therefore supports all properties):
.. program-output:: python3 -m swh.indexer.cli mapping list-terms --exclude-mapping codemeta --exclude-mapping json-sword-codemeta --exclude-mapping sword-codemeta
   :nostderr:
Tutorials
---------
The rest of this page is made of two tutorials: one to index
:term:`intrinsic metadata` (i.e. from a file in a VCS or in a tarball),
and one to index :term:`extrinsic metadata` (i.e. obtained via external means,
such as GitHub's or GitLab's APIs).
Adding support for additional ecosystem-specific intrinsic metadata
-------------------------------------------------------------------
This section will guide you through adding code to the metadata indexer to
detect and translate new metadata formats.
First, you should start by picking one of the `CodeMeta crosswalks`_.
Then create a new file in :file:`swh-indexer/swh/indexer/metadata_dictionary/`
that will contain your code, and create a new class that inherits from helper
classes, with some documentation about your indexer:
.. code-block:: python

   from .base import DictMapping, SingleFileIntrinsicMapping
   from swh.indexer.codemeta import CROSSWALK_TABLE


   class MyMapping(DictMapping, SingleFileIntrinsicMapping):
       """Dedicated class for ..."""

       name = 'my-mapping'
       filename = b'the-filename'
.. _CodeMeta crosswalks: https://github.com/codemeta/codemeta/tree/master/crosswalks
And reference it from :const:`swh.indexer.metadata_dictionary.INTRINSIC_MAPPINGS`.

Then, add a ``string_fields`` attribute: the list of all keys whose
values are simple text values. For instance, to
`translate Python PKG-INFO`_, it is:
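The actual field list lives in :file:`python.py` (linked below); as a
self-contained sketch of the idea, with field names and crosswalk entries that
are illustrative rather than the real ones:

```python
# Illustrative sketch only: a declarative list of simple text fields,
# plus a crosswalk mapping them to CodeMeta property names.
CROSSWALK = {"Name": "name", "Summary": "description", "Home-page": "url"}
string_fields = ["Name", "Summary", "Home-page"]


def translate_dict(content_dict):
    # Copy each known simple field under its CodeMeta property name,
    # skipping fields absent from the input document.
    return {
        CROSSWALK[k]: content_dict[k] for k in string_fields if k in content_dict
    }


print(translate_dict({"Name": "example", "Summary": "An example.", "License": "MIT"}))
```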
These values will be automatically added to the above list of
supported terms.
.. _translate Python PKG-INFO: https://gitlab.softwareheritage.org/swh/devel/swh-indexer/-/blob/master/swh/indexer/metadata_dictionary/python.py
Last step to get your code working: add a ``translate`` method that will
take a single byte string as argument, turn it into a Python dictionary
whose keys are the ones of the input document, and pass it to
``_translate_dict``.
For instance, if the input document is in JSON, it can be as simple as:
.. code-block:: python

   def translate(self, raw_content):
       content_dict = json.loads(raw_content)  # str to dict
       return self._translate_dict(content_dict)  # convert to CodeMeta
``_translate_dict`` will do the heavy work: for each key in ``string_fields``,
it reads the corresponding value in ``content_dict`` and adds it to a CodeMeta
dictionary under the property name given by the crosswalk table.
One last thing to run your code: add it to the list in
:file:`swh-indexer/swh/indexer/metadata_dictionary/__init__.py`, so the rest of
the code is aware of it.
Now, you can run it:
If it works, well done!
You can now improve your translation code further, by adding methods that
will do more advanced conversion. For example, if there is a field named
``license`` containing an SPDX identifier, you must convert it to a URI,
like this:
.. code-block:: python

   def normalize_license(self, s):
       if isinstance(s, str):
           return rdflib.URIRef("https://spdx.org/licenses/" + s)
This method will automatically get called by ``_translate_dict`` when it
finds a ``license`` field in ``content_dict``.
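That dispatch can be sketched as a ``getattr`` lookup. The snippet below is a
dependency-free approximation, returning a plain ``{"@id": ...}`` dictionary
instead of an ``rdflib.URIRef``; it mimics the behaviour described above but is
not the actual swh implementation.

```python
# Hypothetical sketch: _translate_dict-style dispatch that calls a
# normalize_<field> hook when the mapping class defines one.
class LicenseNormalizer:
    def normalize_license(self, s):
        if isinstance(s, str):
            return {"@id": "https://spdx.org/licenses/" + s}

    def translate_field(self, key, value):
        # Call normalize_<key> if such a method exists, else keep the value.
        hook = getattr(self, "normalize_" + key, None)
        return hook(value) if hook else value


n = LicenseNormalizer()
print(n.translate_field("license", "GPL-3.0-or-later"))
print(n.translate_field("name", "example"))
```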
Adding support for additional ecosystem-specific extrinsic metadata
-------------------------------------------------------------------
[this section is a work in progress]