MERKLE_DAG = swh-merkle-dag.pdf swh-merkle-dag.svg
BUILD_TARGETS =
BUILD_TARGETS += $(MERKLE_DAG)
all: $(BUILD_TARGETS)
%.svg: %.dia
inkscape -l $@ $<
%.pdf: %.dia
inkscape -A $@ $<
clean:
-rm -f $(BUILD_TARGETS)
File deleted
.. _swh-model:
Software Heritage - Data model
==============================
Implementation of the :ref:`data-model` to archive source code artifacts.
.. toctree::
:caption: Overview:
:titlesonly:
data-model
persistent-identifiers
cli
/apidoc/swh.model
.. _persistent-identifiers:
================================================
SoftWare Heritage persistent IDentifiers (SWHID)
================================================
**version 1.2**
Description
===========
You can point to objects present in the Software Heritage archive by the means
of **SoftWare Heritage persistent IDentifiers**, or **SWHID** for short, that
are guaranteed to remain stable (persistent) over time. Their syntax, meaning,
and usage are described below. Note that they are identifiers and not URLs, even
though a URL-based resolver for Software Heritage persistent identifiers is
also provided.
A SWHID can point to any software artifact (or "object") available in the
Software Heritage archive. Objects come in different types, and most notably:
* contents
* directories
* revisions
* releases
* snapshots
Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its SWHID as described below. These intrinsic identifiers are
strong cryptographic hashes computed on the entire set of object properties, so
that the identified objects form a `Merkle structure
<https://en.wikipedia.org/wiki/Merkle_tree>`_.
See the :ref:`Software Heritage data model <data-model>` for an overview of
object types and how they are linked together. See
:py:mod:`swh.model.identifiers` for details on how SWHIDs are computed.
Syntax
------
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point of the
grammar:
.. code-block:: bnf
<identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
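The grammar above is small enough to be checked with a regular expression. The following is a minimal sketch, not the implementation found in :py:mod:`swh.model.identifiers`; the names ``CORE_SWHID_RE``, ``CoreSWHID`` and ``parse_core_swhid`` are hypothetical and introduced here only for illustration.

```python
import re
from typing import NamedTuple

# Hypothetical names; the pattern mirrors the <identifier> rule of the grammar.
CORE_SWHID_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>snp|rel|rev|dir|cnt):(?P<id>[0-9a-f]{40})"
)


class CoreSWHID(NamedTuple):
    scheme_version: int
    object_type: str
    object_id: str


def parse_core_swhid(text: str) -> CoreSWHID:
    """Parse a SWHID without qualifiers; raise ValueError if it is malformed."""
    match = CORE_SWHID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid SWHID: {text!r}")
    return CoreSWHID(int(match["version"]), match["type"], match["id"])


print(parse_core_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"))
```

Using ``fullmatch`` rejects trailing characters, and the character class only accepts lowercase hexadecimal digits, as the grammar requires.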
Semantics
---------
``:`` is used as separator between the logical parts of SWHIDs. The ``swh``
prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of
SWHIDs that conform to previous versions of the scheme).
A SWHID points to a single object, whose type is explicitly captured by
``<object_type>``:
* ``snp`` to **snapshots**,
* ``rel`` to **releases**,
* ``rev`` to **revisions**,
* ``dir`` to **directories**,
* ``cnt`` to **contents**.
The actual object pointed to is identified by the intrinsic identifier
``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
computed on the content and metadata of the object itself, as follows:
* for **snapshots**, intrinsic identifiers are computed as per
:py:func:`swh.model.identifiers.snapshot_identifier`
* for **releases**, as per
:py:func:`swh.model.identifiers.release_identifier`
* for **revisions**, as per
:py:func:`swh.model.identifiers.revision_identifier`
* for **directories**, as per
:py:func:`swh.model.identifiers.directory_identifier`
* for **contents**, the intrinsic identifier is the ``sha1_git`` hash among the
multiple hashes returned by
:py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
quotes), a space, the length of the content as decimal digits, a NUL byte,
and the actual content of the file.
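The content case can be reproduced with the standard library alone. This is a minimal sketch of the byte sequence described in the last item, using only ``hashlib``; the function names ``content_sha1_git`` and ``content_swhid`` are hypothetical and are not part of ``swh.model``.

```python
import hashlib


def content_sha1_git(data: bytes) -> str:
    """Return the hex-encoded sha1_git of ``data``."""
    # "blob", a space, the decimal length, a NUL byte, then the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build the SWHID of a content object from its raw bytes."""
    return "swh:1:cnt:" + content_sha1_git(data)


# The empty file hashes to the well-known Git empty blob identifier.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The length in the header is the number of bytes, not of characters, so text must be encoded before it is passed in.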
Git compatibility
~~~~~~~~~~~~~~~~~
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
A SWHID for a content object will correspond (in its ``<object_id>`` part) to a
Git blob identifier of any file with the same content; a SWHID for a revision
will correspond to the Git commit identifier for the same revision, etc. This
is not the case for snapshot identifiers, as Git does not have a corresponding
object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
Examples
--------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Contextual information
======================
The SWHIDs described above are *intrinsic identifiers*: they are computed
from the designated object itself. It is often useful to also provide
*contextual information* about a particular occurrence of the object, such as
the origin where the object was found. To this end, SWHIDs can be
coupled with **qualifiers** that capture such *contextual information*.
Qualifiers come in different kinds:
* origin
* visit
* anchor
* path
* lines
Syntax
------
The full syntax to complement SWHIDs with contextual information is given by
the ``<identifier_with_context>`` entry point of the grammar:
.. code-block:: bnf
<identifier_with_context> ::= <identifier> [ <qualifierlist> ]
<qualifierlist> ::= <qualifier> [ <qualifierlist> ]
<qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> | <lines_ctxt>
<origin_ctxt> ::= ";" "origin" "=" <url>
<visit_ctxt> ::= ";" "visit" "=" <identifier>
<anchor_ctxt> ::= ";" "anchor" "=" <identifier>
<path_ctxt> ::= ";" "path" "=" <path_absolute_escaped>
<lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>]
<line_number> ::= <dec_digit> +
<url> ::= (* RFC 3986 compliant URLs *)
<path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *)
Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of
RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all
occurrences of ``;`` and ``%`` must be percent-encoded (as ``%3B`` and ``%25``
respectively).
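The escaping rule for paths can be sketched in a few lines. This is an illustration under the assumption that the input is already a valid absolute path and that only ``;`` and ``%`` need escaping; ``escape_path`` is a hypothetical helper name, not an existing ``swh.model`` function.

```python
def escape_path(path: str) -> str:
    """Percent-encode ``%`` and ``;`` so the path can be used as a qualifier."""
    # Escape "%" first: otherwise the "%" introduced by "%3B" would itself
    # be escaped a second time.
    return path.replace("%", "%25").replace(";", "%3B")


print(escape_path("/support/x;url=foo/"))  # /support/x%3Burl=foo/
```

The order of the two replacements matters; swapping them would turn ``;`` into ``%253B``.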
Semantics
---------
``;`` is used as separator between SWHIDs and the optional contextual
information qualifiers. Each contextual information qualifier is specified as a
key/value pair, using ``=`` as a separator.
The following pieces of contextual information are supported:
* **origin** : the *software origin* where an object has been found or observed
in the wild, as a URI;
* **visit** : persistent identifier of a *snapshot* corresponding to a specific
*visit* of a repository containing the designated object;
* **anchor** : a *designated node* in the Merkle DAG relative to which a *path
to the object* is specified, as a persistent identifier of a directory, a
revision, a release or a snapshot;
* **path** : the *absolute file path*, from the *root directory* associated to
the *anchor node*, to the object; when the anchor denotes a directory or a
revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if
such a reference is missing;
* **lines** : *line number(s)* of interest, usually within a content object.
We recommend equipping identifiers meant to be shared with as many qualifiers as
possible. While qualifiers may be listed in any order, it is good practice to
present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``,
``path``, ``lines``. Redundant information should be omitted: for example, if
the *visit* is present, and the *path* is relative to the snapshot indicated
there, then the *anchor* qualifier is superfluous.
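The key/value structure and the recommended ordering can be illustrated as follows. This is a minimal sketch, not a full validator: it assumes qualifier values contain no literal ``;`` (paths escape it, but origin URLs containing ``;`` are not handled), and the names ``split_qualified_swhid`` and ``join_qualified_swhid`` are hypothetical.

```python
from typing import Dict, Tuple

# Recommended presentation order of qualifiers, as listed above.
QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")


def split_qualified_swhid(text: str) -> Tuple[str, Dict[str, str]]:
    """Split a qualified SWHID into its core identifier and its qualifiers."""
    core, *parts = text.split(";")
    qualifiers: Dict[str, str] = {}
    for part in parts:
        # Split on the first "=" only; values such as paths may contain "=".
        key, sep, value = part.partition("=")
        if not sep or key not in QUALIFIER_ORDER:
            raise ValueError(f"invalid qualifier: {part!r}")
        qualifiers[key] = value
    return core, qualifiers


def join_qualified_swhid(core: str, qualifiers: Dict[str, str]) -> str:
    """Serialize a SWHID with its qualifiers in the recommended order."""
    unknown = set(qualifiers) - set(QUALIFIER_ORDER)
    if unknown:
        raise ValueError(f"unknown qualifiers: {sorted(unknown)}")
    ordered = [
        f";{key}={qualifiers[key]}" for key in QUALIFIER_ORDER if key in qualifiers
    ]
    return core + "".join(ordered)


core, quals = split_qualified_swhid(
    "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15"
    ";path=/Examples/SimpleFarm/simplefarm.ml"
)
print(join_qualified_swhid(core, quals))
```

Since qualifiers may be listed in any order, parsing accepts them as given, while serialization always emits ``origin``, ``visit``, ``anchor``, ``path``, ``lines``.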
Example
-------
The following `fully qualified SWHID
<https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
denotes the lines 9 to 15 of a file content that can be found at absolute path
``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision
``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the
snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the
origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
.. code-block:: url
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
path=/Examples/SimpleFarm/simplefarm.ml;
lines=9-15
And this is an example of `a fully qualified SWHID with a percent-escaped file
path
<https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
.. code-block:: url
swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;
origin=https://github.com/web-platform-tests/wpt;
visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;
anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;
path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
Resolution
==========
Dedicated resolvers
-------------------
SWHIDs can be resolved using the Software Heritage Web application (see
:py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a
SWHID and will lead to the browsing page of the corresponding object, like
this: ``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`.
Examples:
* `<https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://archive.softwareheritage.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
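Both kinds of resolver URL follow directly from the patterns given above. The sketch below only builds the URLs and performs no network request; ``ARCHIVE_URL``, ``browse_url`` and ``resolve_api_url`` are hypothetical names, and the trailing slash on the API URL follows the ``/api/1/resolve/(swh_id)/`` pattern rather than the example links.

```python
ARCHIVE_URL = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the browsing page reached through the root endpoint."""
    return f"{ARCHIVE_URL}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the dedicated resolve endpoint of the HTTP API."""
    return f"{ARCHIVE_URL}/api/1/resolve/{swhid}/"


swhid = "swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d"
print(browse_url(swhid))
print(resolve_api_url(swhid))
```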
External resolvers
------------------
The following **independent resolvers** support resolution of SWHIDs:
* `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
<https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000655>`_).
* `Name-to-Thing (N2T) <https://n2t.net/>`_
Examples:
* `<https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
Note that resolution via Identifiers.org does not support contextual
information, due to `syntactic incompatibilities
<http://identifiers.org/documentation#custom_requests>`_.
References
==========
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for
Digital Objects: the Case of Software Source Code Preservation
<https://hal.archives-ouvertes.fr/hal-01865790v4>`_. In Proceedings of `iPRES
2018 <https://ipres2018.org/>`_: 15th International Conference on Digital
Preservation, Boston, MA, USA, September 2018, 9 pages.
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Referencing Source
Code Artifacts: a Separate Concern in Software Citation
<https://arxiv.org/abs/2001.08647>`_. In Computing in Science and
Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615,
IEEE. March 2020.
[mypy]
namespace_packages = True
warn_unused_ignores = True
# 3rd party libraries without stubs (yet)
[mypy-attrs_strict.*] # a bit sad, but...
ignore_missing_imports = True
[mypy-django.*] # false positive, only used by hypothesis' extras
ignore_missing_imports = True
[mypy-dulwich.*]
ignore_missing_imports = True
[mypy-iso8601.*]
ignore_missing_imports = True
[mypy-pkg_resources.*]
ignore_missing_imports = True
[mypy-pyblake2.*]
ignore_missing_imports = True
[mypy-pytest.*]
ignore_missing_imports = True
[pytest]
addopts = --doctest-modules
norecursedirs = docs
[flake8]
# E203: whitespaces before ':' <https://github.com/psf/black/issues/315>
# E231: missing whitespace after ','
# W503: line break before binary operator <https://github.com/psf/black/issues/52>
ignore = E203,E231,W503
max-line-length = 88
[egg_info]
tag_build =
tag_date = 0
Metadata-Version: 2.1
Name: swh.model
Version: 0.0.67
Summary: Software Heritage data model
Home-page: https://forge.softwareheritage.org/diffusion/DMOD/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-model
Description: swh-model
=========
Implementation of the Data model of the Software Heritage project, used to
archive source code artifacts.
This module defines the notion of Persistent Identifier (PID) and provides
tools to compute them:
```sh
$ swh-identify fork.c kmod.c sched/deadline.c
swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3 fork.c
swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2 kmod.c
swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82 sched/deadline.c
$ swh-identify --no-filename /usr/src/linux/kernel/
swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
```
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: cli
Provides-Extra: testing
MANIFEST.in
Makefile
README.md
pyproject.toml
requirements-cli.txt
requirements-test.txt
requirements.txt
setup.cfg
setup.py
version.txt
swh/__init__.py
swh.model.egg-info/PKG-INFO
swh.model.egg-info/SOURCES.txt
swh.model.egg-info/dependency_links.txt
swh.model.egg-info/entry_points.txt
swh.model.egg-info/requires.txt
swh.model.egg-info/top_level.txt
swh/model/__init__.py
swh/model/cli.py
swh/model/exceptions.py
swh/model/from_disk.py
swh/model/hashutil.py
swh/model/hypothesis_strategies.py
swh/model/identifiers.py
swh/model/merkle.py
swh/model/model.py
swh/model/py.typed
swh/model/toposort.py
swh/model/validators.py
swh/model/fields/__init__.py
swh/model/fields/compound.py
swh/model/fields/hashes.py
swh/model/fields/simple.py
swh/model/tests/__init__.py
swh/model/tests/generate_testdata.py
swh/model/tests/generate_testdata_from_disk.py
swh/model/tests/test_cli.py
swh/model/tests/test_from_disk.py
swh/model/tests/test_generate_testdata.py
swh/model/tests/test_hashutil.py
swh/model/tests/test_hypothesis_strategies.py
swh/model/tests/test_identifiers.py
swh/model/tests/test_merkle.py
swh/model/tests/test_model.py
swh/model/tests/test_toposort.py
swh/model/tests/test_validators.py
swh/model/tests/data/dir-folders/sample-folder.tgz
swh/model/tests/data/repos/sample-repo.tgz
swh/model/tests/fields/__init__.py
swh/model/tests/fields/test_compound.py
swh/model/tests/fields/test_hashes.py
swh/model/tests/fields/test_simple.py
\ No newline at end of file
[console_scripts]
swh-identify=swh.model.cli:identify
[swh.cli.subcommands]
identify=swh.model.cli:identify
\ No newline at end of file
vcversioner
attrs
attrs_strict
hypothesis
python-dateutil
iso8601
[:python_version < "3.6"]
pyblake2
[cli]
swh.core
Click
dulwich
[testing]
Click
dulwich
pytest
pytz
swh
[tox]
envlist=black,flake8,mypy,py3
[testenv]
extras =
testing
deps =
pytest-cov
commands =
pytest --cov={envsitepackagesdir}/swh/model \
{envsitepackagesdir}/swh/model \
--cov-branch {posargs}
[testenv:black]
skip_install = true
deps =
black
commands =
{envpython} -m black --check swh
[testenv:flake8]
skip_install = true
deps =
flake8
commands =
{envpython} -m flake8
[testenv:mypy]
extras =
testing
deps =
mypy
commands =
mypy swh
v0.0.67-0-gd52549f
\ No newline at end of file