.. _swh-model:
Software Heritage - Data model
==============================
Implementation of the :ref:`data-model` to archive source code artifacts.
.. toctree::
:caption: Overview:
:titlesonly:
data-model
persistent-identifiers
/apidoc/swh.model
.. _persistent-identifiers:
======================
Persistent identifiers
======================
Description
===========
You can point to objects present in the Software Heritage archive by means
of **persistent identifiers** that are guaranteed to remain stable (persistent)
over time. Their syntax, meaning, and usage are described below. Note that they
are identifiers and not URLs, even though a URL-based resolver for Software
Heritage persistent identifiers is also provided.
A persistent identifier can point to any software artifact (or "object")
available in the Software Heritage archive. Objects come in different types,
most notably:
* contents
* directories
* revisions
* releases
* snapshots
Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its persistent identifier as described below. Object identifiers
are strong cryptographic hashes computed on the entire set of object properties
to form a `Merkle structure <https://en.wikipedia.org/wiki/Merkle_tree>`_.
See :ref:`data-model` for an overview of object types and how they are linked
together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
object identifiers are computed.
Syntax
------
Syntactically, persistent identifiers are generated by the ``<identifier>``
entry point of the grammar:
.. code-block:: bnf
<identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"ori" (* origin *)
| "snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
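The grammar above can be checked mechanically. The following is a minimal
sketch, assuming a regular expression is an acceptable stand-in for the BNF;
the names ``PID_RE`` and ``parse_identifier`` are illustrative and are not part
of :py:mod:`swh.model`.

```python
import re

# Regular expression mirroring the <identifier> rule of the grammar.
# PID_RE and parse_identifier are hypothetical names, not swh.model API.
PID_RE = re.compile(
    r"swh:(?P<scheme_version>1)"
    r":(?P<object_type>ori|snp|rel|rev|dir|cnt)"
    r":(?P<object_id>[0-9a-f]{40})"
)


def parse_identifier(pid):
    """Return the logical parts of a persistent identifier as a dict.

    Raises ValueError if the string does not conform to the grammar.
    """
    match = PID_RE.fullmatch(pid)
    if match is None:
        raise ValueError("invalid persistent identifier: %r" % pid)
    return match.groupdict()


parts = parse_identifier("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2")
print(parts["object_type"])  # cnt
```

Because the pattern only accepts lowercase hexadecimal digits and exactly 40 of
them, uppercase or truncated object identifiers are rejected.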
Semantics
---------
``:`` is used as separator between the logical parts of identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of
identifiers that conform to previous versions of the scheme).
A persistent identifier points to a single object, whose type is explicitly
captured by ``<object_type>``:
* ``ori`` identifiers point to **origins**,
* ``snp`` to **snapshots**,
* ``rel`` to **releases**,
* ``rev`` to **revisions**,
* ``dir`` to **directories**,
* ``cnt`` to **contents**.
The actual object pointed to is identified by the intrinsic identifier
``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
computed on the content and metadata of the object itself, as follows:
* for **origins**, intrinsic identifiers are computed as per
:py:func:`swh.model.identifiers.origin_identifier`
* for **snapshots**, intrinsic identifiers are computed as per
:py:func:`swh.model.identifiers.snapshot_identifier`
* for **releases**, as per
:py:func:`swh.model.identifiers.release_identifier`
* for **revisions**, as per
:py:func:`swh.model.identifiers.revision_identifier`
* for **directories**, as per
:py:func:`swh.model.identifiers.directory_identifier`
* for **contents**, the intrinsic identifier is the ``sha1_git`` hash among the
multiple hashes returned by
:py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
quotes), a space, the length of the content as decimal digits, a NUL byte,
and the actual content of the file.
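The byte sequence described for contents can be reproduced with the Python
standard library alone. This is a minimal sketch of that description, not the
code of :py:func:`swh.model.identifiers.content_identifier`; the function name
``content_sha1_git`` is an assumption made for illustration.

```python
import hashlib


def content_sha1_git(data):
    """Compute the Git-style SHA1 of a file content given as bytes."""
    # Header: the ASCII string "blob", a space, the content length in
    # decimal digits, and a NUL byte; the raw content follows.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# Illustrative usage: build the persistent identifier of a small content.
object_id = content_sha1_git(b"hello world\n")
print("swh:1:cnt:" + object_id)
```

The length in the header counts bytes, not characters, so text must be encoded
before it is hashed.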
Git compatibility
~~~~~~~~~~~~~~~~~
Intrinsic object identifiers for contents, directories, revisions, and releases
are, at present, compatible with the `Git <https://git-scm.com/>`_ way of
`computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
A Software Heritage content identifier will be identical to a Git blob
identifier of any file with the same content, a Software Heritage revision
identifier will be identical to the corresponding Git commit identifier, etc.
This is not the case for snapshot identifiers as Git doesn't have a
corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
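To illustrate how Git-compatible identifiers form a Merkle structure, here is
a sketch that hashes a flat directory the way Git hashes a tree object. It is
deliberately simplified: it only handles regular, non-executable files (Git
mode ``100644``) and no subdirectories. It is not the implementation of
:py:func:`swh.model.identifiers.directory_identifier`, and the function names
are hypothetical.

```python
import hashlib


def git_object_id(kind, payload):
    """Raw SHA1 of a Git-style object: kind, space, length, NUL, payload."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def flat_directory_id(files):
    """Hex identifier of a directory holding only regular files.

    `files` maps file names (bytes) to file contents (bytes).
    """
    payload = b""
    for name in sorted(files):
        child_id = git_object_id(b"blob", files[name])
        # Each entry: mode, space, name, NUL byte, raw 20-byte child id.
        payload += b"100644 " + name + b"\0" + child_id
    return git_object_id(b"tree", payload).hex()


before = flat_directory_id({b"README": b"hello world\n"})
after = flat_directory_id({b"README": b"hello, world\n"})
print(before != after)  # True
```

Because the directory payload embeds the identifiers of its entries, changing
a single file changes the directory identifier too, which is the Merkle
property mentioned in the description.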
Examples
--------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
* ``swh:1:ori:b63a575fe3faab7692c9f38fb09d4bb45651bb0f`` points to the
repository https://github.com/torvalds/linux .
Contextual information
======================
It is often useful to complement persistent identifiers with **contextual
information** about where the identified object has been found as well as which
specific parts of it are of interest. To that end it is possible, via a
dedicated syntax, to extend persistent identifiers with the following pieces of
information:
* the **software origin** where an object has been found/observed
* the **line number(s)** of interest, usually within a content object
Syntax
------
The full syntax to complement identifiers with contextual information is given
by the ``<identifier_with_context>`` entry point of the grammar:
.. code-block:: bnf
<identifier_with_context> ::= <identifier> [<lines_ctxt>] [<origin_ctxt>] ;
<lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>] ;
<origin_ctxt> ::= ";" "origin" "=" <url> ;
<line_number> ::= <dec_digit> + ;
<url> ::= (* RFC 3986 compliant URLs *) ;
Semantics
---------
``;`` is used as separator between persistent identifiers and additional
optional contextual information. Each piece of contextual information is
specified as a key/value pair, using ``=`` as a separator.
The following pieces of contextual information are supported:
* line numbers: it is possible to specify a single line number or a line range,
separating two numbers with ``-``. Note that line numbers are purely
indicative and are not meant to be stable, as in some degenerate cases
(e.g., text files which mix different types of line terminators) it is
impossible to resolve them unambiguously.
* software origin: where a given object has been found or observed in the wild,
as the URI that was used by Software Heritage to ingest the object into the
archive.
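The extended syntax can be split into its parts with a single regular
expression. The sketch below assumes the ordering given by the grammar (line
numbers first, origin last) and does not validate the origin against RFC 3986;
``CTXT_RE`` and ``parse_with_context`` are hypothetical names, and the origin
URL in the usage example is a placeholder.

```python
import re

# Hypothetical helper, not part of swh.model: mirrors the
# <identifier_with_context> rule, with lines before origin.
CTXT_RE = re.compile(
    r"(?P<identifier>swh:1:(?:ori|snp|rel|rev|dir|cnt):[0-9a-f]{40})"
    r"(?:;lines=(?P<first>[0-9]+)(?:-(?P<last>[0-9]+))?)?"
    r"(?:;origin=(?P<origin>.+))?"
)


def parse_with_context(text):
    """Split an identifier with context into identifier, lines, origin."""
    match = CTXT_RE.fullmatch(text)
    if match is None:
        raise ValueError("invalid identifier with context: %r" % text)
    first, last = match.group("first"), match.group("last")
    lines = None
    if first is not None:
        # A single line number is reported as a one-line range.
        lines = (int(first), int(last if last is not None else first))
    return {
        "identifier": match.group("identifier"),
        "lines": lines,
        "origin": match.group("origin"),
    }


result = parse_with_context(
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    ";lines=9-15;origin=https://example.org/repo.git"
)
print(result["lines"])  # (9, 15)
```

Matching the origin last lets the URL itself contain ``;`` or ``=`` characters
without confusing the parser.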
Resolution
==========
Dedicated resolvers
-------------------
Persistent identifiers can be resolved using the Software Heritage Web
application (see :py:mod:`swh.web`). In particular, the **root endpoint**
``/`` can be given a persistent identifier and will lead to the browsing page
of the corresponding object, like this:
``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
explicitly request persistent identifier resolution; see:
:http:get:`/api/1/resolve/(swh_id)/`.
Examples:
* `<https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://archive.softwareheritage.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://archive.softwareheritage.org/api/1/resolve/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
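Both kinds of resolver URL are plain string concatenations. The sketch below
builds them without performing any network request; the function names are
illustrative, no URL escaping is applied, and the trailing slash on the API
URL follows the ``/api/1/resolve/(swh_id)/`` route pattern quoted above.

```python
ARCHIVE_URL = "https://archive.softwareheritage.org"


def browse_url(identifier):
    """URL of the browsing page reached through the root endpoint."""
    return ARCHIVE_URL + "/" + identifier


def resolve_api_url(identifier):
    """URL of the dedicated resolve endpoint of the HTTP API."""
    return ARCHIVE_URL + "/api/1/resolve/" + identifier + "/"


pid = "swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d"
print(browse_url(pid))
print(resolve_api_url(pid))
```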
External resolvers
------------------
The following **independent resolvers** support resolution of Software
Heritage persistent identifiers:
* `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
<https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000655>`_).
* `Name-to-Thing (N2T) <https://n2t.net/>`_
Examples:
* `<https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
Note that resolution via Identifiers.org does not support contextual
information, due to `syntactic incompatibilities
<http://identifiers.org/documentation#custom_requests>`_.
References
==========
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for
Digital Objects: the Case of Software Source Code Preservation
<https://hal.archives-ouvertes.fr/hal-01865790v4>`_. In Proceedings of `iPRES
2018 <https://ipres2018.org/>`_: 15th International Conference on Digital
Preservation, Boston, MA, USA, September 2018, 9 pages.
[pytest]
addopts = --doctest-modules
norecursedirs = docs
pytest
[egg_info]
tag_build =
tag_date = 0
Metadata-Version: 2.1
Name: swh.model
Version: 0.0.44
Summary: Software Heritage data model
Home-page: https://forge.softwareheritage.org/diffusion/DMOD/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-model
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Description: swh-model
=========
Implementation of the Data model of the Software Heritage project, used to
archive source code artifacts.
This module defines the notion of Persistent Identifier (PID) and provides
tools to compute them:
```sh
$ swh-identify fork.c kmod.c sched/deadline.c
swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3 fork.c
swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2 kmod.c
swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82 sched/deadline.c
$ swh-identify --no-filename /usr/src/linux/kernel/
swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
```
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
MANIFEST.in
Makefile
README.md
requirements.txt
setup.py
version.txt
swh/__init__.py
swh.model.egg-info/PKG-INFO
swh.model.egg-info/SOURCES.txt
swh.model.egg-info/dependency_links.txt
swh.model.egg-info/entry_points.txt
swh.model.egg-info/requires.txt
swh.model.egg-info/top_level.txt
swh/model/__init__.py
swh/model/cli.py
swh/model/exceptions.py
swh/model/from_disk.py
swh/model/hashutil.py
swh/model/hypothesis_strategies.py
swh/model/identifiers.py
swh/model/merkle.py
swh/model/model.py
swh/model/toposort.py
swh/model/validators.py
swh/model/fields/__init__.py
swh/model/fields/compound.py
swh/model/fields/hashes.py
swh/model/fields/simple.py
swh/model/tests/__init__.py
swh/model/tests/generate_testdata_from_disk.py
swh/model/tests/test_cli.py
swh/model/tests/test_from_disk.py
swh/model/tests/test_hashutil.py
swh/model/tests/test_hypothesis_strategies.py
swh/model/tests/test_identifiers.py
swh/model/tests/test_merkle.py
swh/model/tests/test_model.py
swh/model/tests/test_toposort.py
swh/model/tests/test_validators.py
swh/model/tests/data/dir-folders/sample-folder.tgz
swh/model/tests/fields/__init__.py
swh/model/tests/fields/test_compound.py
swh/model/tests/fields/test_hashes.py
swh/model/tests/fields/test_simple.py
[console_scripts]
swh-identify=swh.model.cli:identify
[swh.cli.subcommands]
identify=swh.model.cli:identify
vcversioner
Click
attrs
hypothesis
python-dateutil
[:python_version < "3.6"]
pyblake2
[testing]
pytest
swh
[tox]
envlist=flake8,py3
[testenv:py3]
deps =
.[testing]
pytest-cov
commands =
pytest --cov=swh --cov-branch {posargs}
[testenv:flake8]
skip_install = true
deps =
flake8
commands =
{envpython} -m flake8
v0.0.44-0-ge77c94d