Commit b61c6665 authored by Stefano Zacchiroli

docs: document the naming scheme for persistent identifiers

Closes: T335
parent a01d81c3
@@ -12,6 +12,7 @@ Overview
--------
* :ref:`data-model`
* :ref:`persistent-identifiers`
Indices and tables
......
.. _persistent-identifiers:
Persistent identifiers
======================
You can point to objects present in the Software Heritage archive by means
of **persistent identifiers** that are guaranteed to remain stable (persistent)
over time. Their syntax, meaning, and usage are described below. Note that they
are identifiers and not URLs, even though a URL-based resolver for Software
Heritage persistent identifiers is also provided.
A persistent identifier can point to any software artifact (or "object")
available in the Software Heritage archive. Objects come in different types,
most notably:
* contents
* directories
* revisions
* releases
* snapshots
Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its persistent identifier as described below. Object identifiers
are strong cryptographic hashes computed on the entire set of object properties
to form a `Merkle structure <https://en.wikipedia.org/wiki/Merkle_tree>`_.
See :ref:`data-model` for an overview of object types and how they are linked
together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
object identifiers are computed.
Syntax
------
Syntactically, persistent identifiers are generated by the ``<identifier>``
entry point of the following grammar:
.. code-block:: bnf
<identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<hex_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
| "a" | "b" | "c" | "d" | "e" | "f" ;
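The grammar above is regular, so it can be checked with a single regular expression. The following is a minimal sketch, not part of ``swh.model``: the names ``SWHID_RE``, ``ParsedIdentifier``, and ``parse_identifier`` are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple

# Hypothetical helper, not part of swh.model: the pattern mirrors the grammar
# (scheme version "1", five object types, 40 lowercase hex digits).
SWHID_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>snp|rel|rev|dir|cnt):(?P<id>[0-9a-f]{40})"
)


class ParsedIdentifier(NamedTuple):
    scheme_version: int
    object_type: str
    object_id: str


def parse_identifier(text: str) -> ParsedIdentifier:
    """Split a persistent identifier into its parts, or raise ValueError."""
    match = SWHID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid persistent identifier: {text!r}")
    return ParsedIdentifier(int(match["version"]), match["type"], match["id"])


parsed = parse_identifier("swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d")
print(parsed.object_type, parsed.object_id)
```

``fullmatch`` is used so that trailing characters, uppercase hex digits, or a wrong object id length are all rejected, as the grammar requires.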
Semantics
---------
``:`` is used as separator between the logical parts of identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of
identifiers that conform to previous versions of the scheme).
A persistent identifier points to a single object, whose type is explicitly
captured by ``<object_type>``:
* ``snp`` identifiers point to **snapshots**,
* ``rel`` to **releases**,
* ``rev`` to **revisions**,
* ``dir`` to **directories**,
* ``cnt`` to **contents**.
The actual object pointed to is identified by the intrinsic identifier
``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
computed on the content and metadata of the object itself, as follows:
* for **snapshots**, intrinsic identifiers are computed as per
:py:func:`swh.model.identifiers.snapshot_identifier`
* for **releases**, as per
:py:func:`swh.model.identifiers.release_identifier`
* for **revisions**, as per
:py:func:`swh.model.identifiers.revision_identifier`
* for **directories**, as per
:py:func:`swh.model.identifiers.directory_identifier`
* for **contents**, the intrinsic identifier is the ``sha1_git`` hash among the
multiple hashes returned by
:py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
sequence obtained by concatenating the ASCII string ``"blob"`` (without
quotes), a space, the length of the content in bytes as decimal digits, a NUL
byte, and the actual content of the file.
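The content case described above can be reproduced with the Python standard library alone. This is a sketch of the byte layout only, not the ``swh.model`` implementation; the function names ``content_sha1_git`` and ``content_persistent_identifier`` are hypothetical.

```python
import hashlib


def content_sha1_git(data: bytes) -> str:
    """Hex-encoded SHA1 of b"blob <length>\\0" followed by the raw content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_persistent_identifier(data: bytes) -> str:
    """Hypothetical helper: wrap the intrinsic id in the version 1 scheme."""
    return "swh:1:cnt:" + content_sha1_git(data)


# The empty file has the same identifier as Git's well-known empty blob.
print(content_persistent_identifier(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

``hexdigest()`` returns lowercase hexadecimal digits, which matches the ``<hex_digit>`` rule of the grammar.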
Git compatibility
~~~~~~~~~~~~~~~~~
Intrinsic object identifiers for contents, directories, revisions, and releases
are, at present, compatible with the `Git <https://git-scm.com/>`_ way of
`computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
A Software Heritage content identifier will be identical to a Git blob
identifier of any file with the same content, a Software Heritage revision
identifier will be identical to the corresponding Git commit identifier, etc.
This is not the case for snapshot identifiers as Git doesn't have a
corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or of Git).
Examples
--------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Resolution
----------
Persistent identifiers can be resolved using the Software Heritage Web
application (see :py:mod:`swh.web`).
In particular, the ``/browse/`` endpoint can be given a persistent identifier
and will lead to the browsing page of the corresponding object, like this:
``https://archive.softwareheritage.org/browse/<identifier>``. For example:
* `<https://archive.softwareheritage.org/browse/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://archive.softwareheritage.org/browse/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://archive.softwareheritage.org/browse/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://archive.softwareheritage.org/browse/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://archive.softwareheritage.org/browse/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
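Building a resolver URL amounts to appending the identifier to the ``/browse/`` endpoint. A minimal sketch follows, assuming the base URL shown in the examples above; ``BROWSE_BASE`` and ``browse_url`` are hypothetical names, and the prefix check is only a light sanity test rather than full validation against the grammar.

```python
# Base URL taken from the examples above; the helper name is hypothetical.
BROWSE_BASE = "https://archive.softwareheritage.org/browse/"


def browse_url(identifier: str) -> str:
    """Return the browsing URL for a persistent identifier."""
    if not identifier.startswith("swh:1:"):
        raise ValueError(f"unexpected identifier prefix: {identifier!r}")
    return BROWSE_BASE + identifier


print(browse_url("swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f"))
```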