diff --git a/docs/index.rst b/docs/index.rst
index db680710600a20d9a7d119477bf8798f968fa46b..74756e7522401bb4e7fa63d9231fc401524f1c6f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -12,6 +12,7 @@ Overview
 --------
 
 * :ref:`data-model`
+* :ref:`persistent-identifiers`
 
 
 Indices and tables
diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst
new file mode 100644
index 0000000000000000000000000000000000000000..c796a808ff8670f9a13cc13f32b5018ccdfbc35d
--- /dev/null
+++ b/docs/persistent-identifiers.rst
@@ -0,0 +1,145 @@
+.. _persistent-identifiers:
+
+Persistent identifiers
+======================
+
+You can point to objects present in the Software Heritage archive by means
+of **persistent identifiers** that are guaranteed to remain stable (persistent)
+over time. Their syntax, meaning, and usage are described below. Note that they
+are identifiers and not URLs, even though a URL-based resolver for Software
+Heritage persistent identifiers is also provided.
+
+A persistent identifier can point to any software artifact (or "object")
+available in the Software Heritage archive. Objects come in different types,
+most notably:
+
+* contents
+* directories
+* revisions
+* releases
+* snapshots
+
+Each object is identified by an intrinsic, type-specific object identifier that
+is embedded in its persistent identifier as described below. Object identifiers
+are strong cryptographic hashes computed on the entire set of object properties
+to form a `Merkle structure <https://en.wikipedia.org/wiki/Merkle_tree>`_.
+
+See :ref:`data-model` for an overview of object types and how they are linked
+together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
+object identifiers are computed.
+
+
+Syntax
+------
+
+Syntactically, persistent identifiers are generated by the ``<identifier>``
+entry point of the grammar:
+
+.. code-block:: bnf
+
+  <identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
+  <scheme_version> ::= "1" ;
+  <object_type> ::=
+      "snp"  (* snapshot *)
+    | "rel"  (* release *)
+    | "rev"  (* revision *)
+    | "dir"  (* directory *)
+    | "cnt"  (* content *)
+    ;
+  <object_id> ::= 40 * <hex_digit> ;  (* intrinsic object id, as hex-encoded SHA1 *)
+  <hex_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
+                | "a" | "b" | "c" | "d" | "e" | "f" ;
+
+
+Semantics
+---------
+
+``:`` is used as a separator between the logical parts of identifiers. The
+``swh`` prefix makes explicit that these identifiers are related to *SoftWare
+Heritage*. ``1`` (``<scheme_version>``) is the current version of this
+identifier *scheme*; future editions will use higher version numbers, possibly
+breaking backward compatibility (but without breaking the resolvability of
+identifiers that conform to previous versions of the scheme).
+
+A persistent identifier points to a single object, whose type is explicitly
+captured by ``<object_type>``:
+
+* ``snp`` identifiers point to **snapshots**,
+* ``rel`` to **releases**,
+* ``rev`` to **revisions**,
+* ``dir`` to **directories**,
+* ``cnt`` to **contents**.
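As an illustration of the grammar and the type mapping above, here is a minimal parsing sketch. The names ``parse_persistent_id``, ``PersistentId``, and ``OBJECT_TYPES`` are hypothetical and are not part of ``swh.model``; the regular expression is simply a transcription of the BNF rules under that assumption.

```python
import re
from typing import NamedTuple

# Hypothetical helper, not the swh.model API: maps each <object_type>
# code from the grammar to the kind of object it designates.
OBJECT_TYPES = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}

# Direct transcription of the grammar: "swh", scheme version "1",
# a known object type, and 40 lowercase hexadecimal digits.
_PID_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>snp|rel|rev|dir|cnt):(?P<id>[0-9a-f]{40})"
)


class PersistentId(NamedTuple):
    scheme_version: int
    object_type: str
    object_id: str


def parse_persistent_id(text: str) -> PersistentId:
    """Split an identifier into its parts, or raise ValueError."""
    match = _PID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid persistent identifier: {text!r}")
    return PersistentId(int(match["version"]), match["type"], match["id"])
```

For instance, ``parse_persistent_id("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2")`` yields an object type of ``"cnt"``, while an identifier containing uppercase hex digits is rejected because the grammar only allows lowercase.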
+
+The actual object pointed to is identified by the intrinsic identifier
+``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
+computed on the content and metadata of the object itself, as follows:
+
+* for **snapshots**, intrinsic identifiers are computed as per
+  :py:func:`swh.model.identifiers.snapshot_identifier`
+
+* for **releases**, as per
+  :py:func:`swh.model.identifiers.release_identifier`
+
+* for **revisions**, as per
+  :py:func:`swh.model.identifiers.revision_identifier`
+
+* for **directories**, as per
+  :py:func:`swh.model.identifiers.directory_identifier`
+
+* for **contents**, the intrinsic identifier is the ``sha1_git`` hash among the
+  multiple hashes returned by
+  :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
+  sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
+  quotes), a space, the length of the content as decimal digits, a NULL byte,
+  and the actual content of the file.
+
+
+Git compatibility
+~~~~~~~~~~~~~~~~~
+
+Intrinsic object identifiers for contents, directories, revisions, and releases
+are, at present, compatible with the `Git <https://git-scm.com/>`_ way of
+`computing identifiers
+<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
+A Software Heritage content identifier will be identical to a Git blob
+identifier of any file with the same content, a Software Heritage revision
+identifier will be identical to the corresponding Git commit identifier, etc.
+This is not the case for snapshot identifiers as Git doesn't have a
+corresponding object type.
+
+Note that Git compatibility is incidental and is not guaranteed to be
+maintained in future versions of this scheme (or Git).
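The content rule described above (``"blob"``, a space, the decimal length, a NULL byte, then the file content) can be sketched with the standard library alone. This is an illustrative re-implementation, not the code of ``swh.model.identifiers.content_identifier``; the function names below are hypothetical.

```python
import hashlib


def content_object_id(data: bytes) -> str:
    """Hex-encoded SHA1 of the Git-style blob header followed by the content."""
    # Header: ASCII "blob", a space, the content length in decimal, a NULL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_persistent_id(data: bytes) -> str:
    """Embed the intrinsic identifier in a version 1 ``cnt`` identifier."""
    return "swh:1:cnt:" + content_object_id(data)
```

Because of the Git compatibility noted above, the result for empty content is the well-known Git empty blob identifier, ``e69de29bb2d1d6434b8b29ae775ad8c2e48c5391``, which is also what ``git hash-object`` reports for an empty file.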
+
+
+Examples
+--------
+
+* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
+  of a file containing the full text of the GPL3 license
+* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
+  containing the source code of the Darktable photography application as it was
+  at some point on 4 May 2017
+* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
+  the development history of Darktable, dated 16 January 2017, that added
+  undo/redo support for masks
+* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
+  release 2.3.0, dated 24 December 2016
+* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
+  of the entire Darktable Git repository taken on 4 May 2017 from GitHub
+
+
+Resolution
+----------
+
+Persistent identifiers can be resolved using the Software Heritage Web
+application (see :py:mod:`swh.web`).
+
+In particular, the ``/browse/`` endpoint can be given a persistent identifier
+and will lead to the browsing page of the corresponding object, like this:
+``https://archive.softwareheritage.org/browse/<identifier>``. For example:
+
+* `<https://archive.softwareheritage.org/browse/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
+* `<https://archive.softwareheritage.org/browse/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
+* `<https://archive.softwareheritage.org/browse/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
+* `<https://archive.softwareheritage.org/browse/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
+* `<https://archive.softwareheritage.org/browse/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
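The resolver URLs listed above follow a fixed pattern, so building one is a matter of appending the identifier to the ``/browse/`` endpoint. The sketch below assumes only that pattern; ``browse_url`` is a hypothetical helper, and it performs just a cheap prefix check rather than full validation.

```python
# Base of the /browse/ endpoint, as given in the Resolution section.
BROWSE_BASE = "https://archive.softwareheritage.org/browse/"


def browse_url(persistent_id: str) -> str:
    """Return the browsing URL for a version 1 persistent identifier."""
    # Hypothetical sanity check: only scheme version 1 is described here.
    if not persistent_id.startswith("swh:1:"):
        raise ValueError(f"unsupported identifier: {persistent_id!r}")
    return BROWSE_BASE + persistent_id
```

Keeping the identifier and the URL separate in code mirrors the point made at the top of the document: the identifier itself is not a URL, and the resolver is only one way of using it.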