diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst index a25e4217cb1beda771fb6830538237fe1fc9cd93..f78aee37f261bb7e36a4ae661544461ff3f5e595 100644 --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -1,21 +1,24 @@ .. _persistent-identifiers: -====================== -Persistent identifiers -====================== +================================================ +SoftWare Heritage persistent IDentifiers (SWHID) +================================================ + +**version 1.2** + Description =========== You can point to objects present in the Software Heritage archive by the means -of **persistent identifiers** that are guaranteed to remain stable (persistent) -over time. Their syntax, meaning, and usage is described below. Note that they -are identifiers and not URLs, even though an URL-based resolver for Software -Heritage persistent identifiers is also provided. +of **SoftWare Heritage persistent IDentifiers**, or **SWHID** for short, that +are guaranteed to remain stable (persistent) over time. Their syntax, meaning, +and usage is described below. Note that they are identifiers and not URLs, even +though an URL-based resolver for Software Heritage persistent identifiers is +also provided. -A persistent identifier can point to any software artifact (or "object") -available in the Software Heritage archive. Objects come in different types, -and most notably: +A SWHID can point to any software artifact (or "object") available in the +Software Heritage archive. Objects come in different types, and most notably: * contents * directories @@ -24,20 +27,20 @@ and most notably: * snapshots Each object is identified by an intrinsic, type-specific object identifier that -is embedded in its persistent identifier as described below. Object identifiers -are strong cryptographic hashes computed on the entire set of object properties -to form a `Merkle structure <https://en.wikipedia.org/wiki/Merkle_tree>`_. 
+is embedded in its SWHID as described below. SWHIDs are strong cryptographic +hashes computed on the entire set of object properties to form a `Merkle +structure <https://en.wikipedia.org/wiki/Merkle_tree>`_. -See :ref:`data-model` for an overview of object types and how they are linked -together. See :py:mod:`swh.model.identifiers` for details on how intrinsic -object identifiers are computed. +See the :ref:`Software Heritage data model <data-model>` for an overview of +object types and how they are linked together. See +:py:mod:`swh.model.identifiers` for details on how SWHIDs are computed. Syntax ------ -Syntactically, persistent identifiers are generated by the ``<identifier>`` -entry point of the grammar: +Syntactically, SWHIDs are generated by the ``<identifier>`` entry point of the +grammar: .. code-block:: bnf @@ -58,15 +61,15 @@ entry point of the grammar: Semantics --------- -``:`` is used as separator between the logical parts of identifiers. The -``swh`` prefix makes explicit that these identifiers are related to *SoftWare +``:`` is used as separator between the logical parts of SWHIDs. The ``swh`` +prefix makes explicit that these identifiers are related to *SoftWare Heritage*. ``1`` (``<scheme_version>``) is the current version of this identifier *scheme*; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of -identifiers that conform to previous versions of the scheme). +SWHIDs that conform to previous versions of the scheme). 
-A persistent identifier points to a single object, whose type is explicitly -captured by ``<object_type>``: +A SWHID points to a single object, whose type is explicitly captured by +``<object_type>``: * ``snp`` to **snapshots**, * ``rel`` to **releases**, @@ -101,15 +104,14 @@ computed on the content and metadata of the object itself, as follows: Git compatibility ~~~~~~~~~~~~~~~~~ -Intrinsic object identifiers for contents, directories, revisions, and releases -are, at present, compatible with the `Git <https://git-scm.com/>`_ way of -`computing identifiers +SWHIDs for contents, directories, revisions, and releases are, at present, +compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects. -A Software Heritage content identifier will be identical to a Git blob -identifier of any file with the same content, a Software Heritage revision -identifier will be identical to the corresponding Git commit identifier, etc. -This is not the case for snapshot identifiers as Git doesn't have a -corresponding object type. +A SWHID for a content object will correspond (in its ``<object_id>`` part) to a +Git blob identifier of any file with the same content; a SWHID for a revision +will correspond to the Git commit identifier for the same revision, etc. This +is not the case for snapshot identifiers, as Git does not have a corresponding +object type. Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). @@ -135,10 +137,12 @@ Examples Contextual information ====================== -The Software Heritage persistent identifiers described above are *intrinsic identifiers*, as they are computed from the designated object itself, and it is often useful to provide *contextual information* about a particular -occurrence of the object, like the origin from where the object has been found. 
-To this end, persistent identifiers can be equipped with **qualifiers** that -contain this *contextual information*. Qualifiers come in different kinds : +The SWHIDs as described above are *intrinsic identifiers*, as they are computed +from the designated object itself, and it is often useful to provide +*contextual information* about a particular occurrence of the object, like the +origin from where the object has been found. To this end, SWHIDs can be +coupled with **qualifiers** that capture such *contextual information*. +Qualifiers come in different kinds: * origin * visit @@ -146,11 +150,12 @@ contain this *contextual information*. Qualifiers come in different kinds : * path * lines + Syntax ------ -The full-syntax to complement identifiers with contextual information is given -by the ``<identifier_with_context>`` entry point of the grammar: +The full-syntax to complement SWHIDs with contextual information is given by +the ``<identifier_with_context>`` entry point of the grammar: .. code-block:: bnf @@ -166,45 +171,54 @@ by the ``<identifier_with_context>`` entry point of the grammar: <url> ::= (* RFC 3986 compliant URLs *) <path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *) -Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` respectively). +Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of +RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all +occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` +respectively). + Semantics --------- -``;`` is used as separator between persistent identifiers and the -optional contextual information qualifiers. Each contextual information qualifier is -specified as a key/value pair, using ``=`` as a separator. 
+``;`` is used as separator between SWHIDs and the optional contextual +information qualifiers. Each contextual information qualifier is specified as a +key/value pair, using ``=`` as a separator. The following piece of contextual information are supported: -* **origin** : the *software origin* where an object has been found or observed in the wild, - as an URI; -* **visit** : persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; -* **anchor** : a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, - as a persistent identifier of a directory, a revision, a release or a snapshot; -* **path** : the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object; - when the anchor denotes a directory or a revision, and almost always when it's a release, - the root directory is uniquely determined; when the anchor denotes a snapshot, the root - directory is the one pointed to by ``HEAD`` (possibly indirectly), - and undefined if such a reference is missing; +* **origin** : the *software origin* where an object has been found or observed + in the wild, as an URI; +* **visit** : persistent identifier of a *snapshot* corresponding to a specific + *visit* of a repository containing the designated object; +* **anchor** : a *designated node* in the Merkle DAG relative to which a *path + to the object* is specified, as a persistent identifier of a directory, a + revision, a release or a snapshot; +* **path** : the *absolute file path*, from the *root directory* associated to + the *anchor node*, to the object; when the anchor denotes a directory or a + revision, and almost always when it's a release, the root directory is + uniquely determined; when the anchor denotes a snapshot, the root directory + is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if + such a reference is missing; * **lines** : *line number(s)* of 
interest, usually within a content object We recommend to equip identifiers meant to be shared with as many qualifiers as -possible. While qualifiers may be listed in any order, it is good practice -to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``. -Redundant information should be omitted: for example, if the *visit* -is present, and the *path* is relative to the snapshot indicated there, then the -*anchor* qualifier is superfluous. +possible. While qualifiers may be listed in any order, it is good practice to +present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, +``path``, ``lines``. Redundant information should be omitted: for example, if +the *visit* is present, and the *path* is relative to the snapshot indicated +there, then the *anchor* qualifier is superfluous. + Example ------- -The following `fully qualified identifier <https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_ -denotes the lines 9 to 15 of a file content that -can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory -of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained -in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from -the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. 
+The following `fully qualified SWHID +<https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_ +denotes the lines 9 to 15 of a file content that can be found at absolute path +``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision +``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the +snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the +origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. .. code-block:: url @@ -216,7 +230,9 @@ the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. lines=9-15 -And this is an example of `a fully qualified identifier with a percent escaped file path <https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_ +And this is an example of `a fully qualified SWHID with a percent escaped file +path +<https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_ .. code-block:: url @@ -234,15 +250,13 @@ Resolution Dedicated resolvers ------------------- -Persistent identifiers can be resolved using the Software Heritage Web -application (see :py:mod:`swh.web`). 
In particular, the **root endpoint** -``/`` can be given a persistent identifier and will lead to the browsing page -of the corresponding object, like this: -``https://archive.softwareheritage.org/<identifier>``. +SWHIDs can be resolved using the Software Heritage Web application (see +:py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a +SWHID and will lead to the browsing page of the corresponding object, like +this: ``https://archive.softwareheritage.org/<identifier>``. A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to -explicitly request persistent identifier resolution; see: -:http:get:`/api/1/resolve/(swh_id)/`. +explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`. Examples: @@ -256,8 +270,7 @@ Examples: External resolvers ------------------ -The following **independent resolvers** support resolution of Software -Heritage persistent identifiers: +The following **independent resolvers** support resolution of SWHIDs: * `Identifiers.org <https://identifiers.org>`_; see: `<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655