Commit 4c78d479 authored by Stefano Zacchiroli

PID doc: embrace the SWHID naming

parent 0ab482e7
.. _persistent-identifiers: .. _persistent-identifiers:
====================== ================================================
Persistent identifiers SoftWare Heritage persistent IDentifiers (SWHID)
====================== ================================================
**version 1.2**
Description Description
=========== ===========
You can point to objects present in the Software Heritage archive by the means You can point to objects present in the Software Heritage archive by the means
of **persistent identifiers** that are guaranteed to remain stable (persistent) of **SoftWare Heritage persistent IDentifiers**, or **SWHID** for short, that
over time. Their syntax, meaning, and usage is described below. Note that they are guaranteed to remain stable (persistent) over time. Their syntax, meaning,
are identifiers and not URLs, even though an URL-based resolver for Software and usage is described below. Note that they are identifiers and not URLs, even
Heritage persistent identifiers is also provided. though an URL-based resolver for Software Heritage persistent identifiers is
also provided.
A persistent identifier can point to any software artifact (or "object") A SWHID can point to any software artifact (or "object") available in the
available in the Software Heritage archive. Objects come in different types, Software Heritage archive. Objects come in different types, and most notably:
and most notably:
* contents * contents
* directories * directories
...@@ -24,20 +27,20 @@ and most notably: ...@@ -24,20 +27,20 @@ and most notably:
* snapshots * snapshots
Each object is identified by an intrinsic, type-specific object identifier that Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its persistent identifier as described below. Object identifiers is embedded in its SWHID as described below. SWHIDs are strong cryptographic
are strong cryptographic hashes computed on the entire set of object properties hashes computed on the entire set of object properties to form a `Merkle
to form a `Merkle structure <https://en.wikipedia.org/wiki/Merkle_tree>`_. structure <https://en.wikipedia.org/wiki/Merkle_tree>`_.
See :ref:`data-model` for an overview of object types and how they are linked See the :ref:`Software Heritage data model <data-model>` for an overview of
together. See :py:mod:`swh.model.identifiers` for details on how intrinsic object types and how they are linked together. See
object identifiers are computed. :py:mod:`swh.model.identifiers` for details on how SWHIDs are computed.
Syntax Syntax
------ ------
Syntactically, persistent identifiers are generated by the ``<identifier>`` Syntactically, SWHIDs are generated by the ``<identifier>`` entry point of the
entry point of the grammar: grammar:
.. code-block:: bnf .. code-block:: bnf
...@@ -58,15 +61,15 @@ entry point of the grammar: ...@@ -58,15 +61,15 @@ entry point of the grammar:
Semantics Semantics
--------- ---------
``:`` is used as separator between the logical parts of identifiers. The ``:`` is used as separator between the logical parts of SWHIDs. The ``swh``
``swh`` prefix makes explicit that these identifiers are related to *SoftWare prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of breaking backward compatibility (but without breaking the resolvability of
identifiers that conform to previous versions of the scheme). SWHIDs conform to previous versions of the scheme).
A persistent identifier points to a single object, whose type is explicitly A SWHID points to a single object, whose type is explicitly captured by
captured by ``<object_type>``: ``<object_type>``:
* ``snp`` to **snapshots**, * ``snp`` to **snapshots**,
* ``rel`` to **releases**, * ``rel`` to **releases**,
...@@ -101,15 +104,14 @@ computed on the content and metadata of the object itself, as follows: ...@@ -101,15 +104,14 @@ computed on the content and metadata of the object itself, as follows:
Git compatibility Git compatibility
~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~
Intrinsic object identifiers for contents, directories, revisions, and releases SWHIDs for contents, directories, revisions, and releases are, at present,
are, at present, compatible with the `Git <https://git-scm.com/>`_ way of compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
`computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects. <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
A Software Heritage content identifier will be identical to a Git blob A SWHID for a content object will correspond (in its ``<object_id>`` part) to a
identifier of any file with the same content, a Software Heritage revision Git blob identifier of any file with the same content; a SWHID for a revision
identifier will be identical to the corresponding Git commit identifier, etc. will correspond to the Git commit identifier for the same revision, etc. This
This is not the case for snapshot identifiers as Git doesn't have a is not the case for snapshot identifiers, as Git does not have a corresponding
corresponding object type. object type.
Note that Git compatibility is incidental and is not guaranteed to be Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git). maintained in future versions of this scheme (or Git).
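The Git compatibility described above can be illustrated with a minimal sketch. It assumes the usual Git blob hashing convention (SHA-1 over the header ``blob <length>\0`` followed by the raw bytes). The helper names ``git_blob_sha1`` and ``content_swhid`` are hypothetical and not part of :py:mod:`swh.model.identifiers`.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the Git blob identifier (hex SHA-1) for the given bytes."""
    # Git hashes the header "blob <size>\0" followed by the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build a version 1 SWHID for a content object (hypothetical helper)."""
    return "swh:1:cnt:" + git_blob_sha1(data)


if __name__ == "__main__":
    # The empty file has a well-known Git blob identifier.
    print(content_swhid(b""))
```

For the empty byte string this prints ``swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391``, whose ``<object_id>`` part matches what ``git hash-object`` reports for an empty file.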
...@@ -135,10 +137,12 @@ Examples ...@@ -135,10 +137,12 @@ Examples
Contextual information Contextual information
====================== ======================
The Software Heritage persistent identifiers described above are *intrinsic identifiers*, as they are computed from the designated object itself, and it is often useful to provide *contextual information* about a particular The SWHIDs as described above are *intrinsic identifiers*, as they are computed
occurrence of the object, like the origin from where the object has been found. from the designated object itself, and it is often useful to provide
To this end, persistent identifiers can be equipped with **qualifiers** that *contextual information* about a particular occurrence of the object, like the
contain this *contextual information*. Qualifiers come in different kinds : origin from where the object has been found. To this end, SWHIDs can be
coupled with **qualifiers** that capture such *contextual information*.
Qualifiers come in different kinds:
* origin * origin
* visit * visit
...@@ -146,11 +150,12 @@ contain this *contextual information*. Qualifiers come in different kinds : ...@@ -146,11 +150,12 @@ contain this *contextual information*. Qualifiers come in different kinds :
* path * path
* lines * lines
Syntax Syntax
------ ------
The full-syntax to complement identifiers with contextual information is given The full-syntax to complement SWHIDs with contextual information is given by
by the ``<identifier_with_context>`` entry point of the grammar: the ``<identifier_with_context>`` entry point of the grammar:
.. code-block:: bnf .. code-block:: bnf
...@@ -166,45 +171,54 @@ by the ``<identifier_with_context>`` entry point of the grammar: ...@@ -166,45 +171,54 @@ by the ``<identifier_with_context>`` entry point of the grammar:
<url> ::= (* RFC 3986 compliant URLs *) <url> ::= (* RFC 3986 compliant URLs *)
<path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *) <path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *)
Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` respectively). Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of
RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all
occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25`
respectively).
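The escaping rule above can be sketched as follows. The function names ``escape_path`` and ``unescape_path`` are hypothetical, and the sketch handles only the two characters named in the text (``;`` and ``%``); it does not attempt full RFC 3986 percent-encoding.

```python
def escape_path(path: str) -> str:
    """Percent-encode ';' and '%' so the path can be used as a qualifier value."""
    # Escape '%' first, otherwise the '%' introduced by '%3B' would be
    # escaped a second time.
    return path.replace("%", "%25").replace(";", "%3B")


def unescape_path(escaped: str) -> str:
    """Reverse escape_path (assumes the uppercase '%3B' form it produces)."""
    # Decode '%3B' first: an escaped literal '%3B' reads '%253B', which
    # contains no '%3B' substring, so it is left alone until the next step.
    return escaped.replace("%3B", ";").replace("%25", "%")


if __name__ == "__main__":
    print(escape_path("/support/x;url=foo/"))
```

The order of the two replacements matters in both directions, which is why each function documents it.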
Semantics Semantics
--------- ---------
``;`` is used as separator between persistent identifiers and the ``;`` is used as separator between SWHIDs and the optional contextual
optional contextual information qualifiers. Each contextual information qualifier is information qualifiers. Each contextual information qualifier is specified as a
specified as a key/value pair, using ``=`` as a separator. key/value pair, using ``=`` as a separator.
The following piece of contextual information are supported: The following piece of contextual information are supported:
* **origin** : the *software origin* where an object has been found or observed in the wild, * **origin** : the *software origin* where an object has been found or observed
as an URI; in the wild, as an URI;
* **visit** : persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; * **visit** : persistent identifier of a *snapshot* corresponding to a specific
* **anchor** : a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, *visit* of a repository containing the designated object;
as a persistent identifier of a directory, a revision, a release or a snapshot; * **anchor** : a *designated node* in the Merkle DAG relative to which a *path
* **path** : the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object; to the object* is specified, as a persistent identifier of a directory, a
when the anchor denotes a directory or a revision, and almost always when it's a release, revision, a release or a snapshot;
the root directory is uniquely determined; when the anchor denotes a snapshot, the root * **path** : the *absolute file path*, from the *root directory* associated to
directory is the one pointed to by ``HEAD`` (possibly indirectly), the *anchor node*, to the object; when the anchor denotes a directory or a
and undefined if such a reference is missing; revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if
such a reference is missing;
* **lines** : *line number(s)* of interest, usually within a content object * **lines** : *line number(s)* of interest, usually within a content object
We recommend to equip identifiers meant to be shared with as many qualifiers as We recommend to equip identifiers meant to be shared with as many qualifiers as
possible. While qualifiers may be listed in any order, it is good practice possible. While qualifiers may be listed in any order, it is good practice to
to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``. present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``,
Redundant information should be omitted: for example, if the *visit* ``path``, ``lines``. Redundant information should be omitted: for example, if
is present, and the *path* is relative to the snapshot indicated there, then the the *visit* is present, and the *path* is relative to the snapshot indicated
*anchor* qualifier is superfluous. there, then the *anchor* qualifier is superfluous.
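As an illustration of the semantics above, here is a minimal sketch that splits a qualified identifier into its core part and its qualifiers. The function name ``parse_qualified_swhid`` is hypothetical; the sketch does not validate the core identifier against the grammar, and it assumes that qualifier values contain no unescaped ``;``.

```python
from typing import Dict, Tuple


def parse_qualified_swhid(text: str) -> Tuple[str, Dict[str, str]]:
    """Split 'core;key=value;...' into (core, {key: value})."""
    # ';' separates the core identifier from each qualifier.
    core, *parts = text.split(";")
    qualifiers: Dict[str, str] = {}
    for part in parts:
        # '=' separates key and value; only the first '=' counts, so values
        # such as 'path=/x%3Burl=foo/' keep their own '=' characters.
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {part!r}")
        qualifiers[key] = value
    return core, qualifiers


if __name__ == "__main__":
    core, quals = parse_qualified_swhid(
        "swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04"
        ";origin=https://github.com/web-platform-tests/wpt"
        ";lines=9-15"
    )
    print(core)
    print(quals)
```

Using ``partition`` rather than ``split("=")`` is a deliberate choice here, since paths and origin URLs may legitimately contain ``=``.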
Example Example
------- -------
The following `fully qualified identifier <https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_ The following `fully qualified SWHID
denotes the lines 9 to 15 of a file content that <https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory denotes the lines 9 to 15 of a file content that can be found at absolute path
of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision
in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the
the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the
origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
.. code-block:: url .. code-block:: url
...@@ -216,7 +230,9 @@ the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. ...@@ -216,7 +230,9 @@ the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
lines=9-15 lines=9-15
And this is an example of `a fully qualified identifier with a percent escaped file path <https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_ And this is an example of `a fully qualified SWHID with a percent escaped file
path
<https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
.. code-block:: url .. code-block:: url
...@@ -234,15 +250,13 @@ Resolution ...@@ -234,15 +250,13 @@ Resolution
Dedicated resolvers Dedicated resolvers
------------------- -------------------
Persistent identifiers can be resolved using the Software Heritage Web SWHIDs can be resolved using the Software Heritage Web application (see
application (see :py:mod:`swh.web`). In particular, the **root endpoint** :py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a
``/`` can be given a persistent identifier and will lead to the browsing page SWHID and will lead to the browsing page of the corresponding object, like
of the corresponding object, like this: this: ``https://archive.softwareheritage.org/<identifier>``.
``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
explicitly request persistent identifier resolution; see: explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`.
:http:get:`/api/1/resolve/(swh_id)/`.
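The two resolution routes above can be sketched as simple URL builders. The function names are hypothetical, and the sketch assumes that the ``/api/1/resolve/`` endpoint is served from the same host as the root endpoint; no network request is made.

```python
# Base URL of the archive, as used in the browsing example above.
ARCHIVE_BASE = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the root endpoint, which leads to the object's browsing page."""
    return f"{ARCHIVE_BASE}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the dedicated resolve endpoint (host is an assumption)."""
    return f"{ARCHIVE_BASE}/api/1/resolve/{swhid}/"


if __name__ == "__main__":
    swhid = "swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0"
    print(browse_url(swhid))
    print(resolve_api_url(swhid))
```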
Examples: Examples:
...@@ -256,8 +270,7 @@ Examples: ...@@ -256,8 +270,7 @@ Examples:
External resolvers External resolvers
------------------ ------------------
The following **independent resolvers** support resolution of Software The following **independent resolvers** support resolution of SWHIDs:
Heritage persistent identifiers:
* `Identifiers.org <https://identifiers.org>`_; see: * `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655 `<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
......