See :ref:`data-model` for an overview of object types and how they are linked
together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
object identifiers are computed.
See the :ref:`Software Heritage data model <data-model>` for an overview of
object types and how they are linked together. See
:py:mod:`swh.model.identifiers` for details on how SWHIDs are computed.
Syntax
------
Syntactically, persistent identifiers are generated by the ``<identifier>``
entry point of the grammar:
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point of the
grammar:
.. code-block:: bnf
...
...
@@ -58,15 +61,15 @@ entry point of the grammar:
Semantics
---------
``:`` is used as separator between the logical parts of identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
``:`` is used as separator between the logical parts of SWHIDs. The ``swh``
prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of
identifiers that conform to previous versions of the scheme).
SWHIDs conform to previous versions of the scheme).
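
As an illustration of the ``:``-separated structure, the following minimal
sketch splits a core SWHID into its logical parts. The helper name
``split_core_swhid`` is hypothetical and not part of ``swh.model``; the
pattern assumes a three-letter object type and a 40-character lowercase
hexadecimal object identifier, as in the examples of this document, and it
does not replace the grammar as the authoritative definition.

```python
import re

# Assumed shape: swh:<scheme_version>:<object_type>:<object_id>
# (three-letter type and 40 hex digits are assumptions based on the examples).
CORE_SWHID = re.compile(
    r"swh:(?P<scheme_version>[0-9]+):(?P<object_type>[a-z]{3}):(?P<object_id>[0-9a-f]{40})"
)


def split_core_swhid(swhid: str) -> dict:
    """Return the logical parts of a core SWHID (hypothetical helper)."""
    match = CORE_SWHID.fullmatch(swhid)
    if match is None:
        raise ValueError(f"not a core SWHID: {swhid!r}")
    return match.groupdict()


parts = split_core_swhid("swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0")
print(parts["scheme_version"], parts["object_type"], parts["object_id"])
```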
A persistent identifier points to a single object, whose type is explicitly
captured by ``<object_type>``:
A SWHID points to a single object, whose type is explicitly captured by
``<object_type>``:
* ``snp`` to **snapshots**,
* ``rel`` to **releases**,
...
...
@@ -101,15 +104,14 @@ computed on the content and metadata of the object itself, as follows:
Git compatibility
~~~~~~~~~~~~~~~~~
Intrinsic object identifiers for contents, directories, revisions, and releases
are, at present, compatible with the `Git <https://git-scm.com/>`_ way of
`computing identifiers
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
A Software Heritage content identifier will be identical to a Git blob
identifier of any file with the same content, a Software Heritage revision
identifier will be identical to the corresponding Git commit identifier, etc.
This is not the case for snapshot identifiers as Git doesn't have a
corresponding object type.
A SWHID for a content object will correspond (in its ``<object_id>`` part) to a
Git blob identifier of any file with the same content; a SWHID for a revision
will correspond to the Git commit identifier for the same revision, etc. This
is not the case for snapshot identifiers, as Git does not have a corresponding
object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
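
To make the correspondence concrete, here is a minimal sketch that computes a
content SWHID the way Git computes a blob identifier: the SHA1 of the header
``blob <length>\0`` followed by the raw bytes. The function name
``content_swhid`` is hypothetical; see :py:mod:`swh.model.identifiers` for the
actual implementation.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a content SWHID using Git's blob hashing (hypothetical helper)."""
    # Git hashes the header "blob <size>\0" followed by the file content.
    header = b"blob %d\0" % len(data)
    object_id = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + object_id


# The empty file has the well-known Git blob identifier e69de29b...
print(content_swhid(b""))
```

Running ``git hash-object`` on a file with the same content should yield the
same hexadecimal identifier as the ``<object_id>`` part.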
...
...
@@ -135,10 +137,12 @@ Examples
Contextual information
======================
The Software Heritage persistent identifiers described above are *intrinsic identifiers*, as they are computed from the designated object itself, and it is often useful to provide *contextual information* about a particular
occurrence of the object, like the origin from where the object has been found.
To this end, persistent identifiers can be equipped with **qualifiers** that
contain this *contextual information*. Qualifiers come in different kinds :
The SWHIDs as described above are *intrinsic identifiers*, as they are computed
from the designated object itself, and it is often useful to provide
*contextual information* about a particular occurrence of the object, like the
origin from where the object has been found. To this end, SWHIDs can be
coupled with **qualifiers** that capture such *contextual information*.
Qualifiers come in different kinds:
* origin
* visit
...
...
@@ -146,11 +150,12 @@ contain this *contextual information*. Qualifiers come in different kinds :
* path
* lines
Syntax
------
The full-syntax to complement identifiers with contextual information is given
by the ``<identifier_with_context>`` entry point of the grammar:
The full syntax for complementing SWHIDs with contextual information is given
by the ``<identifier_with_context>`` entry point of the grammar:
.. code-block:: bnf
...
...
@@ -166,45 +171,54 @@ by the ``<identifier_with_context>`` entry point of the grammar:
Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` respectively).
Here ``<path_absolute_escaped>`` is the ``<path_absolute>`` in `Section 3.3 of
RFC 3986 <https://tools.ietf.org/html/rfc3986#section-3.3>`_ where all
occurrences of ``;`` and ``%`` must be percent-encoded (as ``%3B`` and ``%25``
respectively).
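
The escaping rule can be sketched as follows. The helper ``escape_path`` is
hypothetical, and the sketch assumes its input is a raw path that has not been
percent-encoded yet; it handles only the two characters named above, not the
general RFC 3986 encoding of other characters.

```python
def escape_path(path: str) -> str:
    """Percent-encode '%' and ';' in a path qualifier value (hypothetical helper)."""
    # Encode '%' first, so that the '%' introduced by '%3B' is not re-encoded.
    return path.replace("%", "%25").replace(";", "%3B")


print(escape_path("/support/x;url=foo/"))
```

Escaping ``;`` matters because the same character separates qualifiers, so an
unescaped occurrence inside a path would be read as the start of a new
qualifier.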
Semantics
---------
``;`` is used as separator between persistent identifiers and the
optional contextual information qualifiers. Each contextual information qualifier is
specified as a key/value pair, using ``=`` as a separator.
``;`` is used as separator between SWHIDs and the optional contextual
information qualifiers. Each contextual information qualifier is specified as a
key/value pair, using ``=`` as a separator.
The following pieces of contextual information are supported:
* **origin** : the *software origin* where an object has been found or observed in the wild,
as an URI;
* **visit** : persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object;
* **anchor** : a *designated node* in the Merkle DAG relative to which a *path to the object* is specified,
as a persistent identifier of a directory, a revision, a release or a snapshot;
* **path** : the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object;
when the anchor denotes a directory or a revision, and almost always when it's a release,
the root directory is uniquely determined; when the anchor denotes a snapshot, the root
directory is the one pointed to by ``HEAD`` (possibly indirectly),
and undefined if such a reference is missing;
* **origin** : the *software origin* where an object has been found or observed
  in the wild, as a URI;
* **visit** : persistent identifier of a *snapshot* corresponding to a specific
*visit* of a repository containing the designated object;
* **anchor** : a *designated node* in the Merkle DAG relative to which a *path
to the object* is specified, as a persistent identifier of a directory, a
revision, a release or a snapshot;
* **path** : the *absolute file path*, from the *root directory* associated to
the *anchor node*, to the object; when the anchor denotes a directory or a
revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if
such a reference is missing;
* **lines** : *line number(s)* of interest, usually within a content object
We recommend equipping identifiers meant to be shared with as many qualifiers as
possible. While qualifiers may be listed in any order, it is good practice
to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``.
Redundant information should be omitted: for example, if the *visit*
is present, and the *path* is relative to the snapshot indicated there, then the
*anchor* qualifier is superfluous.
possible. While qualifiers may be listed in any order, it is good practice to
present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``,
``path``, ``lines``. Redundant information should be omitted: for example, if
the *visit* is present, and the *path* is relative to the snapshot indicated
there, then the *anchor* qualifier is superfluous.
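
The key/value structure and the recommended ordering can be sketched as
follows. The names ``QUALIFIER_ORDER``, ``parse_qualified`` and
``format_qualified`` are hypothetical and not part of ``swh.model``; the
sketch assumes that any ``;`` inside a qualifier value has already been
percent-encoded, and it does not validate the core identifier or the values.

```python
# Recommended presentation order of qualifiers, as listed above.
QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")


def parse_qualified(swhid: str):
    """Split a qualified SWHID into its core part and a dict of qualifiers."""
    core, *parts = swhid.split(";")
    qualifiers = {}
    for part in parts:
        # Split on the first '=' only: values may themselves contain '='.
        key, sep, value = part.partition("=")
        if not sep or key not in QUALIFIER_ORDER:
            raise ValueError(f"invalid qualifier: {part!r}")
        qualifiers[key] = value
    return core, qualifiers


def format_qualified(core: str, qualifiers: dict) -> str:
    """Serialize qualifiers in the recommended order."""
    ordered = [f"{key}={qualifiers[key]}" for key in QUALIFIER_ORDER if key in qualifiers]
    return ";".join([core, *ordered])


core, qualifiers = parse_qualified(
    "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b"
    ";lines=9-15;path=/Examples/SimpleFarm/simplefarm.ml"
)
print(format_qualified(core, qualifiers))
```

Since qualifiers may be listed in any order, the parser accepts them as given
and only the formatter imposes the recommended order.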
Example
-------
The following `fully qualified identifier <https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
denotes the lines 9 to 15 of a file content that
can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory
of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained
in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from
the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
@@ -216,7 +230,9 @@ the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
lines=9-15
And this is an example of `a fully qualified identifier with a percent escaped file path <https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
And this is an example of `a fully qualified SWHID with a percent escaped file