Commit 9f5d266b authored by Stefano Zacchiroli, committed by Roberto Di Cosmo

SWHID spec: full reread

Reviewers: rdicosmo

Reviewed By: rdicosmo

Differential Revision: https://forge.softwareheritage.org/D3108
parent b80b1358
......@@ -4,22 +4,29 @@
SoftWare Heritage persistent IDentifiers (SWHIDs)
=================================================
**version 1.3, last modified 2020-04-28**
**version 1.4, last modified 2020-04-30**
.. contents::
:local:
:depth: 2
Overview
========
You can point to objects present in the Software Heritage archive by the means
of **SoftWare Heritage persistent IDentifiers**, or **SWHIDs** for short, that
are guaranteed to remain stable (persistent) over time. Their syntax, meaning,
and usage is described below. Note that they are identifiers and not URLs, even
though URL-based resolvers for SWHIDs are also available.
You can point to objects present in the `Software Heritage
<https://www.softwareheritage.org/>`_ `archive
<https://archive.softwareheritage.org/>`_ by means of **SoftWare Heritage
persistent IDentifiers**, or **SWHIDs** for short, that are guaranteed to
remain stable (persistent) over time. Their syntax, meaning, and usage are
described below. Note that they are identifiers and not URLs, even though
URL-based `resolvers`_ for SWHIDs are also available.
A SWHID consists of two separate parts, a *core identifier* that can point to
any software artifact (or "object") available in the Software Heritage archive,
and an *optional list of qualifiers* that allows to specify the context where
the object is meant to be seen, or point to a subpart of the object itself.
A SWHID consists of two separate parts, a mandatory *core identifier* that can
point to any software artifact (or "object") available in the Software Heritage
archive, and an optional list of *qualifiers* that specify the context where
the object is meant to be seen and point to a subpart of the object itself.
Objects come in different types:
......@@ -33,7 +40,8 @@ Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its SWHID as described below. The intrinsic identifiers embedded
in SWHIDs are strong cryptographic hashes computed on the entire set of object
properties. Together, these identifiers form a `Merkle structure
<https://en.wikipedia.org/wiki/Merkle_tree>`_, specifically a Merkle DAG.
<https://en.wikipedia.org/wiki/Merkle_tree>`_, specifically a Merkle `DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>`_.
See the :ref:`Software Heritage data model <data-model>` for an overview of
object types and how they are linked together. See
......@@ -42,23 +50,24 @@ embedded in SWHIDs are computed.
The optional qualifiers are of two kinds:
* *context qualifiers* carry information about the context where a given
object is meant to be seen; this is particularly important, as the same object
can be reached in the Merkle graph following different *paths* from different
nodes (or *anchors*), and it may have been retrieved from different *origins*,
that may evolve between different *visits*,
* *fragment qualifiers* allow to pinpoint specific subparts of an object
* **context qualifiers:** carry information about the context where a given
object is meant to be seen. This is particularly important, as the same
object can be reached in the Merkle graph following different *paths*
starting from different nodes (or *anchors*), and it may have been retrieved
from different *origins*, that may evolve between different *visits*
* **fragment qualifiers:** pinpoint specific subparts of an object
Syntax
------
======
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point of the
grammar:
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point in the
following grammar:
.. code-block:: bnf
<identifier> ::= <identifier_core> [ <qualifierlist> ] ;
<identifier> ::= <identifier_core> [ <qualifiers> ] ;
<identifier_core> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
......@@ -71,7 +80,8 @@ grammar:
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
<qualifierlist> ::= <qualifier> [ <qualifierlist> ] ;
<qualifiers> ::= ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
<context_qualifier>
| <fragment_qualifier>
......@@ -82,14 +92,14 @@ grammar:
| <anchor_ctxt>
| <path_ctxt>
;
<origin_ctxt> ::= ";" "origin" "=" <url_escaped> ;
<visit_ctxt> ::= ";" "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= ";" "anchor" "=" <identifier_core> ;
<path_ctxt> ::= ";" "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= ";" "lines" "=" <line_number> ["-" <line_number>] ;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <line_number> ["-" <line_number>] ;
<line_number> ::= <dec_digit> + ;
<url_escaped> ::= (* RFC 3986 compliant URLs, percent-escaped *)
<path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *)
<url_escaped> ::= (* RFC 3987 IRI *)
<path_absolute_escaped> ::= (* RFC 3987 absolute path *)
Where:
......@@ -105,17 +115,18 @@ embeddability of SWHID in other contexts.
Semantics
---------
=========
Core identifiers
~~~~~~~~~~~~~~~~
----------------
``:`` is used as separator between the logical parts of core identifiers. The ``swh``
prefix makes explicit that these identifiers are related to *SoftWare
``:`` is used as a separator between the logical parts of core identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*; future editions will use higher version numbers, possibly
breaking backward compatibility (but without breaking the resolvability of
SWHIDs that conform to previous versions of the scheme).
identifier *scheme*. Future editions will use higher version numbers, possibly
breaking backward compatibility, but without breaking the resolvability of
SWHIDs that conform to previous versions of the scheme.
A SWHID points to a single object, whose type is explicitly captured by
``<object_type>``:
......@@ -151,23 +162,27 @@ computed on the content and metadata of the object itself, as follows:
quotes), a space, the length of the content as decimal digits, a NULL byte,
and the actual content of the file.
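
As a minimal sketch of the content case described in the last item above, the following Python function hashes the string ``blob``, a space, the content length in decimal digits, a NULL byte, and the content itself with SHA1. The function name ``content_swhid`` is hypothetical and is not part of any Software Heritage package; only the standard ``hashlib`` module is used.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core SWHID of a content object holding exactly `data`.

    Hypothetical helper, written only to illustrate the computation.
    """
    # Header: the string "blob", a space, the length in decimal digits,
    # and a NULL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    # The intrinsic identifier is the hex-encoded SHA1 of header + content.
    object_id = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{object_id}"
```

To identify a file on disk, read it in binary mode and pass the bytes, for example ``content_swhid(open(path, "rb").read())``. The other object types need their own serialization and are not covered by this sketch.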
Qualifiers
~~~~~~~~~~
----------
``;`` is used as a separator between the core identifier and the optional
qualifiers, and optional qualifiers. Each qualifier is specified as a
qualifiers, as well as between qualifiers. Each qualifier is specified as a
key/value pair, using ``=`` as a separator.
The following *context qualifiers* are available:
* **origin** : the *software origin* where an object has been found or observed
* **origin:** the *software origin* where an object has been found or observed
in the wild, as a URI;
* **visit** : the core identifier of a *snapshot* corresponding to a specific
* **visit:** the core identifier of a *snapshot* corresponding to a specific
*visit* of a repository containing the designated object;
* **anchor** : a *designated node* in the Merkle DAG relative to which a *path
* **anchor:** a *designated node* in the Merkle DAG relative to which a *path
to the object* is specified, as the core identifier of a directory, a
revision, a release or a snapshot;
* **path** : the *absolute file path*, from the *root directory* associated to
* **path:** the *absolute file path*, from the *root directory* associated to
the *anchor node*, to the object; when the anchor denotes a directory or a
revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
......@@ -176,7 +191,7 @@ The following *context qualifiers* are available:
The following *fragment qualifier* is available:
* **lines** : *line number(s)* of interest, usually within a content object
* **lines:** *line number(s)* of interest, usually within a content object
We recommend equipping identifiers meant to be shared with as many qualifiers as
possible. While qualifiers may be listed in any order, it is good practice to
......@@ -186,44 +201,69 @@ the *visit* is present, and the *path* is relative to the snapshot indicated
there, then the *anchor* qualifier is superfluous; similarly, if the *path* is
empty, it may be omitted.
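
The split between core identifier and qualifiers can be illustrated with a small parser. This is a sketch under stated assumptions, not the reference implementation: the name ``parse_swhid`` is hypothetical, the object type tags are the ones that appear in the examples of this document (``snp``, ``rel``, ``rev``, ``dir``, ``cnt``), and qualifier values are assumed to be percent-escaped so that a literal ``;`` never appears inside a value.

```python
import re
from typing import Dict, Tuple
from urllib.parse import unquote

# Core identifier: "swh", scheme version 1, object type, 40 hex digits.
CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}")
QUALIFIER_KEYS = {"origin", "visit", "anchor", "path", "lines"}


def parse_swhid(swhid: str) -> Tuple[str, Dict[str, str]]:
    """Split a SWHID into its core identifier and a dict of qualifiers."""
    # ";" separates the core identifier from qualifiers, and qualifiers
    # from each other.
    core, *raw_qualifiers = swhid.split(";")
    if not CORE_RE.fullmatch(core):
        raise ValueError(f"invalid core identifier: {core!r}")
    qualifiers: Dict[str, str] = {}
    for raw in raw_qualifiers:
        # Each qualifier is a key/value pair; only the first "=" separates
        # them, so "=" may still occur inside the value.
        key, sep, value = raw.partition("=")
        if not sep or key not in QUALIFIER_KEYS:
            raise ValueError(f"invalid qualifier: {raw!r}")
        qualifiers[key] = unquote(value)  # undo percent-escaping
    return core, qualifiers
```

Since qualifiers may be listed in any order, the sketch returns them as a dictionary rather than a list. It does not check that ``visit`` and ``anchor`` values are themselves valid core identifiers.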
Interoperability
================
URI scheme
----------
The ``swh`` URI scheme is registered at IANA for SWHIDs. The present document
constitutes the specification for this URI scheme.
Git compatibility
~~~~~~~~~~~~~~~~~
-----------------
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
The ``<object_id>`` part of a SWHID for a content object is the Git blob
identifier of any file with the same content; for a revision it is the Git
commit identifier for the same revision, etc. This is not the case for snapshot
identifiers, as Git does not have a corresponding object type.
commit identifier for the same revision, etc. This is not the case for
snapshot identifiers, as Git does not have a corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
Examples
--------
========
Core identifiers
~~~~~~~~~~~~~~~~
----------------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Identifiers with qualifiers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
---------------------------
* The following `fully qualified SWHID <https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_ denotes the lines 9 to 15 of a file content that can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``
* The following `SWHID
<https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
denotes the lines 9 to 15 of a file content that can be found at absolute
path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the
revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is
contained in the snapshot
``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin
``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``:
.. code-block:: url
......@@ -234,8 +274,9 @@ Identifiers with qualifiers
path=/Examples/SimpleFarm/simplefarm.ml;
lines=9-15
* This is an example of `a fully qualified SWHID with a percent escaped file path <https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
* Here is an example of a `SWHID
<https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
with a file path that requires percent-escaping:
.. code-block:: url
......@@ -246,11 +287,23 @@ Identifiers with qualifiers
path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
Computing and resolving SWHIDs
==============================
Implementation
==============
Computing
---------
An important property of any SWHID is that its core identifier is *intrinsic*:
it can be *computed from the object itself*, without having to rely on any
third party. An implementation of SWHIDs that does so locally is the
`swh identify <https://docs.softwareheritage.org/devel/swh-model/cli.html>`_
tool, available from the `swh.model <https://pypi.org/project/swh.model/>`_
Python package under the GPL license.
An important property of SWHIDs is that a core identifier is *intrinsic*: it can
be *computed from the object itself* using the `swh-identify <https://docs.softwareheritage.org/devel/swh-model/cli.html>`_ utility, or equivalently using standard git tools.
SWHIDs are also automatically computed by Software Heritage for all archived
objects as part of its archival activity, and can be looked up via the project
`Web interface <https://archive.softwareheritage.org>`_.
This has various practical implications:
......@@ -259,19 +312,26 @@ This has various practical implications:
just compute the core identifier from the artefact itself, and check that it
is the same as the core identifier part of the SWHID
* the core identifier of a software artifact can be computed *before* its archival on
Software Heritage
* the core identifier of a software artifact can be computed *before* its
archival on Software Heritage
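
The integrity check in the first item can be sketched for content objects as follows. The helper name ``matches_content`` is hypothetical, and the sketch assumes the Git-compatible blob hashing described earlier in this document. It ignores any qualifiers, since only the core identifier is intrinsic.

```python
import hashlib


def matches_content(swhid: str, data: bytes) -> bool:
    """Check that `data` is the content object designated by `swhid`."""
    # Keep only the core identifier; qualifiers carry context, not identity.
    core = swhid.split(";", 1)[0]
    # Recompute the identifier locally from the bytes alone.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    expected = "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
    return core == expected
```

No network access is needed for this check, which is also why the same identifier can be computed before the artifact has been archived.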
Resolvers
---------
SWHIDs can be resolved using the Software Heritage Web application (see
:py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a
SWHID and will lead to the browsing page of the corresponding object, like
this: ``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`.
Software Heritage resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~
SWHIDs can be resolved using the Software Heritage `Web interface
<https://archive.softwareheritage.org>`_. In particular, the **root endpoint**
``/`` can be given a SWHID and will lead to the browsing page of the
corresponding object, like this:
``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the Software Heritage `Web API
<https://archive.softwareheritage.org/api/>`_ is also available to
programmatically resolve SWHIDs; see: :http:get:`/api/1/resolve/(swh_id)/`.
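
The two resolution URLs can be built with plain string formatting. The sketch below only constructs URLs from the patterns given above; the function names are hypothetical, and actually fetching the URL (and interpreting the API response, whose format is not described here) is left to any HTTP client.

```python
ARCHIVE_URL = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the browsing page, via the root endpoint of the Web interface."""
    return f"{ARCHIVE_URL}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the dedicated resolve endpoint of the Web API."""
    return f"{ARCHIVE_URL}/api/1/resolve/{swhid}/"
```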
Examples:
......@@ -283,10 +343,11 @@ Examples:
* `<https://archive.softwareheritage.org/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
* `<https://archive.softwareheritage.org/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
External resolvers
~~~~~~~~~~~~~~~~~~
The following **independent resolvers** support resolution of SWHIDs:
Third-party resolvers
~~~~~~~~~~~~~~~~~~~~~
The following **third-party resolvers** support SWHID resolution:
* `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
......@@ -294,6 +355,10 @@ The following **independent resolvers** support resolution of SWHIDs:
* `Name-to-Thing (N2T) <https://n2t.net/>`_
Note that resolution via Identifiers.org currently only supports *core
identifiers* due to `syntactic incompatibilities with qualifiers
<http://identifiers.org/documentation#custom_requests>`_.
Examples:
* `<https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
......@@ -304,8 +369,6 @@ Examples:
* `<https://n2t.net/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
* `<https://n2t.net/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/>`_
Note that resolution via Identifiers.org currently only supports *core identifiers* due to `syntactic incompatibilities with qualifiers <http://identifiers.org/documentation#custom_requests>`_.
References
==========
......