Stefano Zacchiroli authored
SoftWare Heritage persistent IDentifiers (SWHID)
version 1.2
Description
You can point to objects present in the Software Heritage archive by means of SoftWare Heritage persistent IDentifiers, or SWHIDs for short, which are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage are described below. Note that they are identifiers and not URLs, even though a URL-based resolver for Software Heritage persistent identifiers is also provided.
A SWHID can point to any software artifact (or "object") available in the Software Heritage archive. Objects come in different types, most notably:
- contents
- directories
- revisions
- releases
- snapshots
Each object is identified by an intrinsic, type-specific object identifier that is embedded in its SWHID as described below. SWHIDs are strong cryptographic hashes computed on the entire set of object properties to form a Merkle structure.
See the :ref:`Software Heritage data model <data-model>` for an overview of object types and how they are linked together. See :py:mod:`swh.model.identifiers` for details on how SWHIDs are computed.
Syntax
Syntactically, SWHIDs are generated by the <identifier> entry point of the grammar:
<identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
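The grammar above is small enough to be checked with a regular expression. The following is a minimal sketch, not part of Software Heritage's tooling: the names SWHID_CORE_RE and parse_core_swhid are hypothetical and introduced here only for illustration.

```python
import re

# Regular expression mirroring the <identifier> grammar: the "swh" prefix,
# scheme version 1, a three-letter object type, and 40 lowercase hex digits.
# SWHID_CORE_RE and parse_core_swhid are illustrative names (assumptions).
SWHID_CORE_RE = re.compile(
    r"swh:1:(?P<object_type>snp|rel|rev|dir|cnt):(?P<object_id>[0-9a-f]{40})"
)


def parse_core_swhid(text):
    """Return (object_type, object_id) for a valid core SWHID, else None."""
    match = SWHID_CORE_RE.fullmatch(text)
    if match is None:
        return None
    return match.group("object_type"), match.group("object_id")
```

Because the grammar only lists lowercase hexadecimal digits, this sketch rejects identifiers written with uppercase digits, as well as unknown object types and scheme versions other than 1.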
Semantics
The colon (:) is used as separator between the logical parts of SWHIDs. The swh prefix makes explicit that these identifiers are related to SoftWare Heritage. The scheme version (<scheme_version>) is currently 1; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of SWHIDs that conform to previous versions of the scheme).
A SWHID points to a single object, whose type is explicitly captured by <object_type>:
- snp to snapshots,
- rel to releases,
- rev to revisions,
- dir to directories,
- cnt to contents.
The actual object pointed to is identified by the intrinsic identifier <object_id>, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows:
- for snapshots, intrinsic identifiers are computed as per :py:func:`swh.model.identifiers.snapshot_identifier`
- for releases, as per :py:func:`swh.model.identifiers.release_identifier`
- for revisions, as per :py:func:`swh.model.identifiers.revision_identifier`
- for directories, as per :py:func:`swh.model.identifiers.directory_identifier`
- for contents, the intrinsic identifier is the sha1_git hash among the multiple hashes returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string "blob" (without quotes), a space, the length of the content as decimal digits, a NUL byte, and the actual content of the file.
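The rule for contents can be reproduced with the Python standard library alone. The sketch below follows the byte sequence described in the last item; it does not call swh.model, and the function name content_swhid is a hypothetical helper introduced for illustration.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a content object from its raw bytes.

    Illustrative helper (an assumption, not the swh.model API).
    """
    # Header: "blob", a space, the content length in decimal, and a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    # hexdigest() yields lowercase hexadecimal digits, as the grammar requires.
    object_id = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + object_id
```

For instance, content_swhid(b"") returns swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, whose <object_id> part is the identifier Git assigns to an empty blob.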
Git compatibility
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the Git way of computing identifiers for its objects.
A SWHID for a content object will correspond (in its <object_id> part) to a
Git blob identifier of any file with the same content; a SWHID for a revision
will correspond to the Git commit identifier for the same revision, etc. This
is not the case for snapshot identifiers, as Git does not have a corresponding
object type.
Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git).
Examples
- swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 points to the content of a file containing the full text of the GPL3 license
- swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017
- swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo support for masks
- swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f points to Darktable release 2.3.0, dated 24 December 2016
- swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Contextual information
The SWHIDs described above are intrinsic identifiers, as they are computed from the designated object itself. It is often useful, however, to provide contextual information about a particular occurrence of the object, such as the origin where the object was found. To this end, SWHIDs can be coupled with qualifiers that capture such contextual information. Qualifiers come in different kinds:
- origin
- visit
- anchor
- path
- lines
Syntax
The full syntax to complement SWHIDs with contextual information is given by the <identifier_with_context> entry point of the grammar:
<identifier_with_context> ::= <identifier> [ <qualifierlist> ]
<qualifierlist> ::= <qualifier> [ <qualifierlist> ]
<qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> | <lines_ctxt>
<origin_ctxt> ::= ";" "origin" "=" <url>
<visit_ctxt> ::= ";" "visit" "=" <identifier>
<anchor_ctxt> ::= ";" "anchor" "=" <identifier>
<path_ctxt> ::= ";" "path" "=" <path_absolute_escaped>
<lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>]
<line_number> ::= <dec_digit> +
<url> ::= (* RFC 3986 compliant URLs *)
<path_absolute_escaped> ::= (* RFC 3986 compliant absolute file path, percent-escaped *)
Here <path_absolute_escaped> is the <path_absolute> in Section 3.3 of RFC 3986 where all occurrences of ; and % must be percent-encoded (as %3B and %25 respectively).
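The escaping rule for paths can be sketched in a few lines. The helpers below are hypothetical (escape_path and unescape_path are names introduced here) and assume that the input is already a valid RFC 3986 absolute path, so that only ; and % still need to be encoded.

```python
from urllib.parse import unquote


def escape_path(path: str) -> str:
    """Percent-encode the two characters that must be escaped in a path qualifier."""
    # "%" is handled first; otherwise the "%" of a freshly written "%3B"
    # would itself be encoded again.
    return path.replace("%", "%25").replace(";", "%3B")


def unescape_path(escaped: str) -> str:
    """Decode percent-escapes (every %XX sequence, not only %3B and %25)."""
    return unquote(escaped)
```

Handling % before ; matters: with the opposite order, a path containing ; would end up as %253B instead of %3B.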
Semantics
The semicolon (;) is used as separator between SWHIDs and the optional contextual information qualifiers. Each contextual information qualifier is specified as a key/value pair, using = as a separator.
The following pieces of contextual information are supported:
- origin : the software origin where an object has been found or observed in the wild, as a URI;
- visit : persistent identifier of a snapshot corresponding to a specific visit of a repository containing the designated object;
- anchor : a designated node in the Merkle DAG relative to which a path to the object is specified, as a persistent identifier of a directory, a revision, a release or a snapshot;
- path : the absolute file path, from the root directory associated with the anchor node, to the object; when the anchor denotes a directory or a revision, and almost always when it's a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the one pointed to by HEAD (possibly indirectly), and undefined if such a reference is missing;
- lines : line number(s) of interest, usually within a content object.
We recommend equipping identifiers meant to be shared with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the order given above, i.e., origin, visit, anchor, path, lines. Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous.
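The recommended ordering is easy to enforce when identifiers are assembled programmatically. The following sketch uses hypothetical names (QUALIFIER_ORDER, format_qualified_swhid) and assumes that qualifier values are already valid, in particular that paths are already percent-escaped.

```python
# Recommended presentation order for qualifiers, as listed in the text.
QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")


def format_qualified_swhid(core: str, qualifiers: dict) -> str:
    """Append qualifiers to a core SWHID in the recommended order.

    Illustrative helper (an assumption); values are used verbatim.
    """
    unknown = set(qualifiers) - set(QUALIFIER_ORDER)
    if unknown:
        raise ValueError(f"unsupported qualifiers: {sorted(unknown)}")
    parts = [core]
    for key in QUALIFIER_ORDER:
        if key in qualifiers:
            parts.append(f"{key}={qualifiers[key]}")
    return ";".join(parts)
```

The order in which keys were inserted into the dictionary does not matter: the output always follows origin, visit, anchor, path, lines, and absent qualifiers are simply skipped.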
Example
The following fully qualified SWHID denotes the lines 9 to 15 of a file content that can be found at absolute path /Examples/SimpleFarm/simplefarm.ml from the root directory of the revision swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 that is contained in the snapshot swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 taken from the origin https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git.
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
path=/Examples/SimpleFarm/simplefarm.ml;
lines=9-15
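A fully qualified SWHID such as this one can be taken apart by splitting on the two separators. The sketch below is illustrative only: parse_qualified_swhid is a hypothetical name, values are not validated, and an origin URL that itself contains an unescaped ; would not be handled correctly.

```python
def parse_qualified_swhid(text: str):
    """Split a qualified SWHID into its core identifier and a qualifier dict.

    Illustrative helper (an assumption); performs no validation of values.
    """
    # Drop whitespace so that identifiers wrapped over several lines for
    # display are accepted too.
    compact = "".join(text.split())
    core, *raw_qualifiers = compact.split(";")
    qualifiers = {}
    for raw in raw_qualifiers:
        # Split on the first "=" only: values may contain "=" themselves.
        key, sep, value = raw.partition("=")
        if not sep:
            raise ValueError(f"malformed qualifier: {raw!r}")
        qualifiers[key] = value
    return core, qualifiers
```

Splitting each qualifier on its first = only keeps values intact when they contain = themselves, which can happen in escaped paths.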
The following is an example of a fully qualified SWHID with a percent-escaped file path:
swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;
origin=https://github.com/web-platform-tests/wpt;
visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;
anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;
path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
Resolution
Dedicated resolvers
SWHIDs can be resolved using the Software Heritage Web application (see :py:mod:`swh.web`). In particular, the root endpoint / can be given a SWHID and will lead to the browsing page of the corresponding object, like this: https://archive.softwareheritage.org/<identifier>.
A dedicated /resolve endpoint of the HTTP API is also available to explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`.
Examples:
- https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
- https://archive.softwareheritage.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
- https://archive.softwareheritage.org/api/1/resolve/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
- https://archive.softwareheritage.org/api/1/resolve/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
- https://archive.softwareheritage.org/api/1/resolve/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453
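Both kinds of resolver URL shown above can be derived mechanically from a SWHID. The sketch below only builds the URL strings and performs no network request; the names ARCHIVE_BASE, browse_url, and resolve_api_url are hypothetical, and it targets core SWHIDs (any URL encoding that qualified identifiers might require is not covered).

```python
# Base URL of the archive, as used in the examples above.
ARCHIVE_BASE = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the browsing page: the SWHID appended to the root endpoint."""
    return f"{ARCHIVE_BASE}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the explicit resolution endpoint of the HTTP API."""
    return f"{ARCHIVE_BASE}/api/1/resolve/{swhid}"
```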
External resolvers
The following independent resolvers support resolution of SWHIDs:
- Identifiers.org; see: http://identifiers.org/swh/ (registry identifier MIR:00000655).
- Name-to-Thing (N2T)
Examples:
- https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
- https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
- https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
- https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
- https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453
Note that resolution via Identifiers.org does not support contextual information, due to syntactic incompatibilities.