orphan:

Persistent identifiers

You can point to objects present in the Software Heritage archive by means of persistent identifiers that are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage are described below. Note that they are identifiers and not URLs, even though a URL-based resolver for Software Heritage persistent identifiers is also provided.

A persistent identifier can point to any software artifact (or "object") available in the Software Heritage archive. Objects come in different types, most notably:

  • contents
  • directories
  • revisions
  • releases
  • snapshots

Each object is identified by an intrinsic, type-specific object identifier that is embedded in its persistent identifier as described below. Object identifiers are strong cryptographic hashes computed on the entire set of object properties to form a Merkle structure.

See :ref:`data-model` for an overview of object types and how they are linked together. See :py:mod:`swh.model.identifiers` for details on how intrinsic object identifiers are computed.

Syntax

Syntactically, persistent identifiers are generated by the <identifier> entry point of the grammar:

<identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
    "snp"  (* snapshot *)
  | "rel"  (* release *)
  | "rev"  (* revision *)
  | "dir"  (* directory *)
  | "cnt"  (* content *)
  ;
<object_id> ::= 40 * <hex_digit> ;  (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
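
As an illustration, the grammar above can be turned into a small validator. The following is a minimal sketch using a regular expression; the names CORE_SWHID_RE, CoreIdentifier, and parse_identifier are hypothetical and are not part of swh.model.

```python
import re
from typing import NamedTuple

# Mirrors the <identifier> grammar: the "swh" prefix, scheme version "1",
# one of the five object types, and exactly 40 lowercase hex digits.
# (Hypothetical helper, not the swh.model implementation.)
CORE_SWHID_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


class CoreIdentifier(NamedTuple):
    object_type: str
    object_id: str


def parse_identifier(text: str) -> CoreIdentifier:
    """Parse a persistent identifier without contextual information."""
    match = CORE_SWHID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid persistent identifier: {text!r}")
    return CoreIdentifier(match.group(1), match.group(2))


pid = parse_identifier("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2")
print(pid.object_type, pid.object_id)
```

Because the grammar only lists lowercase hex digits, this sketch rejects identifiers written with uppercase digits.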

Semantics

: is used as a separator between the logical parts of identifiers. The swh prefix makes explicit that these identifiers are related to SoftWare Heritage. 1 (<scheme_version>) is the current version of this identifier scheme; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of identifiers that conform to previous versions of the scheme).

A persistent identifier points to a single object, whose type is explicitly captured by <object_type>:

  • snp identifiers point to snapshots,
  • rel to releases,
  • rev to revisions,
  • dir to directories,
  • cnt to contents.

The actual object pointed to is identified by the intrinsic identifier <object_id>, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows:

  • for snapshots, intrinsic identifiers are computed as per :py:func:`swh.model.identifiers.snapshot_identifier`
  • for releases, as per :py:func:`swh.model.identifiers.release_identifier`
  • for revisions, as per :py:func:`swh.model.identifiers.revision_identifier`
  • for directories, as per :py:func:`swh.model.identifiers.directory_identifier`
  • for contents, the intrinsic identifier is the sha1_git hash, one of the multiple hashes returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string "blob" (without quotes), a space, the length of the content as decimal digits, a NUL byte, and the actual content of the file.
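
The recipe for contents can be sketched with the standard library alone. This is a minimal illustration of the byte layout described above, not the code used by swh.model; the function names content_object_id and content_persistent_identifier are hypothetical.

```python
import hashlib


def content_object_id(data: bytes) -> str:
    """Hex-encoded SHA1 of: "blob", a space, the decimal length, NUL, then the content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_persistent_identifier(data: bytes) -> str:
    """Embed the intrinsic identifier in a version 1 persistent identifier."""
    return "swh:1:cnt:" + content_object_id(data)


print(content_persistent_identifier(b""))
```

For empty content this yields the object id e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is also the identifier Git assigns to an empty blob.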

Git compatibility

Intrinsic object identifiers for contents, directories, revisions, and releases are, at present, compatible with the Git way of computing identifiers for its objects. A Software Heritage content identifier will be identical to a Git blob identifier of any file with the same content, a Software Heritage revision identifier will be identical to the corresponding Git commit identifier, etc. This is not the case for snapshot identifiers as Git doesn't have a corresponding object type.

Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git).

Examples

  • swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 points to the content of a file containing the full text of the GPL3 license
  • swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017
  • swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo support for masks
  • swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f points to Darktable release 2.3.0, dated 24 December 2016
  • swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub

Contextual information

It is often useful to complement persistent identifiers with contextual information about where the identified object has been found as well as which specific parts of it are of interest. To that end it is possible, via a dedicated syntax, to extend persistent identifiers with the following pieces of information:

  • the software origin where an object has been found/observed
  • the line number(s) of interest, usually within a content object

Syntax

The full syntax to complement identifiers with contextual information is given by the <identifier_with_context> entry point of the grammar:

<identifier_with_context> ::= <identifier> [<lines_ctxt>] [<origin_ctxt>] ;
<lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>] ;
<origin_ctxt> ::= ";" "origin" "=" <url> ;
<line_number> ::= <dec_digit> + ;
<url> ::= (* RFC 3986 compliant URLs *) ;
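
A parser for this extended syntax might look like the sketch below. It assumes that the lines part, when present, precedes the origin part (as in the grammar), and it does not check that the origin is an RFC 3986 compliant URL. The names CONTEXT_RE, IdentifierWithContext, and parse_with_context are hypothetical, and the origin URL in the usage line is a placeholder.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical helper: core identifier, then optional ";lines=", then
# optional ";origin=". The origin is taken verbatim up to the end of the string.
CONTEXT_RE = re.compile(
    r"(?P<core>swh:1:(?:snp|rel|rev|dir|cnt):[0-9a-f]{40})"
    r"(?:;lines=(?P<start>[0-9]+)(?:-(?P<end>[0-9]+))?)?"
    r"(?:;origin=(?P<origin>.+))?"
)


class IdentifierWithContext(NamedTuple):
    core: str
    lines: Optional[Tuple[int, int]]  # (first, last); equal for a single line
    origin: Optional[str]


def parse_with_context(text: str) -> IdentifierWithContext:
    match = CONTEXT_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid identifier with context: {text!r}")
    lines = None
    if match.group("start") is not None:
        start = int(match.group("start"))
        end = int(match.group("end")) if match.group("end") is not None else start
        lines = (start, end)
    return IdentifierWithContext(match.group("core"), lines, match.group("origin"))


parsed = parse_with_context(
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    ";lines=9-15;origin=https://example.org/repo.git"
)
print(parsed)
```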

Semantics

; is used as a separator between persistent identifiers and additional optional contextual information. Each piece of contextual information is specified as a key/value pair, using = as a separator.

The following pieces of contextual information are supported:

  • line numbers: it is possible to specify a single line number or a line range, separating two numbers with -. Note that line numbers are purely indicative and are not meant to be stable, as in some degenerate cases (e.g., text files which mix different types of line terminators) it is impossible to resolve them unambiguously.
  • software origin: where a given object has been found or observed in the wild, as the URI that was used by Software Heritage to ingest the object into the archive.
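
The ambiguity of line numbers can be seen in a short experiment. The sketch below splits the same text, which mixes line terminators, using two common conventions and then selects "line 2" from each result. The select_lines helper is hypothetical, and treating line numbers as 1-based and ranges as inclusive is an assumption, since the grammar does not say.

```python
from typing import List, Optional


def select_lines(lines: List[str], start: int, end: Optional[int] = None) -> List[str]:
    """Return lines start..end, assuming 1-based, inclusive numbering."""
    last = start if end is None else end
    return lines[start - 1:last]


# A text mixing CRLF, bare CR, and LF terminators.
text = "first\r\nsecond\rthird\nfourth"

lf_only = text.split("\n")     # only LF counts as a line break
universal = text.splitlines()  # CRLF, CR, and LF all count as line breaks

print(len(lf_only), select_lines(lf_only, 2))
print(len(universal), select_lines(universal, 2))
```

The two conventions disagree on both the number of lines and the content of line 2, which is why line numbers are best treated as indicative only.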

Resolution

Dedicated resolvers

Persistent identifiers can be resolved using the Software Heritage Web application (see :py:mod:`swh.web`). In particular, the root endpoint / can be given a persistent identifier and will lead to the browsing page of the corresponding object, like this: https://archive.softwareheritage.org/<identifier>.

A dedicated /resolve endpoint of the HTTP API is also available to explicitly request persistent identifier resolution; see: :http:get:`/api/1/resolve/(swh_id)/`.
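
The two resolution URLs can be assembled as in the sketch below. It only builds strings and performs no network request. It assumes that the HTTP API is served from the same host as the Web application; the constant BASE_URL and the function names are hypothetical.

```python
# Assumption: the API lives on the same host as the browsing interface.
BASE_URL = "https://archive.softwareheritage.org"


def browse_url(identifier: str) -> str:
    """URL of the root endpoint, which leads to the browsing page of the object."""
    return f"{BASE_URL}/{identifier}"


def resolve_api_url(identifier: str) -> str:
    """URL of the dedicated /resolve endpoint of the HTTP API."""
    return f"{BASE_URL}/api/1/resolve/{identifier}/"


swh_id = "swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d"
print(browse_url(swh_id))
print(resolve_api_url(swh_id))
```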

Examples:

External resolvers

The following independent resolvers support resolution of Software Heritage persistent identifiers: