.. _data-model:
Software Heritage data model
============================
.. note:: The text below is adapted from §7 of the article `Software Heritage:
Why and How to Preserve Software Source Code
<https://hal.archives-ouvertes.fr/hal-01590958/>`_ (in proceedings of `iPRES
2017 <https://ipres2017.jp/>`_, 14th International Conference on Digital
Preservation, by Roberto Di Cosmo and Stefano Zacchiroli), which also
provides a more general description of Software Heritage for the digital
preservation research community.
In any archival project the choice of the underlying data model—at the logical
level, independently of how data is actually stored on physical media—is
paramount. The data model adopted by Software Heritage to represent the
information that it collects is centered around the notion of *software
artifact*, described below.
It is important to notice that, according to our principles, we must store,
together with every software artifact, full information on where it has been
found (its provenance). Provenance is also captured in our data model, so we
start by providing some basic information on its nature.
Source code hosting places
--------------------------
Currently, Software Heritage uses a curated list of source code hosting
places to crawl. The most common entries we expect to place in such a list are
popular collaborative development forges (e.g., GitHub, Bitbucket), package
manager repositories that host source packages (e.g., CPAN, npm), and FOSS
distributions (e.g., Fedora, FreeBSD). But we may of course also allow more
niche entries, such as URLs of personal or institutional project collections
not hosted on major forges.
While currently entirely manual, the curation of such a list could easily be
made semi-automatic, with entries suggested by fellow archivists and/or
concerned users who want to notify Software Heritage of specific pieces of
endangered source code in need of archiving. This approach is entirely
compatible with Web-wide crawling: crawlers capable of detecting the presence
of source code might enrich the list. In both cases the list will remain
curated, with (semi-automated) review processes that entries will need to pass
before a hosting place starts to be used.
Software artifacts
------------------
Once the hosting places are known, they will need to be looked at periodically
in order to add missing software artifacts to the archive. Which software
artifacts will be found there?
In general, each software distribution mechanism hosts multiple releases of a
given software at any given time. For VCS (Version Control Systems), this is
the natural behaviour; for software packages, while a single version of a
package is just a snapshot of the corresponding software product, one can often
retrieve both current and past versions of the package from its distribution
site.
By reviewing and generalizing existing VCS and source package formats, we have
identified the following recurrent artifacts as commonly found at source code
hosting places. They form the basic ingredients of the Software Heritage
archive. As the terminology varies quite a bit from technology to technology,
we provide below both the canonical name used in Software Heritage and popular
synonyms.
**contents** (AKA "blobs")
the raw content of (source code) files as a sequence of bytes, without file
names or any other metadata. File contents are often recurrent, e.g., across
different versions of the same software, different directories of the same
project, or different projects altogether.
**directories**
a list of named directory entries, each pointing to other artifacts, usually
file contents or sub-directories. Directory entries are also associated with
metadata stored as permission bits.
**revisions** (AKA "commits")
software development within a specific project is essentially a time-indexed
series of copies of a single "root" directory that contains the entire
project source code. Software evolves when a developer modifies the content
of one or more files in that directory and records their changes.
Each recorded copy of the root directory is known as a "revision". It points
to a fully-determined directory and is equipped with arbitrary metadata. Some
of those are added manually by the developer (e.g., commit message), others
are automatically synthesized (timestamps, preceding commit(s), etc).
**releases** (AKA "tags")
some revisions are more equal than others and get selected by developers as
denoting important project milestones known as "releases". Each release
points to the last commit in project history corresponding to the release and
carries metadata: release name and version, release message, cryptographic
signatures, etc.
Additionally, the following crawling-related information is stored as
provenance information in the Software Heritage archive:
**origins**
code "hosting places" as previously described are usually large platforms
that host several unrelated software projects. For software provenance
purposes it is important to be more specific than that.
Software origins are fine-grained references to where source code artifacts
archived by Software Heritage have been retrieved from. They take the form of
``(type, url)`` pairs, where ``url`` is a canonical URL (e.g., the address at
which one can ``git clone`` a repository or download a source tarball) and
``type`` the kind of software origin (e.g., git, svn, or dsc for Debian
source packages).
..
**projects**
as commonly intended are more abstract entities than precise software
origins. Projects relate together several development resources, including
websites, issue trackers, mailing lists, as well as software origins as
intended by Software Heritage.
The debate around the most apt ontologies to capture project-related
information for software hasn't settled yet, but the place projects will take
in the Software Heritage archive is fairly clear. Projects are abstract
entities, which will be arbitrarily nestable in a versioned
project/sub-project hierarchy, and which can be associated with arbitrary
metadata as well as with origins where their source code can be found.
**snapshots**
any kind of software origin offers multiple pointers to the "current" state
of a development project. In the case of VCS this is reflected by branches
(e.g., master, development, but also so-called feature branches dedicated to
extending the software in a specific direction); in the case of package
distributions by notions such as suites that correspond to different maturity
levels of individual packages (e.g., stable, development, etc.).
A "snapshot" of a given software origin records all entry points found there
and where each of them was pointing at the time. For example, a snapshot
object might track the commit where the master branch was pointing to at any
given time, as well as the most recent release of a given package in the
stable suite of a FOSS distribution.
**visits**
link software origins with snapshots. Every time an origin is consulted a new
visit object is created, recording when (according to the Software Heritage
clock) the visit happened and the full snapshot of the state of the software
origin at the time.
.. note::
This model currently records visits as a single point in time. However, the
actual visit process is not instantaneous. Loaders can record successive
changes to the state of the visit, as their work progresses, as updates to
the visit object.
Data structure
--------------
.. _swh-merkle-dag:
.. figure:: images/swh-merkle-dag.svg
:width: 1024px
:align: center
Software Heritage archive as a Merkle DAG, augmented with crawling
information (click to zoom).
With all the bits of what we want to archive in place, the next question is how
to organize them, i.e., which logical data structure to adopt for their
storage. A key observation for this decision is that source code artifacts are
massively duplicated. This is so for several reasons:
* code hosting diaspora (i.e., project development moving to the most
recent/cool collaborative development technology over time);
* copy/paste (AKA "vendoring") of parts or entire external FOSS software
components into other software products;
* large overlap between revisions of the same project: usually only a very
small amount of files/directories are modified by a single commit;
* emergence of DVCS (distributed version control systems), which natively work
by replicating entire repository copies around. GitHub-style pull requests
are the pinnacle of this, as they result in creating an additional repository
copy at each change done by a new developer;
* migration from one VCS to another—e.g., migrations from Subversion to Git,
which are really popular these days—resulting in additional copies, but in a
different distribution format, of the very same development histories.
These trends seem to be neither stopping nor slowing down, and it is reasonable
to expect that they will be even more prominent in the future, due to the
decreasing costs of storage and bandwidth.
For this reason we argue that any sustainable storage layout for archiving
source code in the very long term should support deduplication, so that the
cost of storing an artifact that is encountered more than once is paid only
once. For storage efficiency, deduplication should be supported for all the
software artifacts we have discussed, namely: file contents, directories,
revisions, releases, snapshots.
Realizing that principle, the Software Heritage archive is conceptually a
single (big) `Merkle Directed Acyclic Graph (DAG)
<https://en.wikipedia.org/wiki/Merkle_tree>`_, as depicted in Figure
:ref:`Software Heritage Merkle DAG <swh-merkle-dag>`. In such a graph each of
the artifacts we have described—from file contents up to entire
snapshots—corresponds to a node. Edges between nodes emerge naturally:
directory entries point to other directories or file contents; revisions point
to directories and previous revisions, releases point to revisions, snapshots
point to revisions and releases. Additionally, each node contains all metadata
that are specific to the node itself rather than to pointed nodes; e.g., commit
messages, timestamps, or file names. Note that the structure is really a DAG,
and not a tree, because lines of revision nodes may fork and be merged
back.
..
directory: fff3cc22cb40f71d26f736c082326e77de0b7692
parent: e4feb05112588741b4764739d6da756c357e1f37
author: Stefano Zacchiroli <zack@upsilon.cc>
date: 1443617461 +0200
committer: Stefano Zacchiroli <zack@upsilon.cc>
committer_date: 1443617461 +0200
message:
objstorage: fix tempfile race when adding objects
Before this change, two workers adding the same
object will end up racing to write <SHA1>.tmp.
[...]
revisionid: 64a783216c1ec69dcb267449c0bbf5e54f7c4d6d
A revision node in the Software Heritage DAG
In a Merkle structure each node is identified by an intrinsic identifier
computed as a cryptographic hash of the node content. In the case of Software
Heritage identifiers are computed taking into account both node-specific
metadata and the identifiers of child nodes.
Consider the revision node in the figure whose identifier starts with
`c7640e08d..`. It points to a directory (identifier starting with
`45f0c078..`), which has also been archived. That directory contains a full
copy, at a specific point in time, of a software component—in the example the
`Hello World <https://forge.softwareheritage.org/source/helloworld/>`_ software
component available on our forge. The revision node also points to the
preceding revision node (`43ef7dcd..`) in the project development history.
Finally, the node contains revision-specific metadata, such as the author and
committer of the given change, its timestamps, and the message entered by the
author at commit time.
The identifier of the revision node itself (`c7640e08d..`) is computed as a
cryptographic hash of (a canonical representation of) all the information shown
in the figure. A change in any of it—metadata and/or pointed nodes—would result
in an entirely different node identifier. All other types of nodes in the
Software Heritage archive behave similarly.
The Software Heritage archive inherits useful properties from the underlying
Merkle structure. In particular, deduplication is built in. Any software
artifact encountered in the wild gets added to the archive only if a
corresponding node with a matching intrinsic identifier is not already
present in the graph—file contents, commits, entire directories, and project
snapshots are all deduplicated, incurring storage costs only once.
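To see why the Merkle construction gives deduplication for free, consider the
following toy sketch (illustrative only, not Software Heritage code), where an
object store keyed by intrinsic identifier keeps each distinct artifact exactly
once, and a node identifier covers both the node's own metadata and the
identifiers of its children:

.. code-block:: python

   import hashlib
   import json

   store = {}  # intrinsic identifier -> node

   def add_node(metadata: dict, children: list) -> str:
       # Hash a canonical representation of the node's metadata together with
       # the identifiers of the nodes it points to (the Merkle construction).
       canonical = json.dumps({"meta": metadata, "children": children}, sort_keys=True)
       node_id = hashlib.sha1(canonical.encode()).hexdigest()
       store.setdefault(node_id, {"meta": metadata, "children": children})
       return node_id

   blob = add_node({"type": "content", "data": "print('hello')"}, [])
   dir1 = add_node({"type": "directory", "entry": "hello.py"}, [blob])
   dir2 = add_node({"type": "directory", "entry": "hello.py"}, [blob])
   assert dir1 == dir2 and len(store) == 2  # the identical directory is stored once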
Furthermore, as a side effect of this data model choice, the entire development
history of all the source code archived in Software Heritage—which aims to
match all published source code in the world—is available as a unified whole,
making emergent structures such as code reuse across different projects or
software origins readily available. Further reinforcing the Software Heritage
use cases, this object could become a veritable "map of the stars" of our
entire software commons.
Extended data model
-------------------
In addition to the artifacts detailed above, which represent original software
artifacts, the Software Heritage archive stores information about these
artifacts.
**extid**
a relationship between an original identifier of an artifact, in its
native/upstream environment, and a `core SWHID <persistent-identifiers>`,
which is specific to Software Heritage. As such, it includes:
* the external identifier, stored as bytes whose format is opaque to the
data model
* a type (a simple name and a version), to identify the type of relationship
* the "target", which is a core SWHID
An extid may also include a "payload", which is arbitrary data about the
relationship. For example, an extid might link a directory to the
cryptographic hash of the tarball that originally contained it. In this
case, the payload could include data useful for reconstructing the
original tarball from the directory. The payload data is stored
separately. An extid refers to it by its ``sha1_git`` hash.
**raw extrinsic metadata**
an opaque bytestring, along with its format (a simple name), an identifier
of the object the metadata is about and in which context (similar to a
`qualified SWHID <persistent-identifiers>`), and provenance information
(the authority who provided it, the fetcher tool used to get it, and the
date it was discovered at).
It provides both a way to store information about an artifact contributed by
external entities, after the artifact was created, and an escape hatch to
store metadata that would not otherwise fit in the data model.
(last updated 2020-04-28)
Scheme name: swh
Status: Provisional
Applications/protocols that use this scheme name:
Software Heritage: https://www.softwareheritage.org/
Software Package Data Exchange: https://spdx.org/
NTIA: https://www.ntia.doc.gov/SoftwareTransparency
Identifiers.org: http://identifiers.org/
Name-to-Thing (N2T): https://n2t.net/
HAL: https://hal.archives-ouvertes.fr/
Contact: Stefano Zacchiroli <zack@upsilon.cc>
Change controller: Software Heritage <info@softwareheritage.org>
References:
Scheme specification: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
The Software Heritage project: https://www.softwareheritage.org/
The Software Heritage archive: https://archive.softwareheritage.org/
Publications:
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Referencing Source
Code Artifacts: a Separate Concern in Software Citation. In Computing in
Science and Engineering, volume 22, issue 2, pp. 33-43. ISSN 1521-9615,
IEEE. March 2020. DOI 10.1109/MCSE.2019.2963148
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for
Digital Objects: the Case of Software Source Code Preservation. In
proceedings of iPRES 2018: 15th International Conference on Digital
Preservation. September 2018. DOI 10.17605/OSF.IO/KDE56
(file created 2020-04-28)
swh-merkle-dag.pdf
swh-merkle-dag.svg
MERKLE_DAG = swh-merkle-dag.pdf swh-merkle-dag.svg
BUILD_TARGETS =
BUILD_TARGETS += $(MERKLE_DAG)
all: $(BUILD_TARGETS)
%.svg: %.dia
dia -e $@ $<
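# Inkscape 1.x selects the output file with -o; the legacy 0.x series used -A for PDF export.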
%.pdf: %.svg
set -e; if [ $$(inkscape --version 2>/dev/null | grep -Eo '[0-9]+' | head -1) -gt 0 ]; then \
inkscape -o $@ $< ; \
else \
inkscape -A $@ $< ; \
fi
clean:
-rm -f $(BUILD_TARGETS)
.. _swh-model:
Software Heritage - Development Documentation
=============================================
.. include:: README.rst
.. toctree::
:maxdepth: 2
:caption: Overview:
:titlesonly:
data-model
persistent-identifiers
cli
Overview
--------
.. only:: standalone_package_doc
* :ref:`data-model`
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _persistent-identifiers:
.. _swhids:
=================================================
SoftWare Heritage persistent IDentifiers (SWHIDs)
=================================================
**version 1.6, last modified 2021-04-30**
.. contents::
:local:
:depth: 2
Overview
========
You can point to objects present in the `Software Heritage
<https://www.softwareheritage.org/>`_ `archive
<https://archive.softwareheritage.org/>`_ by means of **SoftWare Heritage
persistent IDentifiers**, or **SWHIDs** for short, which are guaranteed to
remain stable (persistent) over time. Their syntax, meaning, and usage are
described below. Note that they are identifiers and not URLs, even though
URL-based `resolvers`_ for SWHIDs are also available.
A SWHID consists of two separate parts: a mandatory *core identifier* that can
point to any software artifact (or "object") available in the Software Heritage
archive, and an optional list of *qualifiers* that allows specifying the
context in which the object is meant to be seen, or pointing to a subpart of
the object itself.
Objects come in different types:
* contents
* directories
* revisions
* releases
* snapshots
Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its SWHID as described below. The intrinsic identifiers embedded
in SWHIDs are strong cryptographic hashes computed on the entire set of object
properties. Together, these identifiers form a `Merkle structure
<https://en.wikipedia.org/wiki/Merkle_tree>`_, specifically a Merkle `DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>`_.
See the :ref:`Software Heritage data model <data-model>` for an overview of
object types and how they are linked together. See
:py:mod:`swh.model.git_objects` for details on how the intrinsic identifiers
embedded in SWHIDs are computed.
The optional qualifiers are of two kinds:
* **context qualifiers:** carry information about the context where a given
object is meant to be seen. This is particularly important, as the same
object can be reached in the Merkle graph following different *paths*
starting from different nodes (or *anchors*), and it may have been retrieved
from different *origins*, that may evolve between different *visits*
* **fragment qualifiers:** allow pinpointing specific subparts of an object
.. _swhids-syntax:
Syntax
======
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point in the
following grammar:
.. code-block:: bnf
<identifier> ::= <identifier_core> [ <qualifiers> ] ;
<identifier_core> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
<qualifiers> := ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
<context_qualifier>
| <fragment_qualifier>
;
<context_qualifier> ::=
<origin_ctxt>
| <visit_ctxt>
| <anchor_ctxt>
| <path_ctxt>
;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <line_number> ["-" <line_number>] ;
<line_number> ::= <dec_digit> + ;
<url_escaped> ::= (* RFC 3987 IRI *)
<path_absolute_escaped> ::= (* RFC 3987 absolute path *)
Where:
- ``<path_absolute_escaped>`` is an ``<ipath-absolute>`` from `RFC 3987`_, and
- ``<url_escaped>`` is an `RFC 3987`_ IRI

in either case, all occurrences of ``;`` (and ``%``, as required by the RFC)
have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other
characters *can* be percent-encoded, e.g., to improve readability and/or
embeddability of SWHIDs in other contexts.
.. _RFC 3987: https://tools.ietf.org/html/rfc3987
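As an illustration, the hypothetical helper below (not part of ``swh.model``)
splits a SWHID into the parts defined by this grammar, with only basic shape
checks:

.. code-block:: python

   def parse_swhid(swhid: str):
       """Split a SWHID into (object_type, object_id, qualifiers)."""
       # ';' inside qualifier values must be percent-encoded, so splitting is safe.
       core, *qualifier_parts = swhid.split(";")
       prefix, version, object_type, object_id = core.split(":", 3)
       assert prefix == "swh" and version == "1"
       assert object_type in ("snp", "rel", "rev", "dir", "cnt")
       assert len(object_id) == 40 and all(c in "0123456789abcdef" for c in object_id)
       # Each qualifier is a key=value pair; values may themselves contain '='.
       qualifiers = dict(part.split("=", 1) for part in qualifier_parts)
       return object_type, object_id, qualifiers

   print(parse_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=1-5"))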
.. _swhids-semantics:
Semantics
=========
.. _swhids-core:
Core identifiers
----------------
``:`` is used as separator between the logical parts of core identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*. Future editions will use higher version numbers, possibly
breaking backward compatibility, but without breaking the resolvability of
SWHIDs that conform to previous versions of the scheme.
A SWHID points to a single object, whose type is explicitly captured by
``<object_type>``:
* ``snp`` to **snapshots**,
* ``rel`` to **releases**,
* ``rev`` to **revisions**,
* ``dir`` to **directories**,
* ``cnt`` to **contents**.
The actual object pointed to is identified by the intrinsic identifier
``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
computed on the content and metadata of the object itself, as follows:
* for **snapshots**, intrinsic identifiers are SHA1 hashes of manifests computed as per
:py:func:`swh.model.git_objects.snapshot_git_object`
* for **releases**, as per
  :py:func:`swh.model.git_objects.release_git_object`
  that produces the same result as a git annotated tag hash
* for **revisions**, as per
  :py:func:`swh.model.git_objects.revision_git_object`
  that produces the same result as a git commit hash
* for **directories**, as per
  :py:func:`swh.model.git_objects.directory_git_object`
  that produces the same result as a git tree hash
* for **contents**, the intrinsic identifier is the ``sha1_git`` hash returned by
  :py:meth:`swh.model.hashutil.MultiHash.digest`, i.e., the SHA1 of a byte
  sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
  quotes), a space, the length of the content as decimal digits, a NULL byte,
  and the actual content of the file, as sketched below.
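For example, the following self-contained sketch (standard library only)
computes that ``sha1_git`` value; the well-known hash of the empty file shows
that it agrees with Git:

.. code-block:: python

   import hashlib

   def content_sha1_git(data: bytes) -> str:
       # "blob", a space, the length as decimal digits, a NULL byte, the content.
       header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
       return hashlib.sha1(header + data).hexdigest()

   # Matches Git's identifier for the empty blob:
   assert content_sha1_git(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"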
.. _swhids-qualifiers:
Qualifiers
----------
``;`` is used as separator between the core identifier and the optional
qualifiers, as well as between qualifiers. Each qualifier is specified as a
key/value pair, using ``=`` as a separator.
The following *context qualifiers* are available:
* **origin:** the *software origin* where an object has been found or observed
  in the wild, as a URI;
* **visit:** the core identifier of a *snapshot* corresponding to a specific
*visit* of a repository containing the designated object;
* **anchor:** a *designated node* in the Merkle DAG relative to which a *path
to the object* is specified, as the core identifier of a directory, a
revision, a release or a snapshot;
* **path:** the *absolute file path*, from the *root directory* associated to
the *anchor node*, to the object; when the anchor denotes a directory or a
revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if
such a reference is missing;
The following *fragment qualifier* is available:
* **lines:** *line number(s)* of interest, usually within a content object
We recommend equipping identifiers meant to be shared with as many qualifiers
as possible. While qualifiers may be listed in any order, it is good practice to
present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``,
``path``, ``lines``. Redundant information should be omitted: for example, if
the *visit* is present, and the *path* is relative to the snapshot indicated
there, then the *anchor* qualifier is superfluous; similarly, if the *path* is
empty, it may be omitted.
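Assembling a qualified SWHID then amounts to joining ``key=value`` pairs with
``;`` in the recommended order, percent-encoding ``;`` and ``%`` in URL and
path values. The ``build_swhid`` helper below is a hypothetical sketch, not
part of ``swh.model``:

.. code-block:: python

   from urllib.parse import quote

   QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")

   def build_swhid(core: str, **qualifiers: str) -> str:
       parts = [core]
       for key in QUALIFIER_ORDER:
           if key in qualifiers:
               value = qualifiers[key]
               if key in ("origin", "path"):
                   # Escape '%' and ';' so they cannot be mistaken for separators.
                   value = quote(value, safe="/:=@")
               parts.append(f"{key}={value}")
       return ";".join(parts)

   print(build_swhid(
       "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b",
       origin="https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git",
       lines="9-15",
   ))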
Interoperability
================
URI scheme
----------
The ``swh`` URI scheme is registered at IANA for SWHIDs. The present document
constitutes the specification for that URI scheme.
Git compatibility
-----------------
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
The ``<object_id>`` part of a SWHID for a content object is the Git blob
identifier of any file with the same content; for a revision it is the Git
commit identifier for the same revision, etc. This is not the case for
snapshot identifiers, as Git does not have a corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
Automatically fixing invalid SWHIDs
-----------------------------------
User interfaces may fix invalid SWHIDs by lower-casing the
``<identifier_core>`` part of a SWHID, if it contains upper-case letters
because of user errors or limitations in software displaying SWHIDs.
However, implementations displaying or generating SWHIDs should not rely
on this behavior, and must display or generate only valid SWHIDs when
technically possible.
User interfaces should show an error when such an automatic fix occurs,
so users have a chance to fix their SWHID before pasting it into another
interface that does not perform the same corrections.
This also makes it easier to understand issues when a case-sensitive
qualifier has its casing altered.
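Concretely, such a fix can lowercase the core identifier while leaving the
(possibly case-sensitive) qualifiers untouched; a minimal sketch:

.. code-block:: python

   def fix_swhid_case(swhid: str) -> str:
       # Lowercase only the <identifier_core>; qualifiers keep their casing.
       core, sep, qualifiers = swhid.partition(";")
       return core.lower() + sep + qualifiers

   assert (fix_swhid_case("SWH:1:CNT:94A9ED024D3859793618152EA559A168BBCBB5E2;lines=1")
           == "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=1")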
Examples
========
Core identifiers
----------------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Identifiers with qualifiers
---------------------------
* The following :swh_web:`SWHID
<swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`
denotes lines 9 to 15 of a file content that can be found at absolute
path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the
revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is
contained in the snapshot
``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin
``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``::
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
path=/Examples/SimpleFarm/simplefarm.ml;
lines=9-15
* Here is an example of a :swh_web:`SWHID
<swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%253Burl=foo/>`
with a file path that requires percent-escaping::
swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;
origin=https://github.com/web-platform-tests/wpt;
visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;
anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;
path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
Implementation
==============
Computing
---------
An important property of any SWHID is that its core identifier is *intrinsic*:
it can be *computed from the object itself*, without having to rely on any
third party. An implementation of SWHIDs that allows doing so locally is the
`swh identify <https://docs.softwareheritage.org/devel/swh-model/cli.html>`_
tool, available from the `swh.model <https://pypi.org/project/swh.model/>`_
Python package under the GPL license. The package can be installed via the ``pip``
package manager with the one-liner ``pip3 install swh.model[cli]`` on any machine with
Python (at least version 3.7) and ``pip`` installed (on a Debian or Ubuntu system a simple ``apt install python3 python3-pip``
will suffice; see `the general instructions <https://packaging.python.org/tutorials/installing-packages/>`_ for other platforms).
SWHIDs are also automatically computed by Software Heritage for all archived
objects as part of its archival activity, and can be looked up via the project
:swh_web:`Web interface <>`.
This has various practical implications:
* when a software artifact is obtained from Software Heritage by resolving a
  SWHID, it is straightforward to verify that it is exactly the intended one:
  just compute the core identifier from the artifact itself, and check that it
  is the same as the core identifier part of the SWHID (see the sketch after
  this list)
* the core identifier of a software artifact can be computed *before* its
archival on Software Heritage
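Assuming the ``swh.model`` package is installed, such a verification might
look like the sketch below, where ``COPYING`` stands for some local file to be
checked (the API usage mirrors the implementation of ``swh identify``):

.. code-block:: python

   from swh.model.from_disk import Content

   expected = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
   # Compute the core SWHID from the file itself; no third party involved.
   computed = str(Content.from_file(path=b"COPYING").swhid())
   print("match" if computed == expected else "mismatch: " + computed)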
Choosing what type of SWHID to use
----------------------------------
``swh:1:dir:`` SWHIDs are the most robust SWHIDs, as they can be recomputed from
the simplest objects (a directory structure on a filesystem), even when all
metadata is lost, without relying on the Software Heritage archive.
Therefore, we advise implementers and users to prefer this type of SWHID
over ``swh:1:rev:`` and ``swh:1:rel:`` to reference source code artifacts.
However, since keeping the metadata is also important, you should add an anchor
qualifier to ``swh:1:dir:`` SWHIDs whenever possible, so the metadata stored
in the Software Heritage archive can be retrieved when needed.
This means, for example, that you should prefer
``swh:1:dir:a8eded6a2d062c998ba2dcc3dcb0ce68a4e15a58;anchor=swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f``
over ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f``.
Resolvers
---------
Software Heritage resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~
SWHIDs can be resolved using the Software Heritage :swh_web:`Web interface <>`.
In particular, the **root endpoint**
``/`` can be given a SWHID and will lead to the browsing page of the
corresponding object, like this:
``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the Software Heritage :swh_web:`Web API
<api/>` is also available to
programmatically resolve SWHIDs; see: :http:get:`/api/1/resolve/(swhid)/`.
Examples:
* :swh_web:`swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`
* :swh_web:`swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`
* :swh_web:`api/1/resolve/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`
* :swh_web:`api/1/resolve/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`
* :swh_web:`api/1/resolve/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`
* :swh_web:`swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15`
* :swh_web:`swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%253Burl=foo/`
Third-party resolvers
~~~~~~~~~~~~~~~~~~~~~
The following **third party resolvers** support SWHID resolution:
* `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
<https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000655>`_).
* `Name-to-Thing (N2T) <https://n2t.net/>`_
Note that resolution via Identifiers.org currently only supports *core
identifiers* due to `syntactic incompatibilities with qualifiers
<http://identifiers.org/documentation#custom_requests>`_.
Examples:
* `<https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
* `<https://n2t.net/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
* `<https://n2t.net/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%25253Burl=foo/>`_
References
==========
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for
Digital Objects: the Case of Software Source Code Preservation
<https://hal.archives-ouvertes.fr/hal-01865790v4>`_. In Proceedings of `iPRES
2018 <https://ipres2018.org/>`_: 15th International Conference on Digital
Preservation, Boston, MA, USA, September 2018, 9 pages.
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Referencing Source
Code Artifacts: a Separate Concern in Software Citation
<https://arxiv.org/abs/2001.08647>`_. In Computing in Science and
Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615,
IEEE. March 2020.
[project]
name = "swh.model"
authors = [
{name="Software Heritage developers", email="swh-devel@inria.fr"},
]
description = "Software Heritage data model"
readme = {file = "README.rst", content-type = "text/x-rst"}
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"Intended Audience :: Developers",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 5 - Production/Stable",
]
dynamic = ["version", "dependencies", "optional-dependencies"]
[tool.setuptools.packages.find]
include = ["swh.*"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
[tool.setuptools.dynamic.optional-dependencies]
cli = {file = "requirements-cli.txt"}
testing = {file = ["requirements-cli.txt", "requirements-test.txt"]}
testing_minimal = {file = "requirements-test.txt"}
[project.entry-points.console_scripts]
"swh.identify" = "swh.model.cli:identify"
[project.entry-points."swh.cli.subcommands"]
"swh.model" = "swh.model.cli"
[project.urls]
"Homepage" = "https://gitlab.softwareheritage.org/swh/devel/swh-model"
"Bug Reports" = "https://gitlab.softwareheritage.org/swh/devel/swh-model/-/issues"
"Funding" = "https://www.softwareheritage.org/donate"
"Documentation" = "https://docs.softwareheritage.org/devel/swh-model/"
"Source" = "https://gitlab.softwareheritage.org/swh/devel/swh-model.git"
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"
[tool.setuptools_scm]
fallback_version = "0.0.1"
[tool.black]
target-version = ['py39', 'py310', 'py311', 'py312']
[tool.isort]
multi_line_output = 3
include_trailing_comma = true
force_grid_wrap = 0
use_parentheses = true
ensure_newline_before_comments = true
line_length = 88
force_sort_within_sections = true
known_first_party = ['swh']
[tool.mypy]
namespace_packages = true
warn_unused_ignores = true
explicit_package_bases = true
# ^ Needed for mypy to detect py.typed from swh packages installed
# in editable mode
plugins = []
# 3rd party libraries without stubs (yet)
# [[tool.mypy.overrides]]
# module = [
# "package1.*",
# "package2.*",
# ]
# ignore_missing_imports = true
[tool.flake8]
select = ["C", "E", "F", "W", "B950"]
ignore = [
"E203", # whitespaces before ':' <https://github.com/psf/black/issues/315>
"E231", # missing whitespace after ','
"E501", # line too long, use B950 warning from flake8-bugbear instead
"W503" # line break before binary operator <https://github.com/psf/black/issues/52>
]
max-line-length = 88
[tool.pytest.ini_options]
addopts = "--doctest-modules -p no:pytest_swh_core"
norecursedirs = "build docs .*"
asyncio_mode = "strict"
consider_namespace_packages = true
markers = [
"requires_optional_deps: tests in test_cli.py that should not run if optional dependencies are not installed",
]
swh.core >= 0.3
Click
dulwich
aiohttp
click
pytest >= 8.1
pytz
types-click
types-python-dateutil
types-pytz
types-deprecated
# Add here external Python modules dependencies, one per line. Module names
# should match https://pypi.python.org/pypi names. For the full spec of
# dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html
vcversioner
attrs != 21.1.0 # https://github.com/python-attrs/attrs/issues/804
attrs_strict >= 0.0.7
deprecated
hypothesis
iso8601
python-dateutil
typing_extensions
import hashlib
from setuptools import setup, find_packages
def parse_requirements():
requirements = []
for reqf in ('requirements.txt', 'requirements-swh.txt'):
with open(reqf) as f:
for line in f.readlines():
line = line.strip()
if not line or line.startswith('#'):
continue
requirements.append(line)
return requirements
extra_requirements = []
pyblake2_hashes = {'blake2s256', 'blake2b512'}
if pyblake2_hashes - set(hashlib.algorithms_available):
extra_requirements.append('pyblake2')
setup(
name='swh.model',
description='Software Heritage data model',
author='Software Heritage developers',
author_email='swh-devel@inria.fr',
url='https://forge.softwareheritage.org/diffusion/DMOD/',
packages=find_packages(),  # package's modules
scripts=[], # scripts to package
install_requires=parse_requirements() + extra_requirements,
setup_requires=['vcversioner'],
vcversioner={},
include_package_data=True,
)
__path__ = __import__('pkgutil').extend_path(__path__, __name__)
# Copyright (C) 2018-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import os
import sys
from typing import Callable, Dict, Iterable, Optional
# WARNING: do not import unnecessary things here to keep cli startup time under
# control
try:
import click
except ImportError:
print(
"Cannot run swh-identify; the Click package is not installed."
"Please install 'swh.model[cli]' for full functionality.",
file=sys.stderr,
)
exit(1)
try:
import swh.core.cli
cli_command = swh.core.cli.swh.command
except ImportError:
# stub so that swh-identify can be used when swh-core isn't installed
cli_command = click.command
from swh.model.from_disk import Directory
from swh.model.swhids import CoreSWHID
CONTEXT_SETTINGS = dict(help_option_names=["-h", "--help"])
# Mapping between dulwich types and Software Heritage ones. Used by snapshot ID
# computation.
_DULWICH_TYPES = {
b"blob": "content",
b"tree": "directory",
b"commit": "revision",
b"tag": "release",
}
class CoreSWHIDParamType(click.ParamType):
"""Click argument that accepts a core SWHID and returns them as
:class:`swh.model.swhids.CoreSWHID` instances"""
name = "SWHID"
def convert(self, value, param, ctx) -> CoreSWHID:
from swh.model.exceptions import ValidationError
try:
return CoreSWHID.from_string(value)
except ValidationError as e:
self.fail(f'"{value}" is not a valid core SWHID: {e}', param, ctx)
def swhid_of_file(path) -> CoreSWHID:
    from swh.model.from_disk import Content

    obj = Content.from_file(path=path)  # hashes the file like a git blob
    return obj.swhid()


def swhid_of_file_content(data) -> CoreSWHID:
    from swh.model.from_disk import Content

    obj = Content.from_bytes(mode=644, data=data)  # in-memory content
    return obj.swhid()
def model_of_dir(
path: bytes,
exclude_patterns: Optional[Iterable[bytes]] = None,
update_info: Optional[Callable[[int], None]] = None,
) -> Directory:
from swh.model.from_disk import accept_all_paths, ignore_directories_patterns
path_filter = (
ignore_directories_patterns(path, exclude_patterns)
if exclude_patterns
else accept_all_paths
)
return Directory.from_disk(
path=path, path_filter=path_filter, progress_callback=update_info
)
def swhid_of_dir(
path: bytes, exclude_patterns: Optional[Iterable[bytes]] = None
) -> CoreSWHID:
obj = model_of_dir(path, exclude_patterns)
return obj.swhid()
def swhid_of_origin(url):
from swh.model.model import Origin
return Origin(url).swhid()
def swhid_of_git_repo(path) -> CoreSWHID:
try:
import dulwich.repo
except ImportError:
raise click.ClickException(
"Cannot compute snapshot identifier; the Dulwich package is not installed. "
"Please install 'swh.model[cli]' for full functionality.",
)
from swh.model import hashutil
from swh.model.model import Snapshot
repo = dulwich.repo.Repo(path)
branches: Dict[bytes, Optional[Dict]] = {}
for ref, target in repo.refs.as_dict().items():
obj = repo[target]
if obj:
branches[ref] = {
"target": hashutil.bytehex_to_hash(target),
"target_type": _DULWICH_TYPES[obj.type_name],
}
else:
branches[ref] = None
for ref, target in repo.refs.get_symrefs().items():
branches[ref] = {
"target": target,
"target_type": "alias",
}
snapshot = {"branches": branches}
return Snapshot.from_dict(snapshot).swhid()
def identify_object(
obj_type: str, follow_symlinks: bool, exclude_patterns: Iterable[bytes], obj
) -> str:
from urllib.parse import urlparse
if obj_type == "auto":
if obj == "-" or os.path.isfile(obj):
obj_type = "content"
elif os.path.isdir(obj):
obj_type = "directory"
else:
try: # URL parsing
if urlparse(obj).scheme:
obj_type = "origin"
else:
raise ValueError
except ValueError:
raise click.BadParameter("cannot detect object type for %s" % obj)
if obj == "-":
content = sys.stdin.buffer.read()
swhid = str(swhid_of_file_content(content))
elif obj_type in ["content", "directory"]:
path = obj.encode(sys.getfilesystemencoding())
if follow_symlinks and os.path.islink(obj):
path = os.path.realpath(obj)
if obj_type == "content":
swhid = str(swhid_of_file(path))
elif obj_type == "directory":
swhid = str(swhid_of_dir(path, exclude_patterns))
elif obj_type == "origin":
swhid = str(swhid_of_origin(obj))
elif obj_type == "snapshot":
swhid = str(swhid_of_git_repo(obj))
else: # shouldn't happen, due to option validation
raise click.BadParameter("invalid object type: " + obj_type)
# note: we return original obj instead of path here, to preserve user-given
# file name in output
return swhid
@cli_command(context_settings=CONTEXT_SETTINGS)
@click.option(
"--dereference/--no-dereference",
"follow_symlinks",
default=True,
help="follow (or not) symlinks for OBJECTS passed as arguments "
+ "(default: follow)",
)
@click.option(
"--filename/--no-filename",
"show_filename",
default=True,
help="show/hide file name (default: show)",
)
@click.option(
"--type",
"-t",
"obj_type",
default="auto",
type=click.Choice(["auto", "content", "directory", "origin", "snapshot"]),
help="type of object to identify (default: auto)",
)
@click.option(
"--exclude",
"-x",
"exclude_patterns",
metavar="PATTERN",
multiple=True,
help="Exclude directories using glob patterns \
(e.g., ``*.git`` to exclude all .git directories)",
)
@click.option(
"--verify",
"-v",
metavar="SWHID",
type=CoreSWHIDParamType(),
help="reference identifier to be compared with computed one",
)
@click.option(
"-r",
"--recursive",
is_flag=True,
help="compute SWHID recursively",
)
@click.argument("objects", nargs=-1, required=True)
def identify(
obj_type,
verify,
show_filename,
follow_symlinks,
objects,
exclude_patterns,
recursive,
):
"""Compute the Software Heritage persistent identifier (SWHID) for the given
source code object(s).
For more details about SWHIDs see:
https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
Tip: you can pass "-" to identify the content of standard input.
Examples::
$ swh identify fork.c kmod.c sched/deadline.c
swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3 fork.c
swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2 kmod.c
swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82 sched/deadline.c
$ swh identify --no-filename /usr/src/linux/kernel/
swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
$ git clone --mirror https://forge.softwareheritage.org/source/helloworld.git
$ swh identify --type snapshot helloworld.git/
swh:1:snp:510aa88bdc517345d258c1fc2babcd0e1f905e93 helloworld.git
"""
from functools import partial
import logging
if exclude_patterns:
exclude_patterns = set(pattern.encode() for pattern in exclude_patterns)
if verify and len(objects) != 1:
raise click.BadParameter("verification requires a single object")
if recursive and not os.path.isdir(objects[0]):
recursive = False
logging.warning("recursive option disabled, input is not a directory object")
if recursive:
if verify:
raise click.BadParameter(
"verification of recursive object identification is not supported"
)
if obj_type not in ("auto", "directory"):
raise click.BadParameter(
"recursive identification is supported only for directories"
)
path = os.fsencode(objects[0])
dir_obj = model_of_dir(path, exclude_patterns)
for sub_obj in dir_obj.iter_tree():
path_name = "path" if "path" in sub_obj.data.keys() else "data"
path = os.fsdecode(sub_obj.data[path_name])
swhid = str(sub_obj.swhid())
msg = f"{swhid}\t{path}" if show_filename else f"{swhid}"
click.echo(msg)
else:
results = zip(
objects,
map(
partial(identify_object, obj_type, follow_symlinks, exclude_patterns),
objects,
),
)
if verify:
swhid = next(results)[1]
if str(verify) == swhid:
click.echo("SWHID match: %s" % swhid)
sys.exit(0)
else:
click.echo("SWHID mismatch: %s != %s" % (verify, swhid))
sys.exit(1)
else:
for obj, swhid in results:
msg = swhid
if show_filename:
msg = "%s\t%s" % (swhid, os.fsdecode(obj))
click.echo(msg)
if __name__ == "__main__":
identify()
# Copyright (C) 2020-2023 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
"""Utility data structures."""

from __future__ import annotations
from collections.abc import Mapping
import copy
from typing import Dict, Generic, Iterable, Optional, Tuple, TypeVar, Union
KT = TypeVar("KT")
VT = TypeVar("VT")
class ImmutableDict(Mapping, Generic[KT, VT]):
"""A frozen dictionary.
This class behaves like a dictionary, but internally stores objects in a tuple,
so it is both immutable and hashable."""
_data: Dict[KT, VT]
def __init__(
self,
data: Union[Iterable[Tuple[KT, VT]], ImmutableDict[KT, VT], Dict[KT, VT]] = {},
):
if isinstance(data, dict):
self._data = data
elif isinstance(data, ImmutableDict):
self._data = data._data
else:
self._data = {k: v for k, v in data}
@property
def data(self):
return tuple(self._data.items())
def __repr__(self):
return f"ImmutableDict({dict(self.data)!r})"
def __getitem__(self, key):
return self._data[key]
def __iter__(self):
for k, v in self.data:
yield k
def __len__(self):
return len(self._data)
def items(self):
yield from self.data
def __hash__(self):
return hash(tuple(sorted(self.data)))
def copy_pop(self, popped_key) -> Tuple[Optional[VT], ImmutableDict[KT, VT]]:
"""Returns a copy of this ImmutableDict without the given key,
as well as the value associated to the key."""
new_items = copy.deepcopy(self._data)
popped_value: Optional[VT] = new_items.pop(popped_key, None)
return (popped_value, ImmutableDict(new_items))
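# Usage sketch (illustrative, not part of the original module): ImmutableDict
# instances are hashable and support non-destructive removal via copy_pop.
if __name__ == "__main__":
    d = ImmutableDict({"branch": "refs/heads/master", "kind": "git"})
    seen = {d}  # hashable, hence usable as a set member or dictionary key
    value, rest = d.copy_pop("kind")
    assert value == "git" and "kind" not in rest
    print(repr(rest))  # ImmutableDict({'branch': 'refs/heads/master'})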
# Copyright (C) 2022 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
"""Primitives for finding unknown content efficiently."""
from __future__ import annotations
from collections import namedtuple
import itertools
import logging
import random
from typing import (
Any,
Callable,
Iterable,
List,
Mapping,
NamedTuple,
Optional,
Set,
Union,
)
from typing_extensions import Protocol, runtime_checkable
from .from_disk import model
from .model import Sha1Git
logger = logging.getLogger(__name__)
# Maximum sample size when sampling from the undecided set of directory entries
SAMPLE_SIZE = 1000
# Sets of sha1 of contents, skipped contents and directories respectively
Sample: NamedTuple = namedtuple(
"Sample", ["contents", "skipped_contents", "directories"]
)
@runtime_checkable
class ArchiveDiscoveryInterface(Protocol):
"""Interface used in discovery code to abstract over ways of connecting to
the SWH archive (direct storage, web API, etc.) for all methods needed by
discovery algorithms."""
contents: List[model.Content]
skipped_contents: List[model.SkippedContent]
directories: List[model.Directory]
def __init__(
self,
contents: List[model.Content],
skipped_contents: List[model.SkippedContent],
directories: List[model.Directory],
) -> None:
self.contents = contents
self.skipped_contents = skipped_contents
self.directories = directories
def content_missing(self, contents: List[Sha1Git]) -> Iterable[Sha1Git]:
"""List content missing from the archive by sha1"""
def skipped_content_missing(
self, skipped_contents: List[Sha1Git]
) -> Iterable[Sha1Git]:
"""List skipped content missing from the archive by sha1"""
def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
"""List directories missing from the archive by sha1"""
class BaseDiscoveryGraph:
"""Creates the base structures and methods needed for discovery algorithms.
Subclasses should override ``get_sample`` to affect how the discovery is made.
The `update_info_callback` is an optional argument that will get called for
each new piece of information we get. The callback arguments are `(content,
known)`.
- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.
"""
def __init__(
self,
contents,
skipped_contents,
directories,
update_info_callback: Optional[Callable[[Any, bool], None]] = None,
):
self._all_contents: Mapping[
Sha1Git, Union[model.Content, model.SkippedContent]
] = {}
self._undecided_directories: Set[Sha1Git] = set()
self._children: Mapping[Sha1Git, Set[Sha1Git]] = {}
self._parents: Mapping[model.DirectoryEntry, Set[Any]] = {}
self.undecided: Set[Sha1Git] = set()
for content in itertools.chain(contents, skipped_contents):
self.undecided.add(content.sha1_git)
self._all_contents[content.sha1_git] = content
for directory in directories:
self.undecided.add(directory.id)
self._undecided_directories.add(directory.id)
self._children[directory.id] = {c.target for c in directory.entries}
for child in directory.entries:
self._parents.setdefault(child.target, set()).add(directory.id)
self.undecided |= self._undecided_directories
self.known: Set[Sha1Git] = set()
self.unknown: Set[Sha1Git] = set()
self._update_info_callback = update_info_callback
self._sha1_to_obj = {}
for content in itertools.chain(contents, skipped_contents):
self._sha1_to_obj[content.sha1_git] = content
for directory in directories:
self._sha1_to_obj[directory.id] = directory
def mark_known(self, entries: Iterable[Sha1Git]):
"""Mark ``entries`` and those they imply as known in the SWH archive"""
self._mark_entries(entries, self._children, self.known)
def mark_unknown(self, entries: Iterable[Sha1Git]):
"""Mark ``entries`` and those they imply as unknown in the SWH archive"""
self._mark_entries(entries, self._parents, self.unknown)
def _mark_entries(
self,
entries: Iterable[Sha1Git],
transitive_mapping: Mapping[Any, Any],
target_set: Set[Any],
):
"""Use Merkle graph properties to mark a directory entry as known or unknown.
If an entry is known, then all of its descendants are known. If it's
unknown, then all of its ancestors are unknown.
- ``entries``: directory entries to mark along with their ancestors/descendants
where applicable.
- ``transitive_mapping``: mapping from an entry to the next entries to mark
in the hierarchy, if any.
- ``target_set``: set where marked entries will be added.
"""
callback = self._update_info_callback
to_process = set(entries)
while to_process:
current = to_process.pop()
target_set.add(current)
new = current in self.undecided
self.undecided.discard(current)
self._undecided_directories.discard(current)
next_entries = transitive_mapping.get(current, set()) & self.undecided
to_process.update(next_entries)
if new and callback is not None:
obj = self._sha1_to_obj[current]
callback(obj, current in self.known)
def get_sample(
self,
) -> Sample:
"""Return a three-tuple of samples from the undecided sets of contents,
skipped contents and directories respectively.
These samples will be queried against the storage which will tell us
which are known."""
raise NotImplementedError()
def do_query(self, archive: ArchiveDiscoveryInterface, sample: Sample) -> None:
"""Given a three-tuple of samples, ask the archive which are known or
unknown and mark them as such."""
methods = (
archive.content_missing,
archive.skipped_content_missing,
archive.directory_missing,
)
for sample_per_type, method in zip(sample, methods):
if not sample_per_type:
continue
known = set(sample_per_type)
unknown = set(method(list(sample_per_type)))
known -= unknown
self.mark_known(known)
self.mark_unknown(unknown)
class RandomDirSamplingDiscoveryGraph(BaseDiscoveryGraph):
"""Use a random sampling using only directories.
This allows us to find a statistically good spread of entries in the graph
with a smaller population than using all types of entries. When there are
no more directories, only contents or skipped contents remain undecided, if
any are left: we send them directly to the storage since they should be few
and their structure flat."""
def get_sample(self) -> Sample:
if self._undecided_directories:
if len(self._undecided_directories) <= SAMPLE_SIZE:
return Sample(
contents=set(),
skipped_contents=set(),
directories=set(self._undecided_directories),
)
sample = random.sample(tuple(self._undecided_directories), SAMPLE_SIZE)
directories = {o for o in sample}
return Sample(
contents=set(), skipped_contents=set(), directories=directories
)
contents = set()
skipped_contents = set()
for sha1 in self.undecided:
obj = self._all_contents[sha1]
obj_type = obj.object_type
if obj_type == model.Content.object_type:
contents.add(sha1)
elif obj_type == model.SkippedContent.object_type:
skipped_contents.add(sha1)
else:
raise TypeError(f"Unexpected object type {obj_type}")
return Sample(
contents=contents, skipped_contents=skipped_contents, directories=set()
)
def filter_known_objects(
archive: ArchiveDiscoveryInterface,
update_info_callback: Optional[Callable[[Any, bool], None]] = None,
):
"""Filter ``archive``'s ``contents``, ``skipped_contents`` and ``directories``
to only return those that are unknown to the SWH archive using a discovery
algorithm.
The `update_info_callback` is an optional argument that will be called for
each new piece of information we learn. The callback arguments are `(content,
known)`:
- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.
"""
contents = archive.contents
skipped_contents = archive.skipped_contents
directories = archive.directories
contents_count = len(contents)
skipped_contents_count = len(skipped_contents)
directories_count = len(directories)
graph = RandomDirSamplingDiscoveryGraph(
contents,
skipped_contents,
directories,
update_info_callback=update_info_callback,
)
while graph.undecided:
sample = graph.get_sample()
graph.do_query(archive, sample)
contents = [c for c in contents if c.sha1_git in graph.unknown]
skipped_contents = [c for c in skipped_contents if c.sha1_git in graph.unknown]
directories = [c for c in directories if c.id in graph.unknown]
logger.debug(
"Filtered out %d contents, %d skipped contents and %d directories",
contents_count - len(contents),
skipped_contents_count - len(skipped_contents),
directories_count - len(directories),
)
return (contents, skipped_contents, directories)
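# Usage sketch (hedged): this assumes an ArchiveDiscoveryInterface
# implementation backed by a swh.storage client -- called
# DiscoveryStorageConnection here, a name that may differ between versions --
# and pre-built lists of model objects scanned from disk.
#
#     archive = DiscoveryStorageConnection(
#         contents, skipped_contents, directories, swh_storage
#     )
#     contents, skipped_contents, directories = filter_known_objects(
#         archive, update_info_callback=lambda obj, known: print(obj, known)
#     )
#     # Only the objects unknown to the archive remain after filtering.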
@@ -33,11 +33,12 @@
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
NON_FIELD_ERRORS = "__all__"
class ValidationError(Exception):
"""An error while validating data."""
def __init__(self, message, code=None, params=None):
"""
The `message` argument can be a single error, a list of errors, or a
@@ -54,16 +55,15 @@ class ValidationError(Exception):
message = message[0]
if isinstance(message, ValidationError):
if hasattr(message, "error_dict"):
message = message.error_dict
# PY2 has a `message` property which is always there, so we can't
# duck-type on it. It was introduced in Python 2.5 and already
# deprecated in Python 2.6.
elif not hasattr(message, "message"):
message = message.error_list
else:
message, code, params = (message.message, message.code, message.params)
if isinstance(message, dict):
self.error_dict = {}
@@ -78,9 +78,8 @@ class ValidationError(Exception):
# Normalize plain strings to instances of ValidationError.
if not isinstance(message, ValidationError):
message = ValidationError(message)
if hasattr(message, "error_dict"):
self.error_list.extend(sum(message.error_dict.values(), []))
else:
self.error_list.extend(message.error_list)
@@ -94,18 +93,18 @@ class ValidationError(Exception):
def message_dict(self):
# Trigger an AttributeError if this ValidationError
# doesn't have an error_dict.
getattr(self, "error_dict")
return dict(self)
@property
def messages(self):
if hasattr(self, "error_dict"):
return sum(dict(self).values(), [])
return list(self)
def update_error_dict(self, error_dict):
if hasattr(self, "error_dict"):
for field, error_list in self.error_dict.items():
error_dict.setdefault(field, []).extend(error_list)
else:
@@ -113,7 +112,7 @@ class ValidationError(Exception):
return error_dict
def __iter__(self):
if hasattr(self, "error_dict"):
for field, errors in self.error_dict.items():
yield field, list(ValidationError(errors))
else:
@@ -124,9 +123,13 @@
yield message
def __str__(self):
if hasattr(self, "error_dict"):
return repr(dict(self))
return repr(list(self))
def __repr__(self):
return "ValidationError(%s)" % self
class InvalidDirectoryPath(Exception):
pass
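# Usage sketch (illustrative values; assumes the Django-style handling this
# adapted class keeps, where `params` are %-interpolated into the message on
# iteration):
#
#     err = ValidationError(
#         "Field %(field)s is mandatory",
#         params={"field": "sha1"},
#         code="model-field-mandatory",
#     )
#     assert err.messages == ["Field sha1 is mandatory"]
#
#     # A dict keyed by field name populates error_dict instead:
#     err = ValidationError({"sha1": ["Invalid hash"]})
#     assert err.message_dict == {"sha1": ["Invalid hash"]}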
@@ -6,8 +6,13 @@
# We do our imports here but we don't use them, so flake8 complains
# flake8: noqa
from .compound import validate_against_schema, validate_all_keys, validate_any_key
from .hashes import validate_sha1, validate_sha1_git, validate_sha256
from .simple import (
validate_bytes,
validate_datetime,
validate_enum,
validate_int,
validate_str,
validate_type,
)
@@ -6,7 +6,7 @@
from collections import defaultdict
import itertools
from ..exceptions import NON_FIELD_ERRORS, ValidationError
def validate_against_schema(model, schema, value):
@@ -26,19 +26,19 @@ def validate_against_schema(model, schema, value):
if not isinstance(value, dict):
raise ValidationError(
"Unexpected type %(type)s for %(model)s, expected dict",
params={
"model": model,
"type": value.__class__.__name__,
},
code="model-unexpected-type",
)
errors = defaultdict(list)
for key, (mandatory, validators) in itertools.chain(
((k, v) for k, v in schema.items() if k != NON_FIELD_ERRORS),
[(NON_FIELD_ERRORS, (False, schema.get(NON_FIELD_ERRORS, [])))],
):
if not validators:
continue
@@ -54,9 +54,9 @@ def validate_against_schema(model, schema, value):
if mandatory:
errors[key].append(
ValidationError(
"Field %(field)s is mandatory",
params={"field": key},
code="model-field-mandatory",
)
)
@@ -74,19 +74,21 @@ def validate_against_schema(model, schema, value):
else:
if not valid:
errdata = {
"validator": validator.__name__,
}
if key == NON_FIELD_ERRORS:
errmsg = "Validation of model %(model)s failed in %(validator)s"
errdata["model"] = model
errcode = "model-validation-failed"
else:
errmsg = "Validation of field %(field)s failed in %(validator)s"
errdata["field"] = key
errcode = "field-validation-failed"
errors[key].append(
ValidationError(errmsg, params=errdata, code=errcode)
@@ -102,11 +104,11 @@ def validate_all_keys(value, keys):
"""Validate that all the given keys are present in value"""
missing_keys = set(keys) - set(value)
if missing_keys:
missing_fields = ", ".join(sorted(missing_keys))
raise ValidationError(
"Missing mandatory fields %(missing_fields)s",
params={"missing_fields": missing_fields},
code="missing-mandatory-field",
)
return True
@@ -116,11 +118,11 @@ def validate_any_key(value, keys):
"""Validate that any of the given keys is present in value"""
present_keys = set(keys) & set(value)
if not present_keys:
missing_fields = ", ".join(sorted(keys))
raise ValidationError(
"Must contain one of the alternative fields %(missing_fields)s",
params={"missing_fields": missing_fields},
code="missing-alternative-field",
)
return True
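# Usage sketch (hypothetical schema and value): a schema maps each field name
# to a (mandatory, validators) pair; validate_against_schema() aggregates all
# failures into a single ValidationError.
#
#     from swh.model.fields import (
#         validate_against_schema,
#         validate_all_keys,
#         validate_any_key,
#         validate_int,
#         validate_sha1,
#     )
#
#     schema = {
#         "sha1": (True, [validate_sha1]),
#         "length": (False, [validate_int]),
#     }
#     value = {"sha1": "34973274ccef6ab4dfaaf86599792fa9c3fe4689", "length": 4}
#     validate_against_schema("content", schema, value)
#     validate_all_keys(value, ("sha1", "length"))
#     validate_any_key(value, ("sha1", "sha1_git", "sha256"))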