.. _data-model:
Software Heritage data model
============================
.. note:: The text below is adapted from §7 of the article `Software Heritage:
Why and How to Preserve Software Source Code
<https://hal.archives-ouvertes.fr/hal-01590958/>`_ (in proceedings of `iPRES
2017 <https://ipres2017.jp/>`_, 14th International Conference on Digital
Preservation, by Roberto Di Cosmo and Stefano Zacchiroli), which also
provides a more general description of Software Heritage for the digital
preservation research community.
In any archival project the choice of the underlying data model—at the logical
level, independently of how data is actually stored on physical media—is
paramount. The data model adopted by Software Heritage to represent the
information that it collects is centered around the notion of *software
artifact*, described below.
It is important to notice that, according to our principles, we must store,
together with every software artifact, full information on where it has been
found (its provenance). Provenance is also captured in our data model, so we
start by providing some basic information on its nature.
Source code hosting places
--------------------------
Currently, Software Heritage uses a curated list of source code hosting
places to crawl. The most common entries we expect to place in such a list are
popular collaborative development forges (e.g., GitHub, Bitbucket), package
manager repositories that host source packages (e.g., CPAN, npm), and FOSS
distributions (e.g., Fedora, FreeBSD). But we may of course also allow more
niche entries, such as URLs of personal or institutional project collections
not hosted on major forges.
While currently entirely manual, the curation of such a list could easily be
made semi-automatic, with entries suggested by fellow archivists and/or
concerned users who want to notify Software Heritage of specific pieces of
endangered source code in need of archiving. This approach is entirely
compatible with Web-wide crawling: crawlers capable of detecting the presence
of source code might enrich the list. In both cases the list will remain
curated, with (semi-automated) review processes that entries will need to pass
before a hosting place starts to be used.
Software artifacts
------------------
Once the hosting places are known, they will need to be looked at periodically
in order to add missing software artifacts to the archive. Which software
artifacts will be found there?
In general, each software distribution mechanism hosts multiple releases of a
given software at any given time. For VCS (Version Control Systems), this is
the natural behaviour; for software packages, while a single version of a
package is just a snapshot of the corresponding software product, one can often
retrieve both current and past versions of the package from its distribution
site.
By reviewing and generalizing existing VCS and source package formats, we have
identified the following recurrent artifacts as commonly found at source code
hosting places. They form the basic ingredients of the Software Heritage
archive. As the terminology varies quite a bit from technology to technology,
we provide below both the canonical name used in Software Heritage and popular
synonyms.
**contents** (AKA "blobs")
the raw content of (source code) files as a sequence of bytes, without file
names or any other metadata. File contents are often recurrent, e.g., across
different versions of the same software, different directories of the same
project, or different projects altogether.
**directories**
a list of named directory entries, each pointing to other artifacts, usually
file contents or sub-directories. Directory entries are also associated with
metadata stored as permission bits.
**revisions** (AKA "commits")
software development within a specific project is essentially a time-indexed
series of copies of a single "root" directory that contains the entire
project source code. Software evolves when a developer modifies the content
of one or more files in that directory and records their changes.
Each recorded copy of the root directory is known as a "revision". It points
to a fully-determined directory and is equipped with arbitrary metadata. Some
of those are added manually by the developer (e.g., commit message), others
are automatically synthesized (timestamps, preceding commit(s), etc).
**releases** (AKA "tags")
some revisions are more equal than others and get selected by developers as
denoting important project milestones known as "releases". Each release
points to the last commit in project history corresponding to the release and
carries metadata: release name and version, release message, cryptographic
signatures, etc.
Additionally, the following crawling-related information is stored as
provenance information in the Software Heritage archive:
**origins**
code "hosting places" as previously described are usually large platforms
that host several unrelated software projects. For software provenance
purposes it is important to be more specific than that.
Software origins are fine-grained references to where source code artifacts
archived by Software Heritage have been retrieved from. They take the form of
``(type, url)`` pairs, where ``url`` is a canonical URL (e.g., the address at
which one can ``git clone`` a repository or download a source tarball) and
``type`` the kind of software origin (e.g., git, svn, or dsc for Debian
source packages).
..
**projects**
as commonly intended are more abstract entities than precise software
origins. Projects relate together several development resources, including
websites, issue trackers, mailing lists, as well as software origins as
intended by Software Heritage.
The debate around the most apt ontologies to capture project-related
information for software hasn't settled yet, but the place projects will take
in the Software Heritage archive is fairly clear. Projects are abstract
entities, which will be arbitrarily nestable in a versioned
project/sub-project hierarchy, and which can be associated with arbitrary
metadata as well as with origins where their source code can be found.
**snapshots**
any kind of software origin offers multiple pointers to the "current" state
of a development project. In the case of VCS this is reflected by branches
(e.g., master, development, but also so-called feature branches dedicated to
extending the software in a specific direction); in the case of package
distributions by notions such as suites that correspond to different maturity
levels of individual packages (e.g., stable, development, etc.).
A "snapshot" of a given software origin records all entry points found there
and where each of them was pointing at the time. For example, a snapshot
object might track the commit where the master branch was pointing to at any
given time, as well as the most recent release of a given package in the
stable suite of a FOSS distribution.
**visits**
link software origins with snapshots. Every time an origin is consulted a new
visit object is created, recording when (according to the Software Heritage
clock) the visit happened and the full snapshot of the state of the software
origin at the time.
.. note::
This model currently records visits as a single point in time. However, the
actual visit process is not instantaneous. Loaders can record successive
changes to the state of the visit, as their work progresses, as updates to
the visit object.
Data structure
--------------
.. _swh-merkle-dag:
.. figure:: images/swh-merkle-dag.svg
:width: 1024px
:align: center
Software Heritage archive as a Merkle DAG, augmented with crawling
information (click to zoom).
With all the bits of what we want to archive in place, the next question is how
to organize them, i.e., which logical data structure to adopt for their
storage. A key observation for this decision is that source code artifacts are
massively duplicated. This is so for several reasons:
* code hosting diaspora (i.e., project development moving to the most
recent/cool collaborative development technology over time);
* copy/paste (AKA "vendoring") of parts or entire external FOSS software
components into other software products;
* large overlap between revisions of the same project: usually only a very
small amount of files/directories are modified by a single commit;
* emergence of DVCS (distributed version control systems), which natively work
by replicating entire repository copies around. GitHub-style pull requests
are the pinnacle of this, as they result in creating an additional repository
copy at each change done by a new developer;
* migration from one VCS to another—e.g., migrations from Subversion to Git,
which are really popular these days—resulting in additional copies, but in a
different distribution format, of the very same development histories.
These trends seem to be neither stopping nor slowing down, and it is reasonable
to expect that they will be even more prominent in the future, due to the
decreasing costs of storage and bandwidth.
For this reason we argue that any sustainable storage layout for archiving
source code in the very long term should support deduplication, so that the
cost of storing an artifact that is encountered more than once is paid only
once. For storage efficiency, deduplication should be supported for all the
software artifacts we have discussed, namely: file contents, directories,
revisions, releases, snapshots.
Realizing that principle, the Software Heritage archive is conceptually a
single (big) `Merkle Directed Acyclic Graph (DAG)
<https://en.wikipedia.org/wiki/Merkle_tree>`_, as depicted in Figure
:ref:`Software Heritage Merkle DAG <swh-merkle-dag>`. In such a graph each of
the artifacts we have described—from file contents up to entire
snapshots—corresponds to a node. Edges between nodes emerge naturally:
directory entries point to other directories or file contents; revisions point
to directories and previous revisions, releases point to revisions, snapshots
point to revisions and releases. Additionally, each node contains all metadata
that are specific to the node itself rather than to pointed nodes; e.g., commit
messages, timestamps, or file names. Note that the structure is really a DAG,
and not a tree, because lines of revision nodes may fork and be merged
back.
..
directory: fff3cc22cb40f71d26f736c082326e77de0b7692
parent: e4feb05112588741b4764739d6da756c357e1f37
author: Stefano Zacchiroli <zack@upsilon.cc>
date: 1443617461 +0200
committer: Stefano Zacchiroli <zack@upsilon.cc>
committer_date: 1443617461 +0200
message:
objstorage: fix tempfile race when adding objects
Before this change, two workers adding the same
object will end up racing to write <SHA1>.tmp.
[...]
revisionid: 64a783216c1ec69dcb267449c0bbf5e54f7c4d6d
A revision node in the Software Heritage DAG
In a Merkle structure each node is identified by an intrinsic identifier
computed as a cryptographic hash of the node content. In the case of Software
Heritage identifiers are computed taking into account both node-specific
metadata and the identifiers of child nodes.
Consider the revision node in the figure whose identifier starts with
`c7640e08d..`. It points to a directory (identifier starting with
`45f0c078..`), which has also been archived. That directory contains a full
copy, at a specific point in time, of a software component—in the example the
`Hello World <https://forge.softwareheritage.org/source/helloworld/>`_ software
component available on our forge. The revision node also points to the
preceding revision node (`43ef7dcd..`) in the project development history.
Finally, the node contains revision-specific metadata, such as the author and
committer of the given change, its timestamps, and the message entered by the
author at commit time.
The identifier of the revision node itself (`c7640e08d..`) is computed as a
cryptographic hash of (a canonical representation of) all the information shown
in the figure. A change in any of it—metadata and/or pointed nodes—would result
in an entirely different node identifier. All other types of nodes in the
Software Heritage archive behave similarly.
The Software Heritage archive inherits useful properties from the underlying
Merkle structure. In particular, deduplication is built in. Any software
artifact encountered in the wild gets added to the archive only if a
corresponding node with a matching intrinsic identifier is not already
present in the graph—file contents, commits, entire directories, and project
snapshots are all deduplicated, incurring storage costs only once.
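To see why the Merkle construction gives deduplication for free, consider the
following toy sketch (illustrative only, not Software Heritage code), where an
object store keyed by intrinsic identifier keeps each distinct artifact exactly
once, and a node identifier covers both the node's own metadata and the
identifiers of its children:

.. code-block:: python

   import hashlib
   import json

   store = {}  # intrinsic identifier -> node

   def add_node(metadata: dict, children: list) -> str:
       # Hash a canonical representation of the node's metadata together with
       # the identifiers of the nodes it points to (the Merkle construction).
       canonical = json.dumps({"meta": metadata, "children": children}, sort_keys=True)
       node_id = hashlib.sha1(canonical.encode()).hexdigest()
       store.setdefault(node_id, {"meta": metadata, "children": children})
       return node_id

   blob = add_node({"type": "content", "data": "print('hello')"}, [])
   dir1 = add_node({"type": "directory", "entry": "hello.py"}, [blob])
   dir2 = add_node({"type": "directory", "entry": "hello.py"}, [blob])
   assert dir1 == dir2 and len(store) == 2  # the identical directory is stored once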
Furthermore, as a side effect of this data model choice, the entire development
history of all the source code archived in Software Heritage—which aims to
match all published source code in the world—is available as a unified whole,
making emergent structures such as code reuse across different projects or
software origins readily available. Further reinforcing the Software Heritage
use cases, this object could become a veritable "map of the stars" of our
entire software commons.
Extended data model
-------------------
In addition to the artifacts detailed above, which represent original software
artifacts, the Software Heritage archive stores information about these
artifacts.
**extid**
a relationship between an original identifier of an artifact, in its
native/upstream environment, and a `core SWHID <persistent-identifiers>`,
which is specific to Software Heritage. As such, it includes:
* the external identifier, stored as bytes whose format is opaque to the
data model
* a type (a simple name and a version), to identify the type of relationship
* the "target", which is a core SWHID
An extid may also include a "payload", which is arbitrary data about the
relationship. For example, an extid might link a directory to the
cryptographic hash of the tarball that originally contained it. In this
case, the payload could include data useful for reconstructing the
original tarball from the directory. The payload data is stored
separately. An extid refers to it by its ``sha1_git`` hash.
**raw extrinsic metadata**
an opaque bytestring, along with its format (a simple name), an identifier
of the object the metadata is about and in which context (similar to a
`qualified SWHID <persistent-identifiers>`), and provenance information
(the authority who provided it, the fetcher tool used to get it, and the
date it was discovered at).
It provides both a way to store information about an artifact contributed by
external entities, after the artifact was created, and an escape hatch to
store metadata that would not otherwise fit in the data model.
(last updated 2020-04-28)
Scheme name: swh
Status: Provisional
Applications/protocols that use this scheme name:
Software Heritage: https://www.softwareheritage.org/
Software Package Data Exchange: https://spdx.org/
NTIA: https://www.ntia.doc.gov/SoftwareTransparency
Identifiers.org: http://identifiers.org/
Name-to-Thing (N2T): https://n2t.net/
HAL: https://hal.archives-ouvertes.fr/
Contact: Stefano Zacchiroli <zack@upsilon.cc>
Change controller: Software Heritage <info@softwareheritage.org>
References:
Scheme specification: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
The Software Heritage project: https://www.softwareheritage.org/
The Software Heritage archive: https://archive.softwareheritage.org/
Publications:
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Referencing Source
Code Artifacts: a Separate Concern in Software Citation. In Computing in
Science and Engineering, volume 22, issue 2, pp. 33-43. ISSN 1521-9615,
IEEE. March 2020. DOI 10.1109/MCSE.2019.2963148
Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for
Digital Objects: the Case of Software Source Code Preservation. In
proceedings of iPRES 2018: 15th International Conference on Digital
Preservation. September 2018. DOI 10.17605/OSF.IO/KDE56
(file created 2020-04-28)
swh-merkle-dag.pdf
swh-merkle-dag.svg
MERKLE_DAG = swh-merkle-dag.pdf swh-merkle-dag.svg
BUILD_TARGETS =
BUILD_TARGETS += $(MERKLE_DAG)
all: $(BUILD_TARGETS)
%.svg: %.dia
dia -e $@ $<
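# Inkscape 1.x selects the output file with -o; the legacy 0.x series used -A for PDF export.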
%.pdf: %.svg
set -e; if [ $$(inkscape --version 2>/dev/null | grep -Eo '[0-9]+' | head -1) -gt 0 ]; then \
inkscape -o $@ $< ; \
else \
inkscape -A $@ $< ; \
fi
clean:
-rm -f $(BUILD_TARGETS)
.. _swh-model:
Software Heritage - Development Documentation
=============================================
.. include:: README.rst
.. toctree::
:maxdepth: 2
:caption: Overview:
:titlesonly:
data-model
persistent-identifiers
cli
Overview
--------
.. only:: standalone_package_doc
* :ref:`data-model`
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _persistent-identifiers:
.. _swhids:
=================================================
SoftWare Heritage persistent IDentifiers (SWHIDs)
=================================================
**version 1.6, last modified 2021-04-30**
.. contents::
:local:
:depth: 2
Overview
========
You can point to objects present in the `Software Heritage
<https://www.softwareheritage.org/>`_ `archive
<https://archive.softwareheritage.org/>`_ by means of **SoftWare Heritage
persistent IDentifiers**, or **SWHIDs** for short, which are guaranteed to
remain stable (persistent) over time. Their syntax, meaning, and usage are
described below. Note that they are identifiers and not URLs, even though
URL-based `resolvers`_ for SWHIDs are also available.
A SWHID consists of two separate parts: a mandatory *core identifier* that can
point to any software artifact (or "object") available in the Software Heritage
archive, and an optional list of *qualifiers* that allows specifying the
context in which the object is meant to be seen, or pointing to a subpart of
the object itself.
Objects come in different types:
* contents
* directories
* revisions
* releases
* snapshots
Each object is identified by an intrinsic, type-specific object identifier that
is embedded in its SWHID as described below. The intrinsic identifiers embedded
in SWHIDs are strong cryptographic hashes computed on the entire set of object
properties. Together, these identifiers form a `Merkle structure
<https://en.wikipedia.org/wiki/Merkle_tree>`_, specifically a Merkle `DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>`_.
See the :ref:`Software Heritage data model <data-model>` for an overview of
object types and how they are linked together. See
:py:mod:`swh.model.git_objects` for details on how the intrinsic identifiers
embedded in SWHIDs are computed.
The optional qualifiers are of two kinds:
* **context qualifiers:** carry information about the context where a given
object is meant to be seen. This is particularly important, as the same
object can be reached in the Merkle graph following different *paths*
starting from different nodes (or *anchors*), and it may have been retrieved
from different *origins*, that may evolve between different *visits*
* **fragment qualifiers:** allow pinpointing specific subparts of an object
.. _swhids-syntax:
Syntax
======
Syntactically, SWHIDs are generated by the ``<identifier>`` entry point in the
following grammar:
.. code-block:: bnf
<identifier> ::= <identifier_core> [ <qualifiers> ] ;
<identifier_core> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
| "rel" (* release *)
| "rev" (* revision *)
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
<qualifiers> := ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
<context_qualifier>
| <fragment_qualifier>
;
<context_qualifier> ::=
<origin_ctxt>
| <visit_ctxt>
| <anchor_ctxt>
| <path_ctxt>
;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <line_number> ["-" <line_number>] ;
<line_number> ::= <dec_digit> + ;
<url_escaped> ::= (* RFC 3987 IRI *)
<path_absolute_escaped> ::= (* RFC 3987 absolute path *)
Where:
- ``<path_absolute_escaped>`` is an ``<ipath-absolute>`` from `RFC 3987`_, and
- ``<url_escaped>`` is an `RFC 3987`_ IRI

in either case, all occurrences of ``;`` (and ``%``, as required by the RFC)
have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other
characters *can* be percent-encoded, e.g., to improve readability and/or
embeddability of SWHIDs in other contexts.
.. _RFC 3987: https://tools.ietf.org/html/rfc3987
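As an illustration, the hypothetical helper below (not part of ``swh.model``)
splits a SWHID into the parts defined by this grammar, with only basic shape
checks:

.. code-block:: python

   def parse_swhid(swhid: str):
       """Split a SWHID into (object_type, object_id, qualifiers)."""
       # ';' inside qualifier values must be percent-encoded, so splitting is safe.
       core, *qualifier_parts = swhid.split(";")
       prefix, version, object_type, object_id = core.split(":", 3)
       assert prefix == "swh" and version == "1"
       assert object_type in ("snp", "rel", "rev", "dir", "cnt")
       assert len(object_id) == 40 and all(c in "0123456789abcdef" for c in object_id)
       # Each qualifier is a key=value pair; values may themselves contain '='.
       qualifiers = dict(part.split("=", 1) for part in qualifier_parts)
       return object_type, object_id, qualifiers

   print(parse_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=1-5"))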
.. _swhids-semantics:
Semantics
=========
.. _swhids-core:
Core identifiers
----------------
``:`` is used as separator between the logical parts of core identifiers. The
``swh`` prefix makes explicit that these identifiers are related to *SoftWare
Heritage*. ``1`` (``<scheme_version>``) is the current version of this
identifier *scheme*. Future editions will use higher version numbers, possibly
breaking backward compatibility, but without breaking the resolvability of
SWHIDs that conform to previous versions of the scheme.
A SWHID points to a single object, whose type is explicitly captured by
``<object_type>``:
* ``snp`` to **snapshots**,
* ``rel`` to **releases**,
* ``rev`` to **revisions**,
* ``dir`` to **directories**,
* ``cnt`` to **contents**.
The actual object pointed to is identified by the intrinsic identifier
``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
computed on the content and metadata of the object itself, as follows:
* for **snapshots**, intrinsic identifiers are SHA1 hashes of manifests computed as per
:py:func:`swh.model.git_objects.snapshot_git_object`
* for **releases**, as per
  :py:func:`swh.model.git_objects.release_git_object`
  that produces the same result as a git annotated tag hash
* for **revisions**, as per
  :py:func:`swh.model.git_objects.revision_git_object`
  that produces the same result as a git commit hash
* for **directories**, as per
  :py:func:`swh.model.git_objects.directory_git_object`
  that produces the same result as a git tree hash
* for **contents**, the intrinsic identifier is the ``sha1_git`` hash returned by
  :py:meth:`swh.model.hashutil.MultiHash.digest`, i.e., the SHA1 of a byte
  sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
  quotes), a space, the length of the content as decimal digits, a NULL byte,
  and the actual content of the file, as sketched below.
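For example, the following self-contained sketch (standard library only)
computes that ``sha1_git`` value; the well-known hash of the empty file shows
that it agrees with Git:

.. code-block:: python

   import hashlib

   def content_sha1_git(data: bytes) -> str:
       # "blob", a space, the length as decimal digits, a NULL byte, the content.
       header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
       return hashlib.sha1(header + data).hexdigest()

   # Matches Git's identifier for the empty blob:
   assert content_sha1_git(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"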
.. _swhids-qualifiers:
Qualifiers
----------
``;`` is used as separator between the core identifier and the optional
qualifiers, as well as between qualifiers. Each qualifier is specified as a
key/value pair, using ``=`` as a separator.
The following *context qualifiers* are available:
* **origin:** the *software origin* where an object has been found or observed
  in the wild, as a URI;
* **visit:** the core identifier of a *snapshot* corresponding to a specific
*visit* of a repository containing the designated object;
* **anchor:** a *designated node* in the Merkle DAG relative to which a *path
to the object* is specified, as the core identifier of a directory, a
revision, a release or a snapshot;
* **path:** the *absolute file path*, from the *root directory* associated to
the *anchor node*, to the object; when the anchor denotes a directory or a
revision, and almost always when it's a release, the root directory is
uniquely determined; when the anchor denotes a snapshot, the root directory
is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if
such a reference is missing;
The following *fragment qualifier* is available:
* **lines:** *line number(s)* of interest, usually within a content object
We recommend equipping identifiers meant to be shared with as many qualifiers
as possible. While qualifiers may be listed in any order, it is good practice to
present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``,
``path``, ``lines``. Redundant information should be omitted: for example, if
the *visit* is present, and the *path* is relative to the snapshot indicated
there, then the *anchor* qualifier is superfluous; similarly, if the *path* is
empty, it may be omitted.
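Assembling a qualified SWHID then amounts to joining ``key=value`` pairs with
``;`` in the recommended order, percent-encoding ``;`` and ``%`` in URL and
path values. The ``build_swhid`` helper below is a hypothetical sketch, not
part of ``swh.model``:

.. code-block:: python

   from urllib.parse import quote

   QUALIFIER_ORDER = ("origin", "visit", "anchor", "path", "lines")

   def build_swhid(core: str, **qualifiers: str) -> str:
       parts = [core]
       for key in QUALIFIER_ORDER:
           if key in qualifiers:
               value = qualifiers[key]
               if key in ("origin", "path"):
                   # Escape '%' and ';' so they cannot be mistaken for separators.
                   value = quote(value, safe="/:=@")
               parts.append(f"{key}={value}")
       return ";".join(parts)

   print(build_swhid(
       "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b",
       origin="https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git",
       lines="9-15",
   ))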
Interoperability
================
URI scheme
----------
The ``swh`` URI scheme is registered at IANA for SWHIDs. The present document
constitutes the specification for that URI scheme.
Git compatibility
-----------------
SWHIDs for contents, directories, revisions, and releases are, at present,
compatible with the `Git <https://git-scm.com/>`_ way of `computing identifiers
<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>`_ for its objects.
The ``<object_id>`` part of a SWHID for a content object is the Git blob
identifier of any file with the same content; for a revision it is the Git
commit identifier for the same revision, etc. This is not the case for
snapshot identifiers, as Git does not have a corresponding object type.
Note that Git compatibility is incidental and is not guaranteed to be
maintained in future versions of this scheme (or Git).
Automatically fixing invalid SWHIDs
-----------------------------------
User interfaces may fix invalid SWHIDs by lower-casing the
``<identifier_core>`` part of a SWHID, if it contains upper-case letters
because of user errors or limitations in software displaying SWHIDs.
However, implementations displaying or generating SWHIDs should not rely
on this behavior, and must display or generate only valid SWHIDs when
technically possible.
User interfaces should show an error when such an automatic fix occurs,
so users have a chance to fix their SWHID before pasting it into another
interface that does not perform the same corrections.
This also makes it easier to understand issues when a case-sensitive
qualifier has its casing altered.
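Concretely, such a fix can lowercase the core identifier while leaving the
(possibly case-sensitive) qualifiers untouched; a minimal sketch:

.. code-block:: python

   def fix_swhid_case(swhid: str) -> str:
       # Lowercase only the <identifier_core>; qualifiers keep their casing.
       core, sep, qualifiers = swhid.partition(";")
       return core.lower() + sep + qualifiers

   assert (fix_swhid_case("SWH:1:CNT:94A9ED024D3859793618152EA559A168BBCBB5E2;lines=1")
           == "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=1")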
Examples
========
Core identifiers
----------------
* ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
of a file containing the full text of the GPL3 license
* ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
containing the source code of the Darktable photography application as it was
at some point on 4 May 2017
* ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
the development history of Darktable, dated 16 January 2017, that added
undo/redo support for masks
* ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
release 2.3.0, dated 24 December 2016
* ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
of the entire Darktable Git repository taken on 4 May 2017 from GitHub
Identifiers with qualifiers
---------------------------
* The following :swh_web:`SWHID
<swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`
denotes lines 9 to 15 of a file content that can be found at absolute
path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the
revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is
contained in the snapshot
``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin
``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``::
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
path=/Examples/SimpleFarm/simplefarm.ml;
lines=9-15
* Here is an example of a :swh_web:`SWHID
<swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%253Burl=foo/>`
with a file path that requires percent-escaping::
swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;
origin=https://github.com/web-platform-tests/wpt;
visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;
anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;
path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
Implementation
==============
Computing
---------
An important property of any SWHID is that its core identifier is *intrinsic*:
it can be *computed from the object itself*, without having to rely on any
third party. An implementation of SWHIDs that allows doing so locally is the
`swh identify <https://docs.softwareheritage.org/devel/swh-model/cli.html>`_
tool, available from the `swh.model <https://pypi.org/project/swh.model/>`_
Python package under the GPL license. The package can be installed via the ``pip``
package manager with the one-liner ``pip3 install swh.model[cli]`` on any machine with
Python (at least version 3.7) and ``pip`` installed (on a Debian or Ubuntu system a simple ``apt install python3 python3-pip``
will suffice; see `the general instructions <https://packaging.python.org/tutorials/installing-packages/>`_ for other platforms).
SWHIDs are also automatically computed by Software Heritage for all archived
objects as part of its archival activity, and can be looked up via the project
:swh_web:`Web interface <>`.
This has various practical implications:
* when a software artifact is obtained from Software Heritage by resolving a
  SWHID, it is straightforward to verify that it is exactly the intended one:
  just compute the core identifier from the artifact itself, and check that it
  is the same as the core identifier part of the SWHID (see the sketch after
  this list)
* the core identifier of a software artifact can be computed *before* its
archival on Software Heritage
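Assuming the ``swh.model`` package is installed, such a verification might
look like the sketch below, where ``COPYING`` stands for some local file to be
checked (the API usage mirrors the implementation of ``swh identify``):

.. code-block:: python

   from swh.model.from_disk import Content

   expected = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
   # Compute the core SWHID from the file itself; no third party involved.
   computed = str(Content.from_file(path=b"COPYING").swhid())
   print("match" if computed == expected else "mismatch: " + computed)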
Choosing what type of SWHID to use
----------------------------------
``swh:1:dir:`` SWHIDs are the most robust SWHIDs, as they can be recomputed from
the simplest objects (a directory structure on a filesystem), even when all
metadata is lost, without relying on the Software Heritage archive.
Therefore, we advise implementers and users to prefer this type of SWHID
over ``swh:1:rev:`` and ``swh:1:rel:`` to reference source code artifacts.
However, since keeping the metadata is also important, you should add an anchor
qualifier to ``swh:1:dir:`` SWHIDs whenever possible, so the metadata stored
in the Software Heritage archive can be retrieved when needed.
This means, for example, that you should prefer
``swh:1:dir:a8eded6a2d062c998ba2dcc3dcb0ce68a4e15a58;anchor=swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f``
over ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f``.
Resolvers
---------
Software Heritage resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~
SWHIDs can be resolved using the Software Heritage :swh_web:`Web interface <>`.
In particular, the **root endpoint**
``/`` can be given a SWHID and will lead to the browsing page of the
corresponding object, like this:
``https://archive.softwareheritage.org/<identifier>``.
A **dedicated** ``/resolve`` **endpoint** of the Software Heritage :swh_web:`Web API
<api/>` is also available to
programmatically resolve SWHIDs; see: :http:get:`/api/1/resolve/(swhid)/`.
Examples:
* :swh_web:`swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`
* :swh_web:`swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`
* :swh_web:`api/1/resolve/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`
* :swh_web:`api/1/resolve/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`
* :swh_web:`api/1/resolve/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`
* :swh_web:`swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15`
* :swh_web:`swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%253Burl=foo/`
Third-party resolvers
~~~~~~~~~~~~~~~~~~~~~
The following **third party resolvers** support SWHID resolution:
* `Identifiers.org <https://identifiers.org>`_; see:
`<http://identifiers.org/swh/>`_ (registry identifier `MIR:00000655
<https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000655>`_).
* `Name-to-Thing (N2T) <https://n2t.net/>`_
Note that resolution via Identifiers.org currently only supports *core
identifiers* due to `syntactic incompatibilities with qualifiers
<http://identifiers.org/documentation#custom_requests>`_.
Examples:
* `<https://identifiers.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2>`_
* `<https://identifiers.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505>`_
* `<https://identifiers.org/swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d>`_
* `<https://n2t.net/swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f>`_
* `<https://n2t.net/swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453>`_
* `<https://n2t.net/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15>`_
* `<https://n2t.net/swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;origin=https://github.com/web-platform-tests/wpt;visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%25253Burl=foo/>`_
References
==========
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for
Digital Objects: the Case of Software Source Code Preservation
<https://hal.archives-ouvertes.fr/hal-01865790v4>`_. In Proceedings of `iPRES
2018 <https://ipres2018.org/>`_: 15th International Conference on Digital
Preservation, Boston, MA, USA, September 2018, 9 pages.
* Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Referencing Source
Code Artifacts: a Separate Concern in Software Citation
<https://arxiv.org/abs/2001.08647>`_. In Computing in Science and
Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615,
IEEE. March 2020.
[project]
name = "swh.model"
authors = [
{name="Software Heritage developers", email="swh-devel@inria.fr"},
]
description = "Software Heritage data model"
readme = {file = "README.rst", content-type = "text/x-rst"}
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"Intended Audience :: Developers",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 5 - Production/Stable",
]
dynamic = ["version", "dependencies", "optional-dependencies"]
[tool.setuptools.packages.find]
include = ["swh.*"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
[tool.setuptools.dynamic.optional-dependencies]
cli = {file = "requirements-cli.txt"}
testing = {file = ["requirements-cli.txt", "requirements-test.txt"]}
testing_minimal = {file = "requirements-test.txt"}
[project.entry-points.console_scripts]
"swh.identify" = "swh.model.cli:identify"
[project.entry-points."swh.cli.subcommands"]
"swh.model" = "swh.model.cli"
[project.urls]
"Homepage" = "https://gitlab.softwareheritage.org/swh/devel/swh-model"
"Bug Reports" = "https://gitlab.softwareheritage.org/swh/devel/swh-model/-/issues"
"Funding" = "https://www.softwareheritage.org/donate"
"Documentation" = "https://docs.softwareheritage.org/devel/swh-model/"
"Source" = "https://gitlab.softwareheritage.org/swh/devel/swh-model.git"
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"
[tool.setuptools_scm]
fallback_version = "0.0.1"
[tool.black]
target-version = ['py39', 'py310', 'py311', 'py312']
[tool.isort]
multi_line_output = 3
include_trailing_comma = true
force_grid_wrap = 0
use_parentheses = true
ensure_newline_before_comments = true
line_length = 88
force_sort_within_sections = true
known_first_party = ['swh']
[tool.mypy]
namespace_packages = true
warn_unused_ignores = true
explicit_package_bases = true
# ^ Needed for mypy to detect py.typed from swh packages installed
# in editable mode
plugins = []
# 3rd party libraries without stubs (yet)
# [[tool.mypy.overrides]]
# module = [
# "package1.*",
# "package2.*",
# ]
# ignore_missing_imports = true
[tool.flake8]
select = ["C", "E", "F", "W", "B950"]
ignore = [
"E203", # whitespaces before ':' <https://github.com/psf/black/issues/315>
"E231", # missing whitespace after ','
"E501", # line too long, use B950 warning from flake8-bugbear instead
"W503" # line break before binary operator <https://github.com/psf/black/issues/52>
]
max-line-length = 88
[tool.pytest.ini_options]
addopts = "--doctest-modules -p no:pytest_swh_core"
norecursedirs = "build docs .*"
asyncio_mode = "strict"
consider_namespace_packages = true
markers = [
"requires_optional_deps: tests in test_cli.py that should not run if optional dependencies are not installed",
]
swh.core >= 0.3
Click
dulwich
aiohttp
click
pytest >= 8.1
pytz
types-click
types-python-dateutil
types-pytz
types-deprecated
# Add here external Python modules dependencies, one per line. Module names
# should match https://pypi.python.org/pypi names. For the full spec of
# dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html
vcversioner
attrs != 21.1.0 # https://github.com/python-attrs/attrs/issues/804
attrs_strict >= 0.0.7
deprecated
hypothesis
iso8601
python-dateutil
typing_extensions
import hashlib
from setuptools import setup, find_packages
def parse_requirements():
requirements = []
for reqf in ('requirements.txt', 'requirements-swh.txt'):
with open(reqf) as f:
for line in f.readlines():
line = line.strip()
if not line or line.startswith('#'):
continue
requirements.append(line)
return requirements
extra_requirements = []
pyblake2_hashes = {'blake2s256', 'blake2b512'}
if pyblake2_hashes - set(hashlib.algorithms_available):
extra_requirements.append('pyblake2')
setup(
name='swh.model',
description='Software Heritage data model',
author='Software Heritage developers',
author_email='swh-devel@inria.fr',
url='https://forge.softwareheritage.org/diffusion/DMOD/',
packages=find_packages(),  # package's modules
scripts=[], # scripts to package
install_requires=parse_requirements() + extra_requirements,
setup_requires=['vcversioner'],
vcversioner={},
include_package_data=True,
)
__path__ = __import__('pkgutil').extend_path(__path__, __name__)
# Copyright (C) 2018-2020 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import os
import sys
from typing import Callable, Dict, Iterable, Optional
# WARNING: do not import unnecessary things here to keep cli startup time under
# control
try:
import click
except ImportError:
print(
"Cannot run swh-identify; the Click package is not installed."
"Please install 'swh.model[cli]' for full functionality.",
file=sys.stderr,
)
exit(1)
try:
import swh.core.cli
cli_command = swh.core.cli.swh.command
except ImportError:
# stub so that swh-identify can be used when swh-core isn't installed
cli_command = click.command
from swh.model.from_disk import Directory
from swh.model.swhids import CoreSWHID
CONTEXT_SETTINGS = dict(help_option_names=["-h", "--help"])
# Mapping between dulwich types and Software Heritage ones. Used by snapshot ID
# computation.
_DULWICH_TYPES = {
b"blob": "content",
b"tree": "directory",
b"commit": "revision",
b"tag": "release",
}
class CoreSWHIDParamType(click.ParamType):
"""Click argument that accepts a core SWHID and returns them as
:class:`swh.model.swhids.CoreSWHID` instances"""
name = "SWHID"
def convert(self, value, param, ctx) -> CoreSWHID:
from swh.model.exceptions import ValidationError
try:
return CoreSWHID.from_string(value)
except ValidationError as e:
self.fail(f'"{value}" is not a valid core SWHID: {e}', param, ctx)
def swhid_of_file(path) -> CoreSWHID:
    from swh.model.from_disk import Content

    obj = Content.from_file(path=path)  # hashes the file like a git blob
    return obj.swhid()


def swhid_of_file_content(data) -> CoreSWHID:
    from swh.model.from_disk import Content

    obj = Content.from_bytes(mode=644, data=data)  # in-memory content
    return obj.swhid()
def model_of_dir(
path: bytes,
exclude_patterns: Optional[Iterable[bytes]] = None,
update_info: Optional[Callable[[int], None]] = None,
) -> Directory:
from swh.model.from_disk import accept_all_paths, ignore_directories_patterns
path_filter = (
ignore_directories_patterns(path, exclude_patterns)
if exclude_patterns
else accept_all_paths
)
return Directory.from_disk(
path=path, path_filter=path_filter, progress_callback=update_info
)
def swhid_of_dir(
path: bytes, exclude_patterns: Optional[Iterable[bytes]] = None
) -> CoreSWHID:
obj = model_of_dir(path, exclude_patterns)
return obj.swhid()
def swhid_of_origin(url):
from swh.model.model import Origin
return Origin(url).swhid()
def swhid_of_git_repo(path) -> CoreSWHID:
try:
import dulwich.repo
except ImportError:
raise click.ClickException(
"Cannot compute snapshot identifier; the Dulwich package is not installed. "
"Please install 'swh.model[cli]' for full functionality.",
)
from swh.model import hashutil
from swh.model.model import Snapshot
repo = dulwich.repo.Repo(path)
branches: Dict[bytes, Optional[Dict]] = {}
for ref, target in repo.refs.as_dict().items():
obj = repo[target]
if obj:
branches[ref] = {
"target": hashutil.bytehex_to_hash(target),
"target_type": _DULWICH_TYPES[obj.type_name],
}
else:
branches[ref] = None
for ref, target in repo.refs.get_symrefs().items():
branches[ref] = {
"target": target,
"target_type": "alias",
}
snapshot = {"branches": branches}
return Snapshot.from_dict(snapshot).swhid()
def identify_object(
obj_type: str, follow_symlinks: bool, exclude_patterns: Iterable[bytes], obj
) -> str:
from urllib.parse import urlparse
if obj_type == "auto":
if obj == "-" or os.path.isfile(obj):
obj_type = "content"
elif os.path.isdir(obj):
obj_type = "directory"
else:
try: # URL parsing
if urlparse(obj).scheme:
obj_type = "origin"
else:
raise ValueError
except ValueError:
raise click.BadParameter("cannot detect object type for %s" % obj)
if obj == "-":
content = sys.stdin.buffer.read()
swhid = str(swhid_of_file_content(content))
elif obj_type in ["content", "directory"]:
path = obj.encode(sys.getfilesystemencoding())
if follow_symlinks and os.path.islink(obj):
path = os.path.realpath(obj)
if obj_type == "content":
swhid = str(swhid_of_file(path))
elif obj_type == "directory":
swhid = str(swhid_of_dir(path, exclude_patterns))
elif obj_type == "origin":
swhid = str(swhid_of_origin(obj))
elif obj_type == "snapshot":
swhid = str(swhid_of_git_repo(obj))
else: # shouldn't happen, due to option validation
raise click.BadParameter("invalid object type: " + obj_type)
# note: we return original obj instead of path here, to preserve user-given
# file name in output
return swhid
@cli_command(context_settings=CONTEXT_SETTINGS)
@click.option(
"--dereference/--no-dereference",
"follow_symlinks",
default=True,
help="follow (or not) symlinks for OBJECTS passed as arguments "
+ "(default: follow)",
)
@click.option(
"--filename/--no-filename",
"show_filename",
default=True,
help="show/hide file name (default: show)",
)
@click.option(
"--type",
"-t",
"obj_type",
default="auto",
type=click.Choice(["auto", "content", "directory", "origin", "snapshot"]),
help="type of object to identify (default: auto)",
)
@click.option(
"--exclude",
"-x",
"exclude_patterns",
metavar="PATTERN",
multiple=True,
help="Exclude directories using glob patterns \
(e.g., ``*.git`` to exclude all .git directories)",
)
@click.option(
"--verify",
"-v",
metavar="SWHID",
type=CoreSWHIDParamType(),
help="reference identifier to be compared with computed one",
)
@click.option(
"-r",
"--recursive",
is_flag=True,
help="compute SWHID recursively",
)
@click.argument("objects", nargs=-1, required=True)
def identify(
obj_type,
verify,
show_filename,
follow_symlinks,
objects,
exclude_patterns,
recursive,
):
"""Compute the Software Heritage persistent identifier (SWHID) for the given
source code object(s).
For more details about SWHIDs see:
https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
Tip: you can pass "-" to identify the content of standard input.
Examples::
$ swh identify fork.c kmod.c sched/deadline.c
swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3 fork.c
swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2 kmod.c
swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82 sched/deadline.c
$ swh identify --no-filename /usr/src/linux/kernel/
swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
$ git clone --mirror https://forge.softwareheritage.org/source/helloworld.git
$ swh identify --type snapshot helloworld.git/
swh:1:snp:510aa88bdc517345d258c1fc2babcd0e1f905e93 helloworld.git
"""
from functools import partial
import logging
if exclude_patterns:
exclude_patterns = set(pattern.encode() for pattern in exclude_patterns)
if verify and len(objects) != 1:
raise click.BadParameter("verification requires a single object")
if recursive and not os.path.isdir(objects[0]):
recursive = False
logging.warning("recursive option disabled, input is not a directory object")
if recursive:
if verify:
raise click.BadParameter(
"verification of recursive object identification is not supported"
)
if obj_type not in ("auto", "directory"):
raise click.BadParameter(
"recursive identification is supported only for directories"
)
path = os.fsencode(objects[0])
dir_obj = model_of_dir(path, exclude_patterns)
for sub_obj in dir_obj.iter_tree():
path_name = "path" if "path" in sub_obj.data.keys() else "data"
path = os.fsdecode(sub_obj.data[path_name])
swhid = str(sub_obj.swhid())
msg = f"{swhid}\t{path}" if show_filename else f"{swhid}"
click.echo(msg)
else:
results = zip(
objects,
map(
partial(identify_object, obj_type, follow_symlinks, exclude_patterns),
objects,
),
)
if verify:
swhid = next(results)[1]
if str(verify) == swhid:
click.echo("SWHID match: %s" % swhid)
sys.exit(0)
else:
click.echo("SWHID mismatch: %s != %s" % (verify, swhid))
sys.exit(1)
else:
for obj, swhid in results:
msg = swhid
if show_filename:
msg = "%s\t%s" % (swhid, os.fsdecode(obj))
click.echo(msg)
if __name__ == "__main__":
identify()
# Copyright (C) 2020-2023 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
"""Utility data structures."""

from __future__ import annotations
from collections.abc import Mapping
import copy
from typing import Dict, Generic, Iterable, Optional, Tuple, TypeVar, Union
KT = TypeVar("KT")
VT = TypeVar("VT")
class ImmutableDict(Mapping, Generic[KT, VT]):
"""A frozen dictionary.
This class behaves like a dictionary, but internally stores objects in a tuple,
so it is both immutable and hashable."""
_data: Dict[KT, VT]
def __init__(
self,
data: Union[Iterable[Tuple[KT, VT]], ImmutableDict[KT, VT], Dict[KT, VT]] = {},
):
if isinstance(data, dict):
self._data = data
elif isinstance(data, ImmutableDict):
self._data = data._data
else:
self._data = {k: v for k, v in data}
@property
def data(self):
return tuple(self._data.items())
def __repr__(self):
return f"ImmutableDict({dict(self.data)!r})"
def __getitem__(self, key):
return self._data[key]
def __iter__(self):
for k, v in self.data:
yield k
def __len__(self):
return len(self._data)
def items(self):
yield from self.data
def __hash__(self):
return hash(tuple(sorted(self.data)))
def copy_pop(self, popped_key) -> Tuple[Optional[VT], ImmutableDict[KT, VT]]:
"""Returns a copy of this ImmutableDict without the given key,
as well as the value associated to the key."""
new_items = copy.deepcopy(self._data)
popped_value: Optional[VT] = new_items.pop(popped_key, None)
return (popped_value, ImmutableDict(new_items))
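# Usage sketch (illustrative, not part of the original module): ImmutableDict
# instances are hashable and support non-destructive removal via copy_pop.
if __name__ == "__main__":
    d = ImmutableDict({"branch": "refs/heads/master", "kind": "git"})
    seen = {d}  # hashable, hence usable as a set member or dictionary key
    value, rest = d.copy_pop("kind")
    assert value == "git" and "kind" not in rest
    print(repr(rest))  # ImmutableDict({'branch': 'refs/heads/master'})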
# Copyright (C) 2022 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
"""Primitives for finding unknown content efficiently."""
from __future__ import annotations
from collections import namedtuple
import itertools
import logging
import random
from typing import (
Any,
Callable,
Iterable,
List,
Mapping,
NamedTuple,
Optional,
Set,
Union,
)
from typing_extensions import Protocol, runtime_checkable
from .from_disk import model
from .model import Sha1Git
logger = logging.getLogger(__name__)
# Maximum sample size when sampling from the undecided set of directory entries
SAMPLE_SIZE = 1000
# Sets of sha1 of contents, skipped contents and directories respectively
Sample: NamedTuple = namedtuple(
"Sample", ["contents", "skipped_contents", "directories"]
)
@runtime_checkable
class ArchiveDiscoveryInterface(Protocol):
"""Interface used in discovery code to abstract over ways of connecting to
the SWH archive (direct storage, web API, etc.) for all methods needed by
discovery algorithms."""
contents: List[model.Content]
skipped_contents: List[model.SkippedContent]
directories: List[model.Directory]
def __init__(
self,
contents: List[model.Content],
skipped_contents: List[model.SkippedContent],
directories: List[model.Directory],
) -> None:
self.contents = contents
self.skipped_contents = skipped_contents
self.directories = directories
def content_missing(self, contents: List[Sha1Git]) -> Iterable[Sha1Git]:
"""List content missing from the archive by sha1"""
def skipped_content_missing(
self, skipped_contents: List[Sha1Git]
) -> Iterable[Sha1Git]:
"""List skipped content missing from the archive by sha1"""
def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
"""List directories missing from the archive by sha1"""
class BaseDiscoveryGraph:
"""Creates the base structures and methods needed for discovery algorithms.
Subclasses should override ``get_sample`` to affect how the discovery is made.
The `update_info_callback` is an optional argument that will get called for
each new piece of information we get. The callback arguments are `(content,
known)`.
- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.
"""
def __init__(
self,
contents,
skipped_contents,
directories,
update_info_callback: Optional[Callable[[Any, bool], None]] = None,
):
self._all_contents: Mapping[
Sha1Git, Union[model.Content, model.SkippedContent]
] = {}
self._undecided_directories: Set[Sha1Git] = set()
self._children: Mapping[Sha1Git, Set[Sha1Git]] = {}
self._parents: Mapping[model.DirectoryEntry, Set[Any]] = {}
self.undecided: Set[Sha1Git] = set()
for content in itertools.chain(contents, skipped_contents):
self.undecided.add(content.sha1_git)
self._all_contents[content.sha1_git] = content
for directory in directories:
self.undecided.add(directory.id)
self._undecided_directories.add(directory.id)
self._children[directory.id] = {c.target for c in directory.entries}
for child in directory.entries:
self._parents.setdefault(child.target, set()).add(directory.id)
self.undecided |= self._undecided_directories
self.known: Set[Sha1Git] = set()
self.unknown: Set[Sha1Git] = set()
self._update_info_callback = update_info_callback
self._sha1_to_obj = {}
for content in itertools.chain(contents, skipped_contents):
self._sha1_to_obj[content.sha1_git] = content
for directory in directories:
self._sha1_to_obj[directory.id] = directory
def mark_known(self, entries: Iterable[Sha1Git]):
"""Mark ``entries`` and those they imply as known in the SWH archive"""
self._mark_entries(entries, self._children, self.known)
def mark_unknown(self, entries: Iterable[Sha1Git]):
"""Mark ``entries`` and those they imply as unknown in the SWH archive"""
self._mark_entries(entries, self._parents, self.unknown)
def _mark_entries(
self,
entries: Iterable[Sha1Git],
transitive_mapping: Mapping[Any, Any],
target_set: Set[Any],
):
"""Use Merkle graph properties to mark a directory entry as known or unknown.
If an entry is known, then all of its descendants are known. If it's
unknown, then all of its ancestors are unknown.
- ``entries``: directory entries to mark along with their ancestors/descendants
where applicable.
- ``transitive_mapping``: mapping from an entry to the next entries to mark
in the hierarchy, if any.
- ``target_set``: set where marked entries will be added.
"""
callback = self._update_info_callback
to_process = set(entries)
while to_process:
current = to_process.pop()
target_set.add(current)
new = current in self.undecided
self.undecided.discard(current)
self._undecided_directories.discard(current)
next_entries = transitive_mapping.get(current, set()) & self.undecided
to_process.update(next_entries)
if new and callback is not None:
obj = self._sha1_to_obj[current]
callback(obj, current in self.known)
def get_sample(
self,
) -> Sample:
"""Return a three-tuple of samples from the undecided sets of contents,
skipped contents and directories respectively.
These samples will be queried against the storage which will tell us
which are known."""
raise NotImplementedError()
def do_query(self, archive: ArchiveDiscoveryInterface, sample: Sample) -> None:
"""Given a three-tuple of samples, ask the archive which are known or
unknown and mark them as such."""
methods = (
archive.content_missing,
archive.skipped_content_missing,
archive.directory_missing,
)
for sample_per_type, method in zip(sample, methods):
if not sample_per_type:
continue
known = set(sample_per_type)
unknown = set(method(list(sample_per_type)))
known -= unknown
self.mark_known(known)
self.mark_unknown(unknown)
class RandomDirSamplingDiscoveryGraph(BaseDiscoveryGraph):
"""Use a random sampling using only directories.
This allows us to find a statistically good spread of entries in the graph
with a smaller population than using all types of entries. When there are
no more directories, only contents or skipped contents remain undecided, if
any are left: we send them directly to the storage since they should be few
and their structure flat."""
def get_sample(self) -> Sample:
if self._undecided_directories:
if len(self._undecided_directories) <= SAMPLE_SIZE:
return Sample(
contents=set(),
skipped_contents=set(),
directories=set(self._undecided_directories),
)
sample = random.sample(tuple(self._undecided_directories), SAMPLE_SIZE)
directories = {o for o in sample}
return Sample(
contents=set(), skipped_contents=set(), directories=directories
)
contents = set()
skipped_contents = set()
for sha1 in self.undecided:
obj = self._all_contents[sha1]
obj_type = obj.object_type
if obj_type == model.Content.object_type:
contents.add(sha1)
elif obj_type == model.SkippedContent.object_type:
skipped_contents.add(sha1)
else:
raise TypeError(f"Unexpected object type {obj_type}")
return Sample(
contents=contents, skipped_contents=skipped_contents, directories=set()
)
def filter_known_objects(
archive: ArchiveDiscoveryInterface,
update_info_callback: Optional[Callable[[Any, bool], None]] = None,
):
"""Filter ``archive``'s ``contents``, ``skipped_contents`` and ``directories``
to only return those that are unknown to the SWH archive using a discovery
algorithm.
The `update_info_callback` is an optional argument that will be called for
each new piece of information we learn. The callback arguments are `(content,
known)`:
- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.
"""
contents = archive.contents
skipped_contents = archive.skipped_contents
directories = archive.directories
contents_count = len(contents)
skipped_contents_count = len(skipped_contents)
directories_count = len(directories)
graph = RandomDirSamplingDiscoveryGraph(
contents,
skipped_contents,
directories,
update_info_callback=update_info_callback,
)
while graph.undecided:
sample = graph.get_sample()
graph.do_query(archive, sample)
contents = [c for c in contents if c.sha1_git in graph.unknown]
skipped_contents = [c for c in skipped_contents if c.sha1_git in graph.unknown]
directories = [c for c in directories if c.id in graph.unknown]
logger.debug(
"Filtered out %d contents, %d skipped contents and %d directories",
contents_count - len(contents),
skipped_contents_count - len(skipped_contents),
directories_count - len(directories),
)
return (contents, skipped_contents, directories)
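# Usage sketch (hedged): this assumes an ArchiveDiscoveryInterface
# implementation backed by a swh.storage client -- called
# DiscoveryStorageConnection here, a name that may differ between versions --
# and pre-built lists of model objects scanned from disk.
#
#     archive = DiscoveryStorageConnection(
#         contents, skipped_contents, directories, swh_storage
#     )
#     contents, skipped_contents, directories = filter_known_objects(
#         archive, update_info_callback=lambda obj, known: print(obj, known)
#     )
#     # Only the objects unknown to the archive remain after filtering.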
@@ -33,11 +33,12 @@
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
NON_FIELD_ERRORS = "__all__"
class ValidationError(Exception):
"""An error while validating data."""
def __init__(self, message, code=None, params=None):
"""
The `message` argument can be a single error, a list of errors, or a
@@ -54,16 +55,15 @@ class ValidationError(Exception):
message = message[0]
if isinstance(message, ValidationError):
if hasattr(message, "error_dict"):
message = message.error_dict
# PY2 has a `message` property which is always there, so we can't
# duck-type on it. It was introduced in Python 2.5 and already
# deprecated in Python 2.6.
elif not hasattr(message, "message"):
message = message.error_list
else:
message, code, params = (message.message, message.code, message.params)
if isinstance(message, dict):
self.error_dict = {}
@@ -78,9 +78,8 @@ class ValidationError(Exception):
# Normalize plain strings to instances of ValidationError.
if not isinstance(message, ValidationError):
message = ValidationError(message)
if hasattr(message, "error_dict"):
self.error_list.extend(sum(message.error_dict.values(), []))
else:
self.error_list.extend(message.error_list)
@@ -94,18 +93,18 @@ class ValidationError(Exception):
def message_dict(self):
# Trigger an AttributeError if this ValidationError
# doesn't have an error_dict.
getattr(self, "error_dict")
return dict(self)
@property
def messages(self):
if hasattr(self, "error_dict"):
return sum(dict(self).values(), [])
return list(self)
def update_error_dict(self, error_dict):
if hasattr(self, "error_dict"):
for field, error_list in self.error_dict.items():
error_dict.setdefault(field, []).extend(error_list)
else:
@@ -113,7 +112,7 @@ class ValidationError(Exception):
return error_dict
def __iter__(self):
if hasattr(self, "error_dict"):
for field, errors in self.error_dict.items():
yield field, list(ValidationError(errors))
else:
@@ -124,9 +123,13 @@
yield message
def __str__(self):
if hasattr(self, "error_dict"):
return repr(dict(self))
return repr(list(self))
def __repr__(self):
return "ValidationError(%s)" % self
class InvalidDirectoryPath(Exception):
pass
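# Usage sketch (illustrative values; assumes the Django-style handling this
# adapted class keeps, where `params` are %-interpolated into the message on
# iteration):
#
#     err = ValidationError(
#         "Field %(field)s is mandatory",
#         params={"field": "sha1"},
#         code="model-field-mandatory",
#     )
#     assert err.messages == ["Field sha1 is mandatory"]
#
#     # A dict keyed by field name populates error_dict instead:
#     err = ValidationError({"sha1": ["Invalid hash"]})
#     assert err.message_dict == {"sha1": ["Invalid hash"]}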
@@ -6,8 +6,13 @@
# We do our imports here but we don't use them, so flake8 complains
# flake8: noqa
from .compound import validate_against_schema, validate_all_keys, validate_any_key
from .hashes import validate_sha1, validate_sha1_git, validate_sha256
from .simple import (
validate_bytes,
validate_datetime,
validate_enum,
validate_int,
validate_str,
validate_type,
)
@@ -6,7 +6,7 @@
from collections import defaultdict
import itertools
from ..exceptions import NON_FIELD_ERRORS, ValidationError
def validate_against_schema(model, schema, value):
@@ -26,19 +26,19 @@ def validate_against_schema(model, schema, value):
if not isinstance(value, dict):
raise ValidationError(
"Unexpected type %(type)s for %(model)s, expected dict",
params={
"model": model,
"type": value.__class__.__name__,
},
code="model-unexpected-type",
)
errors = defaultdict(list)
for key, (mandatory, validators) in itertools.chain(
((k, v) for k, v in schema.items() if k != NON_FIELD_ERRORS),
[(NON_FIELD_ERRORS, (False, schema.get(NON_FIELD_ERRORS, [])))],
):
if not validators:
continue
@@ -54,9 +54,9 @@ def validate_against_schema(model, schema, value):
if mandatory:
errors[key].append(
ValidationError(
"Field %(field)s is mandatory",
params={"field": key},
code="model-field-mandatory",
)
)
@@ -74,19 +74,21 @@ def validate_against_schema(model, schema, value):
else:
if not valid:
errdata = {
"validator": validator.__name__,
}
if key == NON_FIELD_ERRORS:
errmsg = "Validation of model %(model)s failed in %(validator)s"
errdata["model"] = model
errcode = "model-validation-failed"
else:
errmsg = "Validation of field %(field)s failed in %(validator)s"
errdata["field"] = key
errcode = "field-validation-failed"
errors[key].append(
ValidationError(errmsg, params=errdata, code=errcode)
@@ -102,11 +104,11 @@ def validate_all_keys(value, keys):
"""Validate that all the given keys are present in value"""
missing_keys = set(keys) - set(value)
if missing_keys:
missing_fields = ", ".join(sorted(missing_keys))
raise ValidationError(
"Missing mandatory fields %(missing_fields)s",
params={"missing_fields": missing_fields},
code="missing-mandatory-field",
)
return True
@@ -116,11 +118,11 @@ def validate_any_key(value, keys):
"""Validate that any of the given keys is present in value"""
present_keys = set(keys) & set(value)
if not present_keys:
missing_fields = ", ".join(sorted(keys))
raise ValidationError(
"Must contain one of the alternative fields %(missing_fields)s",
params={"missing_fields": missing_fields},
code="missing-alternative-field",
)
return True
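# Usage sketch (hypothetical schema and value): a schema maps each field name
# to a (mandatory, validators) pair; validate_against_schema() aggregates all
# failures into a single ValidationError.
#
#     from swh.model.fields import (
#         validate_against_schema,
#         validate_all_keys,
#         validate_any_key,
#         validate_int,
#         validate_sha1,
#     )
#
#     schema = {
#         "sha1": (True, [validate_sha1]),
#         "length": (False, [validate_int]),
#     }
#     value = {"sha1": "34973274ccef6ab4dfaaf86599792fa9c3fe4689", "length": 4}
#     validate_against_schema("content", schema, value)
#     validate_all_keys(value, ("sha1", "length"))
#     validate_any_key(value, ("sha1", "sha1_git", "sha256"))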