Compare revisions

Nicolas Dandrimont · Antoine R. Dumont
--- a/.gitignore
+++ b/.gitignore
-*~
-build
-/.coverage
-/.coverage.*
-dist
-*.egg-info/
-.eggs/
-.hypothesis
-*.pyc
-__pycache__
-.pytest_cache
-*.sw?
-.tox
-version.txt
--- a/AUTHORS
+++ b/AUTHORS
-Copyright (C) 2015 The Software Heritage developers
-
-See http://www.softwareheritage.org/ for more information.
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
-# Software Heritage Code of Conduct
-
-## Our Pledge
-
-In the interest of fostering an open and welcoming environment, we as Software
-Heritage contributors and maintainers pledge to making participation in our
-project and our community a harassment-free experience for everyone, regardless
-of age, body size, disability, ethnicity, sex characteristics, gender identity
-and expression, level of experience, education, socio-economic status,
-nationality, personal appearance, race, religion, or sexual identity and
-orientation.
-
-## Our Standards
-
-Examples of behavior that contributes to creating a positive environment
-include:
-
-* Using welcoming and inclusive language
-* Being respectful of differing viewpoints and experiences
-* Gracefully accepting constructive criticism
-* Focusing on what is best for the community
-* Showing empathy towards other community members
-
-Examples of unacceptable behavior by participants include:
-
-* The use of sexualized language or imagery and unwelcome sexual attention or
-  advances
-* Trolling, insulting/derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others' private information, such as a physical or electronic
-  address, without explicit permission
-* Other conduct which could reasonably be considered inappropriate in a
-  professional setting
-
-## Our Responsibilities
-
-Project maintainers are responsible for clarifying the standards of acceptable
-behavior and are expected to take appropriate and fair corrective action in
-response to any instances of unacceptable behavior.
-
-Project maintainers have the right and responsibility to remove, edit, or
-reject comments, commits, code, wiki edits, issues, and other contributions
-that are not aligned to this Code of Conduct, or to ban temporarily or
-permanently any contributor for other behaviors that they deem inappropriate,
-threatening, offensive, or harmful.
-
-## Scope
-
-This Code of Conduct applies within all project spaces, and it also applies when
-an individual is representing the project or its community in public spaces.
-Examples of representing a project or community include using an official
-project e-mail address, posting via an official social media account, or acting
-as an appointed representative at an online or offline event. Representation of
-a project may be further defined and clarified by project maintainers.
-
-## Enforcement
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported by contacting the project team at `conduct@softwareheritage.org`. All
-complaints will be reviewed and investigated and will result in a response that
-is deemed necessary and appropriate to the circumstances. The project team is
-obligated to maintain confidentiality with regard to the reporter of an
-incident.  Further details of specific enforcement policies may be posted
-separately.
-
-Project maintainers who do not follow or enforce the Code of Conduct in good
-faith may face temporary or permanent repercussions as determined by other
-members of the project's leadership.
-
-## Attribution
-
-This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
-available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
-
-[homepage]: https://www.contributor-covenant.org
-
-For answers to common questions about this code of conduct, see
-https://www.contributor-covenant.org/faq
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
-Ishan Bhanuka
--- a/LICENSE
+++ b/LICENSE
--- a/Makefile.local
+++ b/Makefile.local
-FLAG=-v
-NOSEFLAGS=-v -s
--- a/PKG-INFO
+++ b/PKG-INFO
+Metadata-Version: 2.1
+Name: swh.model
+Version: 0.0.44
+Summary: Software Heritage data model
+Home-page: https://forge.softwareheritage.org/diffusion/DMOD/
+Author: Software Heritage developers
+Author-email: swh-devel@inria.fr
+License: UNKNOWN
+Project-URL: Funding, https://www.softwareheritage.org/donate
+Project-URL: Source, https://forge.softwareheritage.org/source/swh-model
+Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
+Description: swh-model
+        =========
+        
+        Implementation of the Data model of the Software Heritage project, used to
+        archive source code artifacts.
+        
+        This module defines the notion of Persistent Identifier (PID) and provides
+        tools to compute them:
+        
+        ```sh
+           $ swh-identify fork.c kmod.c sched/deadline.c
+           swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3    fork.c
+           swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2    kmod.c
+           swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82    sched/deadline.c
+        
+           $ swh-identify --no-filename /usr/src/linux/kernel/
+           swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
+        ```
+        
+Platform: UNKNOWN
+Classifier: Programming Language :: Python :: 3
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
+Classifier: Operating System :: OS Independent
+Classifier: Development Status :: 5 - Production/Stable
+Description-Content-Type: text/markdown
+Provides-Extra: testing
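
The identifiers printed by `swh-identify` in the description above all share the shape `swh:<version>:<type>:<40 hex digits>`. A minimal sketch of splitting such a string into its fields follows. It is based only on the examples shown in this PKG-INFO; `PersistentId` and `parse_pid` are hypothetical names for illustration, not part of the `swh.model` API, and only the `cnt` and `dir` types are confirmed by the examples.

```python
from typing import NamedTuple

HEX_DIGITS = set("0123456789abcdef")


class PersistentId(NamedTuple):
    """Fields of a PID string (hypothetical helper type)."""
    scheme: str
    version: int
    object_type: str
    object_id: str


def parse_pid(pid: str) -> PersistentId:
    """Split a string shaped like 'swh:1:cnt:<40 hex digits>' into fields."""
    # Unpacking raises ValueError if there are not exactly four fields.
    scheme, version, object_type, object_id = pid.split(":")
    if scheme != "swh":
        raise ValueError(f"unexpected scheme: {scheme!r}")
    if len(object_id) != 40 or not set(object_id) <= HEX_DIGITS:
        raise ValueError(f"expected 40 lowercase hex digits: {object_id!r}")
    return PersistentId(scheme, int(version), object_type, object_id)


# Identifier taken from the swh-identify example in the description.
pid = parse_pid("swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab")
print(pid.object_type, pid.version)
```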
--- a/bin/git-revhash
+++ b/bin/git-revhash
-#!/usr/bin/env bash
-
-# Use
-# git-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n'  # noqa
-# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
-
-echo -ne "$*" | git hash-object --stdin -t commit
--- a/bin/swh-hashtree
+++ b/bin/swh-hashtree
-#!/usr/bin/env python3
-
-# Use sample:
-# swh-hashtree --path . --ignore '.svn' --ignore '.git-svn' \
-#    --ignore-empty-folder
-# 38f8d2c3a951f6b94007896d0981077e48bbd702
-
-import click
-import os
-
-from swh.model import from_disk, hashutil
-
-
-def combine_filters(*filters):
-    """Combine several ignore filters"""
-    if len(filters) == 0:
-        return from_disk.accept_all_directories
-    elif len(filters) == 1:
-        return filters[0]
-
-    def combined_filter(*args, **kwargs):
-        return all(filter(*args, **kwargs) for filter in filters)
-
-    return combined_filter
-
-
-@click.command()
-@click.option('--path', default='.',
-              help='Optional path to hash.')
-@click.option('--ignore-empty-folder', is_flag=True, default=False,
-              help='Ignore empty folder.')
-@click.option('--ignore', multiple=True,
-              help='Ignore pattern.')
-def main(path, ignore_empty_folder=False, ignore=None):
-
-    filters = []
-    if ignore_empty_folder:
-        filters.append(from_disk.ignore_empty_directories)
-    if ignore:
-        filters.append(
-            from_disk.ignore_named_directories(
-                [os.fsencode(name) for name in ignore]
-            )
-        )
-
-    try:
-        d = from_disk.Directory.from_disk(path=os.fsencode(path),
-                                          dir_filter=combine_filters(*filters))
-        hash = d.hash
-    except Exception as e:
-        print(e)
-        return
-    else:
-        print(hashutil.hash_to_hex(hash))
-
-
-if __name__ == '__main__':
-    main()
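
The `combine_filters` helper in the removed script accepts a directory only when every filter accepts it. The sketch below reproduces that logic in isolation. The two filters and their `(dirname, entries)` signature are assumptions made for illustration; the real `from_disk` filters may take different arguments.

```python
def combine_filters(*filters):
    """Return a filter that accepts a directory only if all filters do."""
    if not filters:
        # No filters given: accept everything.
        return lambda dirname, entries: True

    def combined(dirname, entries):
        return all(f(dirname, entries) for f in filters)

    return combined


def ignore_empty(dirname, entries):
    """Hypothetical filter: reject directories without entries."""
    return bool(entries)


def ignore_named(names):
    """Hypothetical filter factory: reject directories with given names."""
    rejected = set(names)

    def keep(dirname, entries):
        return dirname not in rejected

    return keep


keep = combine_filters(ignore_empty, ignore_named([b".git", b".svn"]))
print(keep(b"src", [b"main.c"]))   # accepted by both filters
print(keep(b".git", [b"HEAD"]))    # rejected by name
print(keep(b"empty", []))          # rejected because it has no entries
```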
--- a/bin/swh-revhash
+++ b/bin/swh-revhash
-#!/usr/bin/env python3
-
-# Use:
-# swh-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n'  # noqa
-# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
-
-# To compare with git:
-# git-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n'   # noqa
-# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
-
-
-import sys
-
-from swh.model import identifiers, hashutil
-
-
-def revhash(revision_raw):
-    """Compute the revision hash.
-
-    """
-    if b'\\n' in revision_raw:  # HACK: string have somehow their \n
-                                # expanded to \\n
-        revision_raw = revision_raw.replace(b'\\n', b'\n')
-
-    h = hashutil.hash_git_data(revision_raw, 'commit')
-    return identifiers.identifier_to_str(h)
-
-
-if __name__ == '__main__':
-    revision_raw = sys.argv[1].encode('utf-8')
-    print(revhash(revision_raw))
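
Both removed scripts compute the same value: the Git object hash of a raw commit manifest. As a rough sketch of what `git hash-object -t commit` does, the function below prepends Git's `<type> <length>\0` header and applies SHA-1. `git_object_hash` is a hypothetical standalone helper, not the `hashutil.hash_git_data` implementation, which may differ in details.

```python
import hashlib


def git_object_hash(data: bytes, object_type: str) -> str:
    """Hash data the way Git hashes an object of the given type."""
    # Git prefixes the payload with "<type> <decimal length>\0".
    header = f"{object_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The empty tree yields the tree id used in the usage comment above.
print(git_object_hash(b"", "tree"))
# → 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Passing the raw commit text from the usage comment, with its `\n` sequences expanded to real newlines, and the type `"commit"` follows the same path as the scripts.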
--- a/docs/.gitignore
+++ b/docs/.gitignore
-_build/
-apidoc/
-*-stamp
--- a/docs/Makefile
+++ b/docs/Makefile
-include ../../swh-docs/Makefile.sphinx
-include Makefile.local
--- a/docs/Makefile.local
+++ b/docs/Makefile.local
-sphinx/html: images
-sphinx/clean: clean-images
-assets: images
-
-images:
-	make -C images/
-clean-images:
-	make -C images/ clean
-
-.PHONY: images clean-images
-
-
-# Local Variables:
-# mode: makefile
-# End:
--- a/docs/_static/.placeholder
+++ b/docs/_static/.placeholder
--- a/docs/_templates/.placeholder
+++ b/docs/_templates/.placeholder
--- a/docs/conf.py
+++ b/docs/conf.py
-from swh.docs.sphinx.conf import *  # NoQA
--- a/docs/data-model.rst
+++ b/docs/data-model.rst
-.. _data-model:
-
-Data model
-==========
-
-.. note:: The text below is adapted from §7 of the article `Software Heritage:
-  Why and How to Preserve Software Source Code
-  <https://hal.archives-ouvertes.fr/hal-01590958/>`_ (in proceedings of `iPRES
-  2017 <https://ipres2017.jp/>`_, 14th International Conference on Digital
-  Preservation, by Roberto Di Cosmo and Stefano Zacchiroli), which also
-  provides a more general description of Software Heritage for the digital
-  preservation research community.
-
-In any archival project the choice of the underlying data model—at the logical
-level, independently from how data is actually stored on physical media—is
-paramount. The data model adopted by Software Heritage to represent the
-information that it collects is centered around the notion of *software
-artifact*, described below.
-
-It is important to notice that according to our principles, we must store with
-every software artifact full information on where it has been found
-(provenance), which is also captured in our data model, so we start by providing
-some basic information on the nature of this provenance information.
-
-
-Source code hosting places
--------------------------
-
-Currently, Software Heritage uses a curated list of source code hosting
-places to crawl. The most common entries we expect to place in such a list are
-popular collaborative development forges (e.g., GitHub, Bitbucket), package
-manager repositories that host source packages (e.g., CPAN, npm), and FOSS
-distributions (e.g., Fedora, FreeBSD). But we may of course allow also more
-niche entries, such as URLs of personal or institutional project collections
-not hosted on major forges.
-
-While currently entirely manual, the curation of such a list might easily be
-semi-automatic, with entries suggested by fellow archivists and/or concerned
-users that want to notify Software Heritage of the need of archiving specific
-pieces of endangered source code. This approach is entirely compatible with
-Web-wide crawling approaches: crawlers capable of detecting the presence of
-source code might enrich the list. In both cases the list will remain curated,
-with (semi-automated) review processes that will need to pass before a hosting
-place starts to be used.
-
-
-Software artifacts
------------------
-
-Once the hosting places are known, they will need to be periodically looked at
-in order to add to the archive missing software artifacts. Which software
-artifacts will be found there?
-
-In general, each software distribution mechanism hosts multiple releases of a
-given software at any given time. For VCS (Version Control Systems), this is
-the natural behaviour; for software packages, while a single version of a
-package is just a snapshot of the corresponding software product, one can often
-retrieve both current and past versions of the package from its distribution
-site.
-
-By reviewing and generalizing existing VCS and source package formats, we have
-identified the following recurrent artifacts as commonly found at source code
-hosting places. They form the basic ingredients of the Software Heritage
-archive. As the terminology varies quite a bit from technology to technology,
-we provide below both the canonical name used in Software Heritage and popular
-synonyms.
-
-**contents** (AKA "blobs")
-  the raw content of (source code) files as a sequence of bytes, without file
-  names or any other metadata.  File contents are often recurrent, e.g., across
-  different versions of the same software, different directories of the same
-  project, or different projects all together.
-
-**directories**
-  a list of named directory entries, each of which points to other artifacts,
-  usually file contents or sub-directories. Directory entries are also
-  associated with arbitrary metadata, which vary with technologies, but usually
-  include permission bits, modification timestamps, etc.
-
-**revisions** (AKA "commits")
-  software development within a specific project is essentially a time-indexed
-  series of copies of a single "root" directory that contains the entire
-  project source code. Software evolves when a developer modifies the content
-  of one or more files in that directory and records their changes.
-
-  Each recorded copy of the root directory is known as a "revision". It points
-  to a fully-determined directory and is equipped with arbitrary metadata. Some
-  of those are added manually by the developer (e.g., commit message), others
-  are automatically synthesized (timestamps, preceding commit(s), etc).
-
-**releases** (AKA "tags")
-  some revisions are more equal than others and get selected by developers as
-  denoting important project milestones known as "releases". Each release
-  points to the last commit in project history corresponding to the release and
-  might carry arbitrary metadata—e.g., release name and version, release
-  message, cryptographic signatures, etc.
-
-
-Additionally, the following crawling-related information is stored as
-provenance information in the Software Heritage archive:
-
-**origins**
-  code "hosting places" as previously described are usually large platforms
-  that host several unrelated software projects. For software provenance
-  purposes it is important to be more specific than that.
-
-  Software origins are fine grained references to where source code artifacts
-  archived by Software Heritage have been retrieved from. They take the form of
-  ``(type, url)`` pairs, where ``url`` is a canonical URL (e.g., the address at
-  which one can ``git clone`` a repository or download a source tarball) and
-  ``type`` the kind of software origin (e.g., git, svn, or dsc for Debian
-  source packages).
-
-..
-   **projects**
-     as commonly intended are more abstract entities than precise software
-     origins. Projects relate together several development resources, including
-     websites, issue trackers, mailing lists, as well as software origins as
-     intended by Software Heritage.
-
-     The debate around the most apt ontologies to capture project-related
-     information for software hasn't settled yet, but the place projects will take
-     in the Software Heritage archive is fairly clear. Projects are abstract
-     entities, which will be arbitrarily nestable in a versioned
-     project/sub-project hierarchy, and that can be associated to arbitrary
-     metadata as well as origins where their source code can be found.
-
-**snapshots**
-  any kind of software origin offers multiple pointers to the "current" state
-  of a development project. In the case of VCS this is reflected by branches
-  (e.g., master, development, but also so called feature branches dedicated to
-  extending the software in a specific direction); in the case of package
-  distributions by notions such as suites that correspond to different maturity
-  levels of individual packages (e.g., stable, development, etc.).
-
-  A "snapshot" of a given software origin records all entry points found there
-  and where each of them was pointing at the time. For example, a snapshot
-  object might track the commit where the master branch was pointing to at any
-  given time, as well as the most recent release of a given package in the
-  stable suite of a FOSS distribution.
-
-**visits**
-  links together software origins with snapshots. Every time an origin is
-  consulted a new visit object is created, recording when (according to
-  Software Heritage clock) the visit happened and the full snapshot of the
-  state of the software origin at the time.
-
-
-Data structure
--------------
-
-.. _swh-merkle-dag:
-.. figure:: images/swh-merkle-dag.svg
-   :width: 1024px
-   :align: center
-
-   Software Heritage archive as a Merkle DAG, augmented with crawling
-   information (click to zoom).
-
-
-With all the bits of what we want to archive in place, the next question is how
-to organize them, i.e., which logical data structure to adopt for their
-storage. A key observation for this decision is that source code artifacts are
-massively duplicated. This is so for several reasons:
-
-* code hosting diaspora (i.e., project development moving to the most
-  recent/cool collaborative development technology over time);
-* copy/paste (AKA "vendoring") of parts or entire external FOSS software
-  components into other software products;
-* large overlap between revisions of the same project: usually only a very
-  small number of files/directories are modified by a single commit;
-* emergence of DVCS (distributed version control systems), which natively work
-  by replicating entire repository copies around. GitHub-style pull requests
-  are the pinnacle of this, as they result in creating an additional repository
-  copy at each change done by a new developer;
-* migration from one VCS to another—e.g., migrations from Subversion to Git,
-  which are really popular these days—resulting in additional copies, but in a
-  different distribution format, of the very same development histories.
-
-These trends seem to be neither stopping nor slowing down, and it is reasonable
-to expect that they will be even more prominent in the future, due to the
-decreasing costs of storage and bandwidth.
-
-For this reason we argue that any sustainable storage layout for archiving
-source code in the very long term should support deduplication, so that the
-cost of storing a source code artifact is paid only once, however many times
-it is encountered. For storage efficiency, deduplication should be supported
-for all the software artifacts we have discussed, namely: file contents,
-directories, revisions, releases, snapshots.
-
-Realizing that principle, the Software Heritage archive is conceptually a
-single (big) `Merkle Directed Acyclic Graph (DAG)
-<https://en.wikipedia.org/wiki/Merkle_tree>`_, as depicted in Figure
-:ref:`Software Heritage Merkle DAG <swh-merkle-dag>`. In such a graph each of
-the artifacts we have described (from file contents up to entire
-snapshots) corresponds to a node.  Edges between nodes emerge naturally:
-directory entries point to other directories or file contents; revisions point
-to directories and previous revisions, releases point to revisions, snapshots
-point to revisions and releases. Additionally, each node contains all metadata
-that are specific to the node itself rather than to pointed nodes; e.g., commit
-messages, timestamps, or file names. Note that the structure is really a DAG,
-and not a tree, due to the fact that the line of revisions nodes might be
-forked and merged back.
-
-..
-   directory: fff3cc22cb40f71d26f736c082326e77de0b7692
-   parent: e4feb05112588741b4764739d6da756c357e1f37
-   author: Stefano Zacchiroli <zack@upsilon.cc>
-   date: 1443617461 +0200
-   committer: Stefano Zacchiroli <zack@upsilon.cc>
-   commiter_date: 1443617461 +0200
-   message:
-     objstorage: fix tempfile race when adding objects
-
-     Before this change, two workers adding the same
-     object will end up racing to write <SHA1>.tmp.
-     [...]
-
-     revisionid: 64a783216c1ec69dcb267449c0bbf5e54f7c4d6d
-     A revision node in the Software Heritage DAG
-
-In a Merkle structure each node is identified by an intrinsic identifier
-computed as a cryptographic hash of the node content. In the case of Software
-Heritage, identifiers are computed taking into account both node-specific
-metadata and the identifiers of child nodes.
-
-Consider the revision node in the picture whose identifier starts with
-`c7640e08d..`. It points to a directory (identifier starting with
-`45f0c078..`), which has also been archived. That directory contains a full
-copy, at a specific point in time, of a software component—in the example the
-`Hello World <https://forge.softwareheritage.org/source/helloworld/>`_ software
-component available on our forge. The revision node also points to the
-preceding revision node (`43ef7dcd..`) in the project development history.
-Finally, the node contains revision-specific metadata, such as the author and
-committer of the given change, its timestamps, and the message entered by the
-author at commit time.
-
-The identifier of the revision node itself (`c7640e08d..`) is computed as a
-cryptographic hash of a (canonical representation of) all the information shown
-in the figure. A change in any of them (metadata and/or pointed nodes) would
-result in an entirely different node identifier. All other types of nodes in
-the Software Heritage archive behave similarly.
-
-The Software Heritage archive inherits useful properties from the underlying
-Merkle structure. In particular, deduplication is built-in. Any software
-artifact encountered in the wild gets added to the archive only if a
-corresponding node with a matching intrinsic identifier is not already
-available in the graph—file content, commits, entire directories or project
-snapshots are all deduplicated incurring storage costs only once.
-
-Furthermore, as a side effect of this data model choice, the entire development
-history of all the source code archived in Software Heritage (which aims to
-match all published source code in the world) is available as a unified whole,
-making emergent structures, such as code reuse across different projects or
-software origins, readily available. Further reinforcing the Software Heritage
-use cases, this object could become a veritable "map of the stars" of our
-entire software commons.
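
The Merkle properties described in the removed document (intrinsic identifiers that depend on child identifiers, and built-in deduplication) can be illustrated with a toy content-addressed store. This is only a sketch under stated assumptions: the `Archive` class, its `add` method and the byte layout fed to SHA-1 are invented for illustration and do not match the canonical serialization used by Software Heritage.

```python
import hashlib
from typing import Dict, Sequence


class Archive:
    """Toy Merkle DAG store keeping one entry per distinct node."""

    def __init__(self) -> None:
        self.nodes: Dict[str, bytes] = {}

    def add(self, kind: str, metadata: bytes,
            children: Sequence[str] = ()) -> str:
        # Illustrative canonical form: node kind, child ids, own metadata.
        payload = b"\0".join(
            [kind.encode("ascii"), "\n".join(children).encode("ascii"), metadata]
        )
        node_id = hashlib.sha1(payload).hexdigest()
        # Deduplication: an already known identifier is not stored again.
        self.nodes.setdefault(node_id, payload)
        return node_id


archive = Archive()
readme = archive.add("content", b"hello\n")
root = archive.add("directory", b"README", [readme])
rev1 = archive.add("revision", b"initial import", [root])
rev2 = archive.add("revision", b"second commit, same tree", [root, rev1])

# Meeting the same file content again yields the same id and no new node.
again = archive.add("content", b"hello\n")
print(again == readme, len(archive.nodes))
```

Because each identifier covers the identifiers of the children, changing the file content would change the directory identifier and, in turn, every revision that points to it.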
--- a/docs/images/.gitignore
+++ b/docs/images/.gitignore
-swh-merkle-dag.pdf
-swh-merkle-dag.svg
--- a/docs/images/Makefile
+++ b/docs/images/Makefile
-
-MERKLE_DAG =  swh-merkle-dag.pdf swh-merkle-dag.svg
-
-BUILD_TARGETS =
-BUILD_TARGETS += $(MERKLE_DAG)
-
-all: $(BUILD_TARGETS)
-
-
-%.svg: %.dia
-	inkscape -l $@ $<
-
-%.pdf: %.dia
-	inkscape -A $@ $<
-
-clean:
-	-rm -f $(BUILD_TARGETS)
--- a/docs/images/swh-merkle-dag.dia
+++ b/docs/images/swh-merkle-dag.dia