Commits on Source (45)
*~
build
/.coverage
/.coverage.*
dist
*.egg-info/
.eggs/
.hypothesis
*.pyc
__pycache__
.pytest_cache
*.sw?
.tox
version.txt
Copyright (C) 2015 The Software Heritage developers
See http://www.softwareheritage.org/ for more information.
# Software Heritage Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at `conduct@softwareheritage.org`. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an
incident. Further details of specific enforcement policies may be posted
separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
Ishan Bhanuka
FLAG=-v
NOSEFLAGS=-v -s
Metadata-Version: 2.1
Name: swh.model
Version: 0.0.44
Summary: Software Heritage data model
Home-page: https://forge.softwareheritage.org/diffusion/DMOD/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-model
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Description: swh-model
=========
Implementation of the Data model of the Software Heritage project, used to
archive source code artifacts.
This module defines the notion of Persistent Identifier (PID) and provides
tools to compute them:
```sh
$ swh-identify fork.c kmod.c sched/deadline.c
swh:1:cnt:2e391c754ae730bd2d8520c2ab497c403220c6e3 fork.c
swh:1:cnt:0277d1216f80ae1adeed84a686ed34c9b2931fc2 kmod.c
swh:1:cnt:57b939c81bce5d06fa587df8915f05affbe22b82 sched/deadline.c
$ swh-identify --no-filename /usr/src/linux/kernel/
swh:1:dir:f9f858a48d663b3809c9e2f336412717496202ab
```
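For file contents, the hexadecimal part of a `swh:1:cnt:` identifier is the same SHA1 that Git computes for a blob. The following is a minimal sketch using only the Python standard library; the helper name `content_pid` is hypothetical and is not part of the `swh.model` API.

```python
import hashlib


def content_pid(data: bytes) -> str:
    """Return a swh:1:cnt: identifier for raw file content.

    Hypothetical helper: it hashes the content the way Git hashes a blob,
    that is SHA1 over the header "blob <length>" plus a NUL byte, followed
    by the content bytes themselves.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # Same value as: printf 'hello\n' | git hash-object --stdin
    print(content_pid(b"hello\n"))
```

Directories, revisions, releases and snapshots use other object types and serializations, so this sketch covers contents only.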
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/markdown
Provides-Extra: testing
#!/usr/bin/env bash
# Use
# git-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n' # noqa
# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
echo -ne "$*" | git hash-object --stdin -t commit
#!/usr/bin/env python3
# Use sample:
# swh-hashtree --path . --ignore '.svn' --ignore '.git-svn' \
# --ignore-empty-folder
# 38f8d2c3a951f6b94007896d0981077e48bbd702
import click
import os
from swh.model import from_disk, hashutil
def combine_filters(*filters):
    """Combine several ignore filters"""
    if len(filters) == 0:
        return from_disk.accept_all_directories
    elif len(filters) == 1:
        return filters[0]

    def combined_filter(*args, **kwargs):
        # A directory is kept only if every filter accepts it.
        return all(f(*args, **kwargs) for f in filters)

    return combined_filter


@click.command()
@click.option('--path', default='.',
              help='Optional path to hash.')
@click.option('--ignore-empty-folder', is_flag=True, default=False,
              help='Ignore empty folder.')
@click.option('--ignore', multiple=True,
              help='Ignore pattern.')
def main(path, ignore_empty_folder=False, ignore=None):
    filters = []
    if ignore_empty_folder:
        filters.append(from_disk.ignore_empty_directories)
    if ignore:
        filters.append(
            from_disk.ignore_named_directories(
                [os.fsencode(name) for name in ignore]
            )
        )

    try:
        d = from_disk.Directory.from_disk(path=os.fsencode(path),
                                          dir_filter=combine_filters(*filters))
        dir_hash = d.hash
    except Exception as e:
        print(e)
        return
    else:
        print(hashutil.hash_to_hex(dir_hash))


if __name__ == '__main__':
    main()
#!/usr/bin/env python3
# Use:
# swh-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n' # noqa
# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
# To compare with git:
# git-revhash 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\nparent 22c0fa5195a53f2e733ec75a9b6e9d1624a8b771\nauthor seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\ncommitter seanius <seanius@3187e211-bb14-4c82-9596-0b59d67cd7f4> 1138341044 +0000\n\nmaking dir structure...\n' # noqa
# output: 17a631d474f49bbebfdf3d885dcde470d7faafd7
import sys
from swh.model import identifiers, hashutil
def revhash(revision_raw):
    """Compute the revision hash.

    """
    # HACK: arguments passed on the command line carry a literal backslash
    # followed by "n" instead of a newline; convert them back.
    if b'\\n' in revision_raw:
        revision_raw = revision_raw.replace(b'\\n', b'\n')

    h = hashutil.hash_git_data(revision_raw, 'commit')
    return identifiers.identifier_to_str(h)


if __name__ == '__main__':
    revision_raw = sys.argv[1].encode('utf-8')
    print(revhash(revision_raw))
_build/
apidoc/
*-stamp
include ../../swh-docs/Makefile.sphinx
-include Makefile.local
sphinx/html: images
sphinx/clean: clean-images
assets: images
images:
	$(MAKE) -C images/
clean-images:
	$(MAKE) -C images/ clean
.PHONY: images clean-images
# Local Variables:
# mode: makefile
# End:
from swh.docs.sphinx.conf import * # NoQA
.. _data-model:
Data model
==========
.. note:: The text below is adapted from §7 of the article `Software Heritage:
   Why and How to Preserve Software Source Code
   <https://hal.archives-ouvertes.fr/hal-01590958/>`_ (in proceedings of `iPRES
   2017 <https://ipres2017.jp/>`_, 14th International Conference on Digital
   Preservation, by Roberto Di Cosmo and Stefano Zacchiroli), which also
   provides a more general description of Software Heritage for the digital
   preservation research community.
In any archival project the choice of the underlying data model—at the logical
level, independently from how data is actually stored on physical media—is
paramount. The data model adopted by Software Heritage to represent the
information that it collects is centered around the notion of *software
artifact*, described below.
According to our principles, we must store with every software artifact full
information on where it was found (provenance). This information is also
captured in our data model, so we start by describing the nature of
provenance information.
Source code hosting places
--------------------------
Currently, Software Heritage uses a curated list of source code hosting
places to crawl. The most common entries we expect to place in such a list are
popular collaborative development forges (e.g., GitHub, Bitbucket), package
manager repositories that host source packages (e.g., CPAN, npm), and FOSS
distributions (e.g., Fedora, FreeBSD). We may of course also allow more
niche entries, such as URLs of personal or institutional project collections
not hosted on major forges.
While currently entirely manual, the curation of such a list could easily
become semi-automatic, with entries suggested by fellow archivists and/or
concerned users who want to notify Software Heritage of the need to archive
specific pieces of endangered source code. This approach is entirely compatible with
Web-wide crawling approaches: crawlers capable of detecting the presence of
source code might enrich the list. In both cases the list will remain curated,
with (semi-automated) review processes that will need to pass before a hosting
place starts to be used.
Software artifacts
------------------
Once the hosting places are known, they need to be visited periodically in
order to add missing software artifacts to the archive. Which software
artifacts will be found there?
In general, each software distribution mechanism hosts multiple releases of a
given piece of software at any given time. For VCS (Version Control Systems),
this is the natural behaviour; for software packages, while a single version of
a package is just a snapshot of the corresponding software product, one can
often retrieve both current and past versions of the package from its
distribution site.
By reviewing and generalizing existing VCS and source package formats, we have
identified the following recurrent artifacts as commonly found at source code
hosting places. They form the basic ingredients of the Software Heritage
archive. As the terminology varies quite a bit from technology to technology,
we provide below both the canonical name used in Software Heritage and popular
synonyms.
**contents** (AKA "blobs")
  the raw content of (source code) files as a sequence of bytes, without file
  names or any other metadata. File contents are often recurrent, e.g., across
  different versions of the same software, different directories of the same
  project, or different projects altogether.

**directories**
  a list of named directory entries, each of which points to other artifacts,
  usually file contents or sub-directories. Directory entries are also
  associated with arbitrary metadata, which varies with technologies but
  usually includes permission bits, modification timestamps, etc.

**revisions** (AKA "commits")
  software development within a specific project is essentially a time-indexed
  series of copies of a single "root" directory that contains the entire
  project source code. Software evolves when a developer modifies the content
  of one or more files in that directory and records their changes.

  Each recorded copy of the root directory is known as a "revision". It points
  to a fully-determined directory and is equipped with arbitrary metadata. Some
  of those are added manually by the developer (e.g., commit message), others
  are automatically synthesized (timestamps, preceding commit(s), etc).

**releases** (AKA "tags")
  some revisions are more equal than others and get selected by developers as
  denoting important project milestones known as "releases". Each release
  points to the last commit in project history corresponding to the release and
  might carry arbitrary metadata, e.g., release name and version, release
  message, cryptographic signatures, etc.
Additionally, the following crawling-related information is stored as
provenance information in the Software Heritage archive:

**origins**
  code "hosting places" as previously described are usually large platforms
  that host several unrelated software projects. For software provenance
  purposes it is important to be more specific than that.

  Software origins are fine-grained references to where source code artifacts
  archived by Software Heritage have been retrieved from. They take the form of
  ``(type, url)`` pairs, where ``url`` is a canonical URL (e.g., the address at
  which one can ``git clone`` a repository or download a source tarball) and
  ``type`` the kind of software origin (e.g., git, svn, or dsc for Debian
  source packages).

..
   **projects**
     as commonly intended are more abstract entities than precise software
     origins. Projects relate together several development resources, including
     websites, issue trackers, mailing lists, as well as software origins as
     intended by Software Heritage.

     The debate around the most apt ontologies to capture project-related
     information for software hasn't settled yet, but the place projects will
     take in the Software Heritage archive is fairly clear. Projects are
     abstract entities, which will be arbitrarily nestable in a versioned
     project/sub-project hierarchy, and that can be associated with arbitrary
     metadata as well as origins where their source code can be found.

**snapshots**
  any kind of software origin offers multiple pointers to the "current" state
  of a development project. In the case of VCS this is reflected by branches
  (e.g., master, development, but also so-called feature branches dedicated to
  extending the software in a specific direction); in the case of package
  distributions by notions such as suites that correspond to different maturity
  levels of individual packages (e.g., stable, development, etc.).

  A "snapshot" of a given software origin records all entry points found there
  and where each of them was pointing at the time. For example, a snapshot
  object might track the commit the master branch was pointing to at any
  given time, as well as the most recent release of a given package in the
  stable suite of a FOSS distribution.

**visits**
  link together software origins with snapshots. Every time an origin is
  consulted a new visit object is created, recording when (according to the
  Software Heritage clock) the visit happened and the full snapshot of the
  state of the software origin at the time.
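The relationship between origins, snapshots and visits can be sketched with a few plain data classes. This is only an illustration of the description above: the class names, field names, example URL and target identifiers are all assumptions and do not reproduce the actual ``swh.model`` classes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class Origin:
    """A (type, url) pair telling where source code was retrieved from."""
    type: str
    url: str


@dataclass(frozen=True)
class Snapshot:
    """All entry points found at an origin and what each one pointed to."""
    # entry point name (e.g. a branch) -> identifier of the target object
    branches: Dict[str, str]


@dataclass(frozen=True)
class Visit:
    """Links an origin to the snapshot observed at a given date."""
    origin: Origin
    date: datetime
    snapshot: Snapshot


# Hypothetical origin and placeholder target identifiers.
origin = Origin(type="git", url="https://example.org/hello.git")
snapshot = Snapshot(branches={"master": "rev-1", "v1.0": "rel-1"})
visit = Visit(origin=origin, date=datetime.now(timezone.utc), snapshot=snapshot)

print(visit.origin.url, sorted(visit.snapshot.branches))
```

Visiting the same origin again would create a new ``Visit``. If nothing changed at the origin, that visit could point to an identical ``Snapshot``.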
Data structure
--------------
.. _swh-merkle-dag:
.. figure:: images/swh-merkle-dag.svg
   :width: 1024px
   :align: center

   Software Heritage archive as a Merkle DAG, augmented with crawling
   information (click to zoom).
With all the bits of what we want to archive in place, the next question is how
to organize them, i.e., which logical data structure to adopt for their
storage. A key observation for this decision is that source code artifacts are
massively duplicated. This is so for several reasons:
* code hosting diaspora (i.e., project development moving to the most
recent/cool collaborative development technology over time);
* copy/paste (AKA "vendoring") of parts or entire external FOSS software
components into other software products;
* large overlap between revisions of the same project: usually only a very
small number of files/directories are modified by a single commit;
* emergence of DVCS (distributed version control systems), which natively work
by replicating entire repository copies around. GitHub-style pull requests
are the pinnacle of this, as they result in creating an additional repository
copy at each change done by a new developer;
* migration from one VCS to another—e.g., migrations from Subversion to Git,
which are really popular these days—resulting in additional copies, but in a
different distribution format, of the very same development histories.
These trends seem to be neither stopping nor slowing down, and it is reasonable
to expect that they will be even more prominent in the future, due to the
decreasing costs of storage and bandwidth.
For this reason we argue that any sustainable storage layout for archiving
source code in the very long term should support deduplication, so that the
cost of storing a source code artifact that is encountered more than once is
paid only once. For storage efficiency, deduplication should be supported for
all the software artifacts we have discussed, namely: file contents,
directories, revisions, releases, snapshots.
Realizing that principle, the Software Heritage archive is conceptually a
single (big) `Merkle Directed Acyclic Graph (DAG)
<https://en.wikipedia.org/wiki/Merkle_tree>`_, as depicted in Figure
:ref:`Software Heritage Merkle DAG <swh-merkle-dag>`. In such a graph each of
the artifacts we have described (from file contents up to entire snapshots)
corresponds to a node. Edges between nodes emerge naturally: directory entries
point to other directories or file contents; revisions point to directories
and previous revisions; releases point to revisions; snapshots point to
revisions and releases. Additionally, each node contains all metadata that are
specific to the node itself rather than to pointed nodes; e.g., commit
messages, timestamps, or file names. Note that the structure is really a DAG,
and not a tree, because the line of revision nodes might be forked and merged
back.
..
   directory: fff3cc22cb40f71d26f736c082326e77de0b7692
   parent: e4feb05112588741b4764739d6da756c357e1f37
   author: Stefano Zacchiroli <zack@upsilon.cc>
   date: 1443617461 +0200
   committer: Stefano Zacchiroli <zack@upsilon.cc>
   committer_date: 1443617461 +0200
   message:
     objstorage: fix tempfile race when adding objects

     Before this change, two workers adding the same
     object will end up racing to write <SHA1>.tmp.
     [...]
   revisionid: 64a783216c1ec69dcb267449c0bbf5e54f7c4d6d
   A revision node in the Software Heritage DAG
In a Merkle structure each node is identified by an intrinsic identifier
computed as a cryptographic hash of the node content. In the case of Software
Heritage identifiers are computed taking into account both node-specific
metadata and the identifiers of child nodes.
Consider the revision node in the picture whose identifier starts with
`c7640e08d..`. It points to a directory (identifier starting with
`45f0c078..`), which has also been archived. That directory contains a full
copy, at a specific point in time, of a software component—in the example the
`Hello World <https://forge.softwareheritage.org/source/helloworld/>`_ software
component available on our forge. The revision node also points to the
preceding revision node (`43ef7dcd..`) in the project development history.
Finally, the node contains revision-specific metadata, such as the author and
committer of the given change, its timestamps, and the message entered by the
author at commit time.
The identifier of the revision node itself (`c7640e08d..`) is computed as a
cryptographic hash of a (canonical representation of) all the information shown
in the figure. A change in any of them (metadata and/or pointed nodes) would result
in an entirely different node identifier. All other types of nodes in the
Software Heritage archive behave similarly.
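The way an intrinsic identifier depends on both node metadata and child identifiers can be sketched as follows. This is a toy illustration only: the ``node_id`` helper and its line-based serialization are assumptions made for this example and are not the canonical representation used by Software Heritage.

```python
import hashlib
from typing import Dict, Iterable


def node_id(metadata: Dict[str, str], children: Iterable[str]) -> str:
    """Hash a toy serialization of a node's metadata and child identifiers.

    Hypothetical format: one "key value" line per metadata entry (sorted by
    key), then one "child <id>" line per pointed node, in the given order.
    """
    lines = ["%s %s" % (key, metadata[key]) for key in sorted(metadata)]
    lines += ["child %s" % child for child in children]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


# A directory node, then two revisions pointing to it; the second revision
# also points to its predecessor.
directory = node_id({"name": "hello"}, [])
rev1 = node_id({"message": "initial import"}, [directory])
rev2 = node_id({"message": "fix typo"}, [directory, rev1])

# Changing either the metadata or a pointed node yields a different identifier.
print(rev1 != node_id({"message": "Initial import"}, [directory]))
print(rev2 != node_id({"message": "fix typo"}, [directory]))
```

Because each identifier embeds the identifiers of the nodes it points to, a change deep in the graph propagates to every node that can reach it.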
The Software Heritage archive inherits useful properties from the underlying
Merkle structure. In particular, deduplication is built-in. Any software
artifact encountered in the wild gets added to the archive only if a
corresponding node with a matching intrinsic identifier is not already
available in the graph: file contents, commits, entire directories, and project
snapshots are all deduplicated, incurring storage costs only once.
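Built-in deduplication can be illustrated with a minimal content-addressed store. The ``ObjectStore`` class below is a hypothetical in-memory sketch, not the Software Heritage storage implementation, and it hashes raw bytes directly rather than a canonical object representation.

```python
import hashlib
from typing import Dict


class ObjectStore:
    """Toy content-addressed store: objects are keyed by their own hash."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        """Store data unless an object with the same identifier exists."""
        object_id = hashlib.sha1(data).hexdigest()
        # setdefault only writes when the identifier is not already present,
        # so an artifact seen many times is stored exactly once.
        self._objects.setdefault(object_id, data)
        return object_id

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
first = store.add(b"int main(void) { return 0; }\n")
second = store.add(b"int main(void) { return 0; }\n")  # same file, other project
store.add(b"print('hello')\n")

print(first == second, len(store))
```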
Furthermore, as a side effect of this data model choice, the entire development
history of all the source code archived in Software Heritage (which aims to
match all published source code in the world) is available as a unified whole,
making emergent structures, such as code reuse across different projects or
software origins, readily available. Further reinforcing the Software Heritage
use cases, this object could become a veritable "map of the stars" of our
entire software commons.
swh-merkle-dag.pdf
swh-merkle-dag.svg
MERKLE_DAG = swh-merkle-dag.pdf swh-merkle-dag.svg
BUILD_TARGETS =
BUILD_TARGETS += $(MERKLE_DAG)
all: $(BUILD_TARGETS)
%.svg: %.dia
	inkscape -l $@ $<
%.pdf: %.dia
	inkscape -A $@ $<
clean:
	-rm -f $(BUILD_TARGETS)
File deleted