Commits on Source (390)
# Changes here will be overwritten by Copier
_commit: v0.3.3
_src_path: https://gitlab.softwareheritage.org/swh/devel/swh-py-template.git
description: Software Heritage vault
distribution_name: swh-vault
have_cli: true
have_workers: true
package_root: swh/vault
project_name: swh.vault
python_minimal_version: '3.7'
readme_format: rst
# python: Reformat code with black
be318c7fc864410fb44187fdaeade22ca3ee9914
19fc56a7ffa2a7715b8b0dcb1673f0d6f697313a
d746a27c972076801a7a217261443f10b186d15b
*.egg-info/
*.pyc
*.sw?
*~
.coverage
.eggs/
.hypothesis
.mypy_cache
.tox
__pycache__
dist
*.egg-info
version.txt
build/
dist/
# these are symlinks created by a hook in swh-docs' main sphinx conf.py
docs/README.rst
docs/README.md
# this should be a symlink for people who want to build the sphinx doc
# without using tox, generally created by the swh-env/bin/update script
docs/Makefile.sphinx
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: check-json
      - id: check-yaml
  - repo: https://github.com/python/black
    rev: 25.1.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 6.0.0
    hooks:
      - id: isort
  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:
      - id: flake8
        additional_dependencies: [flake8-bugbear==24.12.12, flake8-pyproject]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.4.1
    hooks:
      - id: codespell
        name: Check source code spelling
        stages: [pre-commit]
      - id: codespell
        name: Check commit message spelling
        stages: [commit-msg]
  - repo: local
    hooks:
      - id: mypy
        name: mypy
        entry: mypy
        args: [swh]
        pass_filenames: false
        language: system
        types: [python]
      - id: twine-check
        name: twine check
        description: call twine check when pushing an annotated release tag
        entry: bash -c "ref=$(git describe) &&
          [[ $ref =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] &&
          (python3 -m build --sdist && twine check $(ls -t dist/* | head -1)) || true"
        pass_filenames: false
        stages: [pre-push]
        language: python
        additional_dependencies: [twine, build]
# Software Heritage Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at `conduct@softwareheritage.org`. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an
incident. Further details of specific enforcement policies may be posted
separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
Quentin Campos
include Makefile
include Makefile.local
include README.db_testing
include README.dev
include requirements.txt
include requirements-swh.txt
include version.txt
recursive-include sql *
Software Heritage - Vault
=========================
User-facing service that allows retrieving parts of the archive as
self-contained bundles (e.g., individual releases, entire repository snapshots,
etc.).
The creation of a bundle is called "cooking" a bundle.
Architecture
------------
The vault is made of two main parts:
1. a stateful RPC server called the **backend**
2. Celery tasks, called **cookers**
# Copyright (C) 2020-2023 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
pytest_plugins = [
"swh.storage.pytest_plugin",
]
_build/
apidoc/
*-stamp
include Makefile.sphinx
../README.rst
.. _vault-api-ref:
Vault API Reference
===================
Software source code **objects**---e.g., individual files, directories,
commits, tagged releases, etc.---are stored in the Software Heritage (SWH)
Archive in fully deduplicated form. That allows direct access to individual
artifacts, but requires some preparation ("cooking") when fast access to a large
set of related objects (e.g., an entire repository) is required.
The **Software Heritage Vault** takes care of that preparation by
asynchronously assembling **bundles** of related source code objects, caching,
and garbage collecting them as needed.
The Vault is accessible via an RPC API documented below.
All endpoints are mounted at API root, which is currently at :swh_web:`api/1/`.
Unless otherwise stated, API endpoints respond to the HTTP GET method.
Object identification
---------------------
The vault stores bundles corresponding to different kinds of objects (see
:ref:`data-model`).
The URL fragment ``:bundletype/:swhid`` is used throughout the vault API to
identify vault objects. See :ref:`persistent-identifiers` for details on
the syntax and meaning of ``:swhid``.
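
As an illustration, the fragment can be assembled programmatically. The sketch
below is not part of ``swh.vault``: the helper name ``vault_url`` and the
constants ``API_ROOT`` and ``BUNDLE_TYPES`` are assumptions made for this
example, and the SWHID check is only a rough shape test (the authoritative
syntax is described in :ref:`persistent-identifiers`).

```python
import re

# Assumed API root, taken from the examples in this documentation.
API_ROOT = "https://archive.softwareheritage.org/api/1"

# Bundle types described in this reference.
BUNDLE_TYPES = ("flat", "gitfast", "git-bare")

# Rough shape check only: "swh:1:<3-letter object type>:<40 hex digits>".
_SWHID_RE = re.compile(r"^swh:1:[a-z]{3}:[0-9a-f]{40}$")


def vault_url(bundle_type: str, swhid: str) -> str:
    """Build the cooking/status URL for a bundle (hypothetical helper)."""
    if bundle_type not in BUNDLE_TYPES:
        raise ValueError(f"unknown bundle type: {bundle_type!r}")
    if not _SWHID_RE.match(swhid):
        raise ValueError(f"malformed SWHID: {swhid!r}")
    return f"{API_ROOT}/vault/{bundle_type}/{swhid}/"
```

Appending ``raw/`` to the returned URL would give the retrieval endpoint for
the same bundle.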
Bundle types
------------
Flat
~~~~
Flat bundles are simple tarballs that can be read without any specialized software.
When cooking directories, they are (very close to) the original directories that
were ingested.
When cooking other types of objects, they have multiple root directories,
each corresponding to an original object (revision, ...)
This is typically only useful to cook directories; cooking other types of objects
(revisions, releases, snapshots) is usually done with ``git-bare`` as it is
more efficient and closer to the original repository.
You can extract the resulting bundle using:
.. code:: shell

   tar xaf bundle.tar.gz

gitfast
~~~~~~~
A gzip-compressed `git fast-export
<https://git-scm.com/docs/git-fast-export>`_. You can extract the resulting
bundle using:
.. code:: shell

   git init
   zcat bundle.gitfast.gz | git fast-import
   git checkout HEAD

git-bare
~~~~~~~~
A tarball that can be decompressed to get a real git repository.
It is without a checkout, so it is the equivalent of what one would get
with ``git clone --bare``.
This is the most flexible bundle type, as it allows perfectly recreating
original git repositories, including branches.
You can extract the resulting bundle using:
.. code:: shell

   tar xaf bundle.tar.gz

Then explore its content like a normal ("non-bare") git repository by cloning it:

.. code:: shell

   git clone path/to/extracted/:swhid

Cooking and status checking
---------------------------
Vault bundles might be ready for retrieval or not. When they are not, they will
need to be **cooked** before they can be retrieved. A cooked bundle will remain
around until it expires; after expiration, it will need to be cooked again
before it can be retrieved. Cooking is idempotent, and a no-op in between a
previous cooking operation and expiration.
.. http:post:: /vault/:bundletype/:swhid
.. http:get:: /vault/:bundletype/:swhid

    **Request body**: optionally, an ``email`` POST parameter containing an
    e-mail to notify when the bundle cooking has ended.

    **Allowed HTTP Methods:**

    - :http:method:`post` to **request** a bundle cooking
    - :http:method:`get` to check the progress and status of the cooking
    - :http:method:`head`
    - :http:method:`options`

    **Response:**

    :statuscode 200: bundle available for cooking, status of the cooking
    :statuscode 400: malformed SWHID
    :statuscode 404: unavailable bundle or object not found

    .. sourcecode:: http

        HTTP/1.1 200 OK
        Content-Type: application/json

        {
            "id": 42,
            "fetch_url": "/api/1/vault/flat/:swhid/raw/",
            "swhid": ":swhid",
            "progress_message": "Creating tarball...",
            "status": "pending"
        }

    After a cooking request has been started, all subsequent GET and POST
    requests to the cooking URL return some JSON data containing information
    about the progress of the bundle creation. The JSON contains the
    following keys:

    - ``id``: the ID of the cooking request
    - ``fetch_url``: the URL that can be used for the retrieval of the bundle
    - ``swhid``: the identifier of the requested bundle
    - ``progress_message``: a string describing the current progress of the
      cooking. If the cooking failed, ``progress_message`` will contain the
      reason of the failure.
    - ``status``: one of the following values:

      - ``new``: the bundle request was created
      - ``pending``: the bundle is being cooked
      - ``done``: the bundle has been cooked and is ready for retrieval
      - ``failed``: the bundle cooking failed and can be retried

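
A client typically polls the cooking URL until ``status`` becomes ``done`` or
``failed``. The following is a minimal sketch of such a loop, not an official
client: ``wait_for_bundle``, its parameters, and the default interval are
assumptions made for this example. The HTTP call is passed in as a callable so
that the logic stays independent of any particular HTTP library.

```python
import time
from typing import Callable, Dict


def wait_for_bundle(
    get_status: Callable[[], Dict],
    poll_interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a cooking request until it is done (hypothetical helper).

    ``get_status`` must return the decoded JSON status document of the
    cooking URL. Returns the final status document, raises RuntimeError
    if the cooking failed, and TimeoutError after ``max_polls`` attempts.
    """
    for _ in range(max_polls):
        info = get_status()
        status = info["status"]
        if status == "done":
            return info
        if status == "failed":
            # On failure, progress_message carries the reason.
            raise RuntimeError(f"cooking failed: {info.get('progress_message')}")
        # "new" or "pending": wait before asking again.
        sleep(poll_interval)
    raise TimeoutError(f"bundle not ready after {max_polls} polls")
```

Since a failed cooking can be retried, a caller may choose to catch the
``RuntimeError`` and issue a new POST request.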
Retrieval
---------
Retrieve a specific bundle from the vault with:
.. http:get:: /vault/:bundletype/:swhid/raw

    **Allowed HTTP Methods:** :http:method:`get`, :http:method:`head`,
    :http:method:`options`

    **Response**:

    :statuscode 200: bundle available; response body is the bundle.
    :statuscode 404: unavailable bundle; client should request its cooking.

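
The ``fetch_url`` value shown in the status example is a path rather than a
full URL, so a client has to resolve it against the archive host before
downloading. The sketch below shows one way to do that with the standard
library. ``ARCHIVE_HOST``, ``absolute_fetch_url`` and ``download_bundle`` are
names invented for this example, and the host is assumed from the URLs used
elsewhere in this documentation.

```python
import shutil
import urllib.request
from urllib.parse import urljoin

# Assumed host, taken from the examples in this documentation.
ARCHIVE_HOST = "https://archive.softwareheritage.org"


def absolute_fetch_url(status: dict, host: str = ARCHIVE_HOST) -> str:
    """Resolve the ``fetch_url`` of a status document against the host."""
    return urljoin(host, status["fetch_url"])


def download_bundle(status: dict, dest_path: str) -> None:
    """Stream a cooked bundle to ``dest_path`` (hypothetical helper)."""
    if status["status"] != "done":
        raise RuntimeError(f"bundle is not ready (status: {status['status']})")
    # urlopen raises urllib.error.HTTPError on a 404 response, which
    # means the bundle is unavailable and must be cooked again.
    with urllib.request.urlopen(absolute_fetch_url(status)) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
```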
.. _swh-vault-cli:
Command-line interface
======================
.. click:: swh.vault.cli:vault
   :prog: swh vault
   :nested: full
from swh.docs.sphinx.conf import * # NoQA
.. _vault-primer:
Getting started
===============
The Vault is a service in charge of reconstructing parts of the archive
as self-contained bundles, that can then be imported locally, for
instance in a Git repository. This is basically where you can do a
``git clone`` of a repository stored in Software Heritage.
The Vault is asynchronous: you first make a request to prepare
the bundle you need, and then a second request to fetch the bundle once
the Vault has finished reconstituting it.
Example: retrieving a directory
-------------------------------
First, ask the Vault to prepare your bundle:
.. code:: shell

   curl -X POST https://archive.softwareheritage.org/api/1/vault/flat/:swhid/

where ``:swhid`` is a :ref:`persistent-identifiers`. This initial request and all
subsequent requests to this endpoint will return some JSON data containing
information about the progress of bundle creation:
.. code:: json

   {
       "id": 42,
       "fetch_url": "/api/1/vault/flat/:swhid/raw/",
       "swhid": ":swhid",
       "progress_message": "Creating tarball...",
       "status": "pending"
   }

Once the status is ``done``, you can fetch the bundle at the address
given in the ``fetch_url`` field.
.. code:: shell

   curl -o bundle.tar.gz https://archive.softwareheritage.org/api/1/vault/flat/:swhid/raw
   tar xaf bundle.tar.gz

E-mail notifications
--------------------
You can also ask to be notified by e-mail once the bundle you requested is
ready, by giving an ``email`` POST parameter:
.. code:: shell

   curl -X POST -d 'email=example@example.com' \
        https://archive.softwareheritage.org/api/1/vault/flat/:swhid/

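
The same request can be prepared from Python with the standard library only.
This is a sketch under the assumption that the ``email`` parameter is sent
form-encoded, as ``curl -d`` does; ``build_cook_request`` is a name made up
for this example. The function only builds the request object; passing it to
``urllib.request.urlopen`` would actually send it.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_cook_request(
    url: str, email: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request asking the Vault to cook a bundle.

    Hypothetical helper: ``url`` is the cooking URL of the bundle and
    ``email`` is the optional address to notify once cooking has ended.
    """
    fields = {"email": email} if email else {}
    # Form-encode the body, like ``curl -d 'email=...'`` does.
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")
```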
API reference
~~~~~~~~~~~~~
For a more exhaustive overview of the Vault API, see the :ref:`vault-api-ref`.
.. _swh-vault:
.. include:: README.rst
The Vault backend
~~~~~~~~~~~~~~~~~
The Vault backend is the RPC server other |swh| components (mainly
:ref:`swh-web <swh-web>`) interact with.
It is in charge of receiving cooking requests, scheduling corresponding tasks (via
:ref:`swh-scheduler <swh-scheduler>` and Celery), getting heartbeats and final
results from these cooking tasks, and finally serving the results.
It uses the same RPC protocol as the other components of the archive, and
its interface is described in :mod:`swh.vault.interface`.
The cookers
~~~~~~~~~~~
Cookers are Python modules/classes, each in charge of cooking a type of bundle.
The main ones are :mod:`swh.vault.cookers.directory` for flat tarballs of directories,
and :mod:`swh.vault.cookers.git_bare` for bare ``.git`` repositories of any
type of git object.
They all derive from :class:`swh.vault.cookers.base.BaseVaultCooker`.
The base cooker first notifies the backend that the cooking task is in progress,
then runs the cooker (which does the bundle-specific handling and uploads the result),
then notifies the backend of the final result (success/failure).
Cookers may notify the backend of their progress, so it can be displayed in
swh-web's vault interface, which polls the status from the vault backend.
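
The sequence above can be pictured with a toy model. The classes below are
illustrative stand-ins written for this page: they are not
:class:`swh.vault.cookers.base.BaseVaultCooker` nor the real backend, and the
method names ``set_status``, ``set_progress`` and ``prepare_bundle`` are
assumptions of the sketch. Only the order of notifications and the status
values (``pending``, ``done``, ``failed``) follow the description.

```python
class RecordingBackend:
    """Stand-in for the vault backend: records what it is told."""

    def __init__(self):
        self.events = []

    def set_status(self, bundle_type, swhid, status):
        self.events.append(("status", status))

    def set_progress(self, bundle_type, swhid, message):
        self.events.append(("progress", message))


class SketchCooker:
    """Toy base cooker showing the notify / run / notify sequence."""

    bundle_type = "flat"

    def __init__(self, swhid, backend):
        self.swhid = swhid
        self.backend = backend

    def prepare_bundle(self):
        """Bundle-specific work; subclasses override this."""
        raise NotImplementedError

    def cook(self):
        # 1. Tell the backend that cooking is in progress.
        self.backend.set_status(self.bundle_type, self.swhid, "pending")
        try:
            # 2. Run the bundle-specific handling.
            self.prepare_bundle()
        except Exception as exc:
            # 3a. Report the failure and its reason.
            self.backend.set_progress(self.bundle_type, self.swhid, str(exc))
            self.backend.set_status(self.bundle_type, self.swhid, "failed")
            return
        # 3b. Report success.
        self.backend.set_status(self.bundle_type, self.swhid, "done")
```

A subclass would override ``prepare_bundle`` and may call
``self.backend.set_progress`` while it works, which corresponds to the
optional progress notifications.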
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   getting-started.rst
   api.rst

Reference Documentation
-----------------------

.. toctree::
   :maxdepth: 2

   cli

.. only:: standalone_package_doc

   Indices and tables
   ------------------

   * :ref:`genindex`
   * :ref:`modindex`
   * :ref:`search`

Software Heritage Vault
=======================
Software source code **objects**---e.g., individual source code files,
tarballs, commits, tagged releases, etc.---are stored in the Software Heritage
(SWH) Archive in fully deduplicated form. That allows direct access to
individual artifacts but requires some preparation, usually in the form of
collecting and assembling multiple artifacts in a single **bundle**, when fast
access to a set of related artifacts (e.g., the snapshot of a VCS repository,
the archive corresponding to a Git commit, or a specific software release as a
zip archive) is required.
The **Software Heritage Vault** is a cache of pre-built source code bundles
which are assembled by opportunistically retrieving objects from the Software
Heritage Archive, can be accessed efficiently, and might be garbage collected
after a long period of non-use.
Requirements
------------
* **Shared cache**
The vault is a cache shared among the various origins that the SWH archive
tracks. If the same bundle, originally coming from different origins, is
requested, a single entry for it in the cache shall exist.
* **Efficient retrieval**
Where supported by the desired access protocol (e.g., HTTP) it should be
possible for the vault to serve bundles efficiently (e.g., as static files
served via HTTP, possibly further proxied/cached at that level). In
particular, this rules out building bundles on the fly from the archive DB.
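
To make the shared-cache requirement concrete, here is a small in-memory
sketch. It is purely illustrative: ``BundleCache``, its methods, and the
time-to-live policy are assumptions of this example, and a real deployment
would store bundles as files rather than in a dictionary. The point it shows
is that entries are keyed by object kind and identifier, never by origin, so
two origins requesting the same object share a single entry, and that cooking
an already cached, unexpired bundle does no work.

```python
import time
from typing import Callable, Dict, Tuple


class BundleCache:
    """Toy cache of cooked bundles keyed by (object kind, object id)."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiration time, bundle bytes)
        self._entries: Dict[Tuple[str, str], Tuple[float, bytes]] = {}

    def cook(self, kind: str, object_id: str, build: Callable[[], bytes]) -> bytes:
        """Return the cached bundle, building it only if missing or expired."""
        key = (kind, object_id)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Still valid: cooking again is a no-op.
            return entry[1]
        bundle = build()
        self._entries[key] = (now + self._ttl, bundle)
        return bundle
```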
API
---
All URLs below are meant to be mounted at API root, which is currently at
<https://archive.softwareheritage.org/api/1/>. Unless otherwise stated, all API
endpoints respond to the HTTP GET method.
## Object identification
The vault stores bundles corresponding to different kinds of objects. The
following object kinds are supported:
* directories
* revisions
* repository snapshots
The URL fragment `:objectkind/:objectid` is used throughout the vault API to
fully identify vault objects. The syntax and meaning of :objectid for the
different object kinds is detailed below.
### Directories
* object kind: directory
* URL fragment: directory/:sha1git
where :sha1git is the directory ID in the SWH data model.
### Revisions
* object kind: revision
* URL fragment: revision/:sha1git
where :sha1git is the revision ID in the SWH data model.
### Repository snapshots
* object kind: snapshot
* URL fragment: snapshot/:sha1git
where :sha1git is the snapshot ID in the SWH data model. (**TODO** repository
snapshots don't exist yet as first-class citizens in the SWH data model; see
References below.)
## Cooking
Bundles in the vault might be ready for retrieval or not. When they are not,
they will need to be **cooked** before they can be retrieved. A cooked bundle
will remain around until it expires; at that point it will need to be cooked
again before it can be retrieved. Cooking is idempotent, and a no-op in between
a previous cooking operation and expiration.
To cook a bundle:
* POST /vault/:objectkind/:objectid
Request body: **TODO** something here in a JSON payload that would allow
notifying the user when the bundle is ready.
Response: 201 Created
## Retrieval
* GET /vault/:objectkind
(paginated) list of all bundles of a given kind available in the vault; see
Pagination. Note that, due to cache expiration, objects might disappear
between listing and subsequent actions on them.
Examples:
* GET /vault/directory
* GET /vault/revision
* GET /vault/:objectkind/:objectid
Retrieve a specific bundle from the vault.
Response:
* 200 OK: bundle available; response body is the bundle
* 404 Not Found: missing bundle; client should request its preparation (see Cooking)
References
----------
* [Repository snapshot objects](https://wiki.softwareheritage.org/index.php?title=User:StefanoZacchiroli/Repository_snapshot_objects)
* Amazon Web Services,
[API Reference for Amazon Glacier](http://docs.aws.amazon.com/amazonglacier/latest/dev/amazon-glacier-api.html);
specifically
[Job Operations](http://docs.aws.amazon.com/amazonglacier/latest/dev/job-operations.html)
TODO
====
* **TODO** pagination using HATEOAS
* **TODO** authorization: the cooking API should be somehow controlled to avoid
obvious abuses (e.g., let's cache everything)
* **TODO** finalize repository snapshot proposal