Showing with 1499 additions and 890 deletions
@startuml
participant DEPOSIT as "deposit API"
participant DEPOSIT_DATABASE as "deposit DB"
participant LOADER_TASK as "loader task"
participant STORAGE as "swh-storage"
participant CELERY as "celery"
participant SCHEDULER as "swh-scheduler"
activate DEPOSIT
activate DEPOSIT_DATABASE
activate STORAGE
activate CELERY
activate SCHEDULER
SCHEDULER ->> CELERY: new "load-deposit"\ntask available
CELERY ->> LOADER_TASK: start task
activate LOADER_TASK
LOADER_TASK ->> DEPOSIT: GET /{collection}/{deposit_id}/raw/
DEPOSIT ->> DEPOSIT_DATABASE: get deposit requests
DEPOSIT_DATABASE ->> DEPOSIT: deposit requests
loop for each request
DEPOSIT ->> DEPOSIT_DATABASE: get archive
DEPOSIT_DATABASE ->> DEPOSIT: archive content
DEPOSIT ->> DEPOSIT: aggregate
end
DEPOSIT ->> LOADER_TASK: tarball
LOADER_TASK ->> LOADER_TASK: unpack on disk
loop
LOADER_TASK ->> LOADER_TASK: load objects
LOADER_TASK ->> STORAGE: store objects
end
LOADER_TASK -> DEPOSIT: PUT /{collection}/{deposit_id}/status
DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
LOADER_TASK ->> CELERY: done
deactivate LOADER_TASK
CELERY ->> SCHEDULER: done
@enduml
@startuml
participant CLIENT as "SWORD client"
participant DEPOSIT as "deposit API"
participant DEPOSIT_DATABASE as "deposit DB"
participant STORAGE as "swh-storage"
participant SCHEDULER as "swh-scheduler"
activate CLIENT
activate DEPOSIT
activate DEPOSIT_DATABASE
activate STORAGE
activate SCHEDULER
CLIENT ->> DEPOSIT: Atom and/or archive
DEPOSIT ->> DEPOSIT_DATABASE: create new deposit
DEPOSIT_DATABASE -->> DEPOSIT: return deposit_id
DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
loop while the previous request has "In-Progress: true"
DEPOSIT ->> CLIENT: deposit receipt\n("partial")
CLIENT ->> DEPOSIT: Atom and/or archive
DEPOSIT ->> DEPOSIT_DATABASE: record deposit request
end
alt if metadata-only
DEPOSIT ->> STORAGE: target exists?
STORAGE ->> DEPOSIT: true
DEPOSIT ->> STORAGE: insert metadata
DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "done"
else
DEPOSIT ->> SCHEDULER: schedule checks
DEPOSIT ->> DEPOSIT_DATABASE: mark deposit as "loading"
end
DEPOSIT ->> CLIENT: deposit receipt\n("done" or "loading")
@enduml
@startuml
hide empty description
state request <<choice>>
[*] --> request : POST Col-IRI
request --> deposited : [ without In-Progress: true ]
request --> partial : [ with In-Progress: true ]
partial --> request : PUT EM-IRI
partial --> expired : [ if no further requests are sent]
state validation <<choice>>
deposited --> validation : checker runs
validation --> verified
validation --> rejected : [ validation failed ]
verified --> loading : loader starts
loading --> done
loading --> failed
@enduml
.. _swh-deposit: .. _swh-deposit:
Software Heritage - Deposit .. include:: README.rst
===========================
Push-based deposit of software source code artifacts to the archive.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
:caption: Contents: :caption: Contents:
getting-started api/index
spec-api internals/index
metadata specs/index
dev-info
sys-info
specs/specs
Reference Documentation Reference Documentation
...@@ -24,4 +17,13 @@ Reference Documentation ...@@ -24,4 +17,13 @@ Reference Documentation
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
/apidoc/swh.deposit cli
.. only:: standalone_package_doc
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. _authentication:
Authentication
==============
This is a description of the authentication mechanism used in the deposit server. Both
`basic authentication <https://tools.ietf.org/html/rfc7617>`_ and `keycloak`_ schemes
are supported through configuration.
Basic
-----
The first implementation uses `basic authentication
<https://tools.ietf.org/html/rfc7617>`_. The deposit server checks
the authentication credentials sent by the deposit client using its own database. If
authorized, the deposit client is allowed to continue its deposit. Otherwise, a 401
response is returned to the client.
.. figure:: ../images/deposit-authentication-basic.svg
:alt: Basic Authentication
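
As a minimal sketch of what a deposit client sends under this scheme, the
``Authorization`` header described in RFC 7617 can be built as follows. The helper
name and the credentials are illustrative assumptions, not part of swh-deposit.

.. code:: python

   import base64


   def basic_auth_header(username: str, password: str) -> dict:
       """Build an HTTP Basic ``Authorization`` header (RFC 7617)."""
       # Credentials are joined with a colon, UTF-8 encoded, then base64 encoded.
       token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
       return {"Authorization": "Basic " + token.decode("ascii")}

Such a header would accompany every request sent to the deposit API; the server then
compares the decoded credentials with the ones stored in its own database.
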
Keycloak
--------
Recent changes introduced `keycloak`_, an Open Source Identity and Access Management
tool which is already used in other parts of the swh stack.
The authentication is delegated to the `swh keycloak instance
<https://auth.softwareheritage.org/auth/>`_ using the `Resource Owner Password
Credentials <https://tools.ietf.org/html/rfc6749#section-1.3.3>`_ scheme.
Deposit clients still use the deposit as before. Transparently for them, the deposit
server forwards their credentials to keycloak for validation. If `keycloak`_ authorizes
the deposit client, the deposit further checks that the deposit client has the proper
permission "swh.deposit.api". If they do, they can post their deposits.
If any issue arises during one of the authentication checks, the client receives a 401
response (unauthorized).
.. figure:: ../images/deposit-authentication-keycloak.svg
:alt: Keycloak Authentication
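
The two-step check described above (credentials first, then the permission) can be
sketched as follows. This is only an illustration under stated assumptions:
``fetch_permissions`` is a hypothetical stand-in for the call to keycloak, and the
``200`` returned on success merely means "the client may proceed", not the actual
status of a deposit response.

.. code:: python

   from typing import Callable, Optional, Set

   DEPOSIT_PERMISSION = "swh.deposit.api"


   def authenticate(
       username: str,
       password: str,
       fetch_permissions: Callable[[str, str], Optional[Set[str]]],
   ) -> int:
       """Return 401 when any authentication check fails, 200 otherwise.

       ``fetch_permissions`` (hypothetical) returns the permissions granted to the
       user, or None when the identity provider rejects the credentials.
       """
       permissions = fetch_permissions(username, password)
       if permissions is None:
           return 401  # credentials rejected
       if DEPOSIT_PERMISSION not in permissions:
           return 401  # valid credentials, but no deposit permission
       return 200  # the client may proceed with its deposit
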
.. _keycloak: https://www.keycloak.org/
Hacking on swh-deposit .. _swh-deposit-dev-env:
======================
Running swh-deposit locally
===========================
There are multiple modes to run and test the server locally: There are multiple modes to run and test the server locally:
...@@ -20,14 +22,14 @@ db is expected to be called swh-deposit-dev. ...@@ -20,14 +22,14 @@ db is expected to be called swh-deposit-dev.
To simplify the use, the following makefile targets can be used: To simplify the use, the following makefile targets can be used:
schema schema
~~~~~~ ^^^^^^
.. code:: shell .. code:: shell
make db-create db-prepare db-migrate make db-create db-prepare db-migrate
data data
~~~~ ^^^^
Once the db is created, you need some data to be injected (request Once the db is created, you need some data to be injected (request
types, client, collection, etc...): types, client, collection, etc...):
...@@ -63,7 +65,7 @@ Add the following to ``../private-data.yaml``: ...@@ -63,7 +65,7 @@ Add the following to ``../private-data.yaml``:
url: https://hal.inria.fr url: https://hal.inria.fr
drop drop
~~~~ ^^^^
For information, you can drop the db: For information, you can drop the db:
...@@ -78,7 +80,7 @@ Development-like environment needs one configuration file to work ...@@ -78,7 +80,7 @@ Development-like environment needs one configuration file to work
properly. properly.
Configuration Configuration
~~~~~~~~~~~~~ ^^^^^^^^^^^^^
**``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/server.yml**: **``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/server.yml**:
...@@ -99,7 +101,7 @@ Configuration ...@@ -99,7 +101,7 @@ Configuration
max_upload_size: 20971520 max_upload_size: 20971520
Run Run
~~~ ^^^
Run the local server, using the default configuration file: Run the local server, using the default configuration file:
...@@ -116,7 +118,7 @@ configuration file to work properly. ...@@ -116,7 +118,7 @@ configuration file to work properly.
This is more close to what's actually running in production. This is more close to what's actually running in production.
Configuration Configuration
~~~~~~~~~~~~~ ^^^^^^^^^^^^^
This expects the same file describes in the previous chapter. Plus, an This expects the same file describes in the previous chapter. Plus, an
additional private section file containing private information that is additional private section file containing private information that is
...@@ -145,7 +147,7 @@ A production configuration file would look like: ...@@ -145,7 +147,7 @@ A production configuration file would look like:
password: user-password password: user-password
Run Run
~~~ ^^^
.. code:: shell .. code:: shell
......
.. _swh-deposit-internals:
Deposit internals
=================
This chapter describes how swh-deposit works internally,
and how to run it (either in production or locally for development).
.. toctree::
:maxdepth: 1
dev-environment
prod-environment
authentication
loading-workflow
Loading workflow
================
This section complements the :ref:`deposit-use-cases` documentation,
by detailing how deposits are handled internally after clients deposited them.
Reception
---------
For every HTTP request sent by a client, the deposit API checks some simple properties,
then creates a :class:`swh.deposit.models.DepositRequest`
object containing the data uploaded by the client verbatim (archive and/or metadata),
and inserts it in the database.
A corresponding :class:`swh.deposit.models.Deposit` object is also created
and inserted, if this is the initial request creating a deposit.
Upon receiving the last request, identified by the lack of the ``In-Progress: true``
header, the deposit server either:
* checks that the target object exists in :ref:`swh-storage <swh-storage>`,
then sends a request to swh-storage with the Atom metadata and updates the
deposit status to ``done``,
if it is a :ref:`metadata-only deposit <use-case-metadata-only-deposit>`
* updates the deposit status and schedules a checking task by querying
:ref:`swh-scheduler <swh-scheduler>`, otherwise
Graphically:
.. figure:: ../images/deposit-workflow-reception.svg
:alt:
For metadata-only deposits, this is the end of the story.
The next section narrates what happens next for "normal" deposits.
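
The reception logic above can be summarized with a small sketch. It is a simplified
model, not the server's code: the ``Deposit`` class is a hypothetical stand-in for
:class:`swh.deposit.models.Deposit`, the ``deposited`` status name is taken from the
status diagram, and the calls to swh-storage and swh-scheduler are left out.

.. code:: python

   from dataclasses import dataclass, field
   from typing import List


   @dataclass
   class Deposit:
       """Hypothetical, simplified stand-in for the deposit model."""

       status: str = "partial"
       requests: List[bytes] = field(default_factory=list)


   def receive(
       deposit: Deposit,
       payload: bytes,
       in_progress: bool,
       metadata_only: bool = False,
   ) -> str:
       """Record one client request and return the resulting deposit status."""
       # Every request is stored verbatim, whatever happens next.
       deposit.requests.append(payload)
       if in_progress:
           # "In-Progress: true": more requests are expected.
           deposit.status = "partial"
       elif metadata_only:
           # Metadata is sent to swh-storage right away (not modeled here).
           deposit.status = "done"
       else:
           # A checking task would be scheduled at this point (not modeled here).
           deposit.status = "deposited"
       return deposit.status
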
Checking
--------
As we saw above, the deposit API server's synchronous work ends after sending
a checking task.
This task is implemented by :class:`swh.deposit.loader.checker.DepositChecker`,
which is simply another call to the deposit API,
implemented in :class:`swh.deposit.api.private.deposit_check.APIChecks`.
This API performs longer checks, which require inspecting the deposited archive
(or archives, for clients depositing archives in multiple steps).
This is why it is run by an asynchronous task instead of being checked immediately
when the client sent a query.
When it is done, it sets the deposit's status to "verified" (so clients polling
for the status know this step succeeded) and schedules a loading task.
Graphically:
.. figure:: ../images/deposit-workflow-checking.svg
:alt:
Note that the check task is actually just a thin wrapper around an API call.
While the checks could be done in the task itself, it would mean sending
all archives from the deposit API to the celery worker, which would be inefficient.
And the gains would not be great, as checking tasks only need to decompress archives,
which is not resource intensive.
Instead, this long-running call to the API proved to be a simpler
and more efficient solution at the current scale of the deposit.
Loading
-------
When the check task finished, it scheduled a load task, implemented by
:class:`swh.loader.package.deposit.loader.DepositLoader`.
It is part of the ``swh.loader.package`` package instead of ``swh-deposit``,
because its design is close to other :ref:`package loaders <swh-loader-core>`:
1. fetch a tarball
2. extract it
3. use :mod:`swh.model.from_disk` to build SWH objects from it
4. load these objects in :ref:`swh-storage <swh-storage>`
The only difference in this process is fetching the tarball from the deposit server,
instead of external repositories.
This tarball is returned by :class:`swh.deposit.api.private.deposit_read`,
which creates it by aggregating all archives sent by the client (usually
only one, but the SWORD protocol allows more).
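
To illustrate what "aggregating" means here, the following sketch unpacks several
archives into one directory and repacks the result as a single tarball. It only uses
the Python standard library; the function name is an assumption and this is not the
code used by the deposit server.

.. code:: python

   import shutil
   import tarfile
   import tempfile
   from typing import Iterable


   def aggregate_archives(archive_paths: Iterable[str], output_path: str) -> str:
       """Unpack every archive into one directory, then repack it as a tarball."""
       with tempfile.TemporaryDirectory() as workdir:
           for path in archive_paths:
               # The archive format (zip, tar, ...) is guessed from the extension.
               shutil.unpack_archive(path, workdir)
           with tarfile.open(output_path, "w") as tarball:
               # arcname="." stores members relative to the top of the tarball.
               tarball.add(workdir, arcname=".")
       return output_path

Archives are unpacked in the order given, so a file present in several archives ends
up with the content of the last one.
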
Finally, when it is done, the loader updates the deposit status via the deposit API.
Graphically:
.. figure:: ../images/deposit-workflow-loading.svg
:alt:
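
The statuses a deposit goes through during reception, checking and loading can be
modeled as a small transition table. The table below is a sketch derived from the
status diagram; the names ``TRANSITIONS`` and ``can_transition`` are illustrative
and do not exist in swh-deposit.

.. code:: python

   # Allowed status changes, as read from the status diagram.
   TRANSITIONS = {
       "partial": {"partial", "deposited", "expired"},
       "deposited": {"verified", "rejected"},
       "verified": {"loading"},
       "loading": {"done", "failed"},
   }


   def can_transition(current: str, new: str) -> bool:
       """Tell whether a deposit may move from ``current`` to ``new``."""
       # Statuses without an entry (done, failed, ...) are final in this model.
       return new in TRANSITIONS.get(current, set())
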
.. _swh-deposit-prod-env:
Production deployment
=====================
The deposit is architected around 3 parts:
- server: a django application exposing an xml api, discussing with a postgresql
backend (and optionally a keycloak instance)
- worker(s): 1 worker service dedicated to check the deposit archive and metadata are
correct (the checker), another worker service dedicated to actually ingest the
deposit into the swh archive.
- client: a python script providing the ``swh deposit`` command line interface.
All those are packaged in 3 separate debian packages, created and uploaded to the swh
debian repository. The deposit server and workers configuration are managed by puppet
(cf. puppet-environment/swh-site, puppet-environment/swh-role,
puppet-environment/swh-profile).
In the following document, we will focus on the server actions that may be needed once
the server is installed or upgraded.
Prepare the database setup (existence, connection, etc...).
-----------------------------------------------------------
This is defined through the packaged module ``swh.deposit.settings.production`` and the
expected **/etc/softwareheritage/deposit/server.yml** configuration file.
Environment (production/staging)
--------------------------------
``SWH_CONFIG_FILENAME`` must be defined and target the deposit server configuration file.
So either 1. prefix the following commands or 2. export the environment variable in your
shell session. For the remaining part of the documentation, we assume 2. has been
configured.
.. code:: shell
export SWH_CONFIG_FILENAME=/etc/softwareheritage/deposit/server.yml
Migrate the db schema
---------------------
The debian package may integrate some new schema modifications. To run them:
.. code:: shell
sudo django-admin migrate --settings=swh.deposit.settings.production
.. _swh-deposit-add-client-and-collection:
Add client and collection
-------------------------
The deposit can be configured to use either the 1. django basic authentication framework
or the 2. swh keycloak instance. If the server uses 2., the password is managed by
keycloak so the option ``--password`` is ignored.
* basic
.. code:: shell
swh deposit admin \
--config-file $SWH_CONFIG_FILENAME \
--platform production \
user create \
--collection <collection-name> \
--username <client-name> \
--password <to-define>
This adds a user ``<client-name>`` which can access the collection
``<collection-name>``. The password will be used for checking the authentication access
to the deposit api (if 1. is used).
Note:
- If the collection does not exist, it is created alongside
- The password, if required, is passed as plain text but stored encrypted
Reschedule a deposit
---------------------
If for some reason, the loading failed, after fixing and deploying the new deposit
loader, you can reschedule the impacted deposit through:
.. code:: shell
swh deposit admin \
--config-file $SWH_CONFIG_FILENAME \
--platform production \
deposit reschedule \
--deposit-id <deposit-id>
This will:
- check that the deposit's status is something reasonable (failed or done). That means
the checks have passed but something went wrong during the loading (failed: the loading
failed; done: the loading succeeded but, because of a bug for instance, it must be redone)
- reset the deposit's status to 'verified' (prior to any loading but after the checks,
which are fine) and remove the different archives' identifiers (swh-id, ...)
- trigger back the loading task through the scheduler
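
The three steps above can be sketched as follows. This is a simplified model under
stated assumptions: the ``Deposit`` class, its ``swhid`` field and the
``schedule_load`` callback are hypothetical and only mirror the description, not the
implementation behind ``swh deposit admin deposit reschedule``.

.. code:: python

   from dataclasses import dataclass
   from typing import Callable, Optional


   @dataclass
   class Deposit:
       """Hypothetical, simplified stand-in for a deposit record."""

       id: int
       status: str
       swhid: Optional[str] = None


   def reschedule(deposit: Deposit, schedule_load: Callable[[int], None]) -> None:
       """Reset a loaded (or failed) deposit and trigger its loading again."""
       # 1. Only deposits whose checks already passed can be rescheduled.
       if deposit.status not in ("failed", "done"):
           raise ValueError(f"cannot reschedule a deposit in status {deposit.status!r}")
       # 2. Go back to the state reached right after the checks.
       deposit.status = "verified"
       deposit.swhid = None
       # 3. Ask the scheduler (here, a callback) to run the loading task again.
       schedule_load(deposit.id)
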
Integration checks
------------------
There exist icinga checks running periodically on `staging`_ and `production`_
instances. If any problem arises, expect those to notify the #swh-sysadm irc channel.
.. _staging: https://icinga.softwareheritage.org/search?q=deposit#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=staging%20Check%20deposit%20end-to-end
.. _production: https://icinga.softwareheritage.org/search?q=deposit#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=production%20Check%20deposit%20end-to-end
.. _deposit-metadata: :orphan:
Deposit metadata This page was moved to: :ref:`deposit-metadata`.
================
When making a software deposit into the SWH archive, one can add
information describing the software artifact and the software project.
Metadata requirements
---------------------
- **the schema/vocabulary** used *MUST* be specified with a persistent url
(DublinCore, DOAP, CodeMeta, etc.)
.. code:: xml
<entry xmlns="http://www.w3.org/2005/Atom">
or
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:dcterms="http://purl.org/dc/terms/">
or
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
- **the name** of the software deposit *MUST* be provided [atom:title,
codemeta:name, dcterms:title]
- **the authors** of the software deposit *MUST* be provided
- **the url** representing the location of the source *MAY* be provided under
the url tag. The url will be used for creating an origin object in the
archive.
.. code:: xml
<codemeta:url>www.url-example.com</codemeta:url>
- **the external\_identifier** *MAY* be provided as an identifier
- **the external\_identifier** *SHOULD* match the Slug external-identifier in
the header
- **the description** of the software deposit *SHOULD* be provided
[codemeta:description]: short or long description of the software
- **the license/s** of the software
deposit *SHOULD* be provided [codemeta:license]
- other metadata *MAY* be added with terms defined by the schema in use.
Examples
--------
Using only Atom
~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
<title>Awesome Compiler</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>1785io25c695</external_identifier>
<updated>2017-10-07T15:17:08Z</updated>
<author>some awesome author</author>
</entry>
Using Atom with CodeMeta
~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<title>Awesome Compiler</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>1785io25c695</external_identifier>
<codemeta:id>1785io25c695</codemeta:id>
<codemeta:url>origin url</codemeta:url>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:description>description</codemeta:description>
<codemeta:keywords>key-word 1</codemeta:keywords>
<codemeta:keywords>key-word 2</codemeta:keywords>
<codemeta:dateCreated>creation date</codemeta:dateCreated>
<codemeta:datePublished>publication date</codemeta:datePublished>
<codemeta:releaseNotes>comment</codemeta:releaseNotes>
<codemeta:referencePublication>
<codemeta:name> article name</codemeta:name>
<codemeta:identifier> article id </codemeta:identifier>
</codemeta:referencePublication>
<codemeta:isPartOf>
<codemeta:type> Collaboration/Projet </codemeta:type>
<codemeta:name> project name</codemeta:name>
<codemeta:identifier> id </codemeta:identifier>
</codemeta:isPartOf>
<codemeta:relatedLink>see also </codemeta:relatedLink>
<codemeta:funding>Sponsor A </codemeta:funding>
<codemeta:funding>Sponsor B</codemeta:funding>
<codemeta:operatingSystem>Platform/OS </codemeta:operatingSystem>
<codemeta:softwareRequirements>dependencies </codemeta:softwareRequirements>
<codemeta:softwareVersion>Version</codemeta:softwareVersion>
<codemeta:developmentStatus>active </codemeta:developmentStatus>
<codemeta:license>
<codemeta:name>license</codemeta:name>
<codemeta:url>url spdx</codemeta:url>
</codemeta:license>
<codemeta:runtimePlatform>.Net Framework 3.0 </codemeta:runtimePlatform>
<codemeta:runtimePlatform>Python2.3</codemeta:runtimePlatform>
<codemeta:author>
<codemeta:name> author1 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:author>
<codemeta:name> author2 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:codeRepository>http://code.com</codemeta:codeRepository>
<codemeta:programmingLanguage>language 1</codemeta:programmingLanguage>
<codemeta:programmingLanguage>language 2</codemeta:programmingLanguage>
<codemeta:issueTracker>http://issuetracker.com</codemeta:issueTracker>
</entry>
Using Atom with DublinCore and CodeMeta (multi-schema entry)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<title>Awesome Compiler</title>
<client>hal</client>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>%s</external_identifier>
<dcterms:identifier>hal-01587361</dcterms:identifier>
<dcterms:identifier>doi:10.5281/zenodo.438684</dcterms:identifier>
<dcterms:title xml:lang="en">The assignment problem</dcterms:title>
<dcterms:title xml:lang="fr">AffectationRO</dcterms:title>
<dcterms:creator>author</dcterms:creator>
<dcterms:subject>[INFO] Computer Science [cs]</dcterms:subject>
<dcterms:subject>[INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO]</dcterms:subject>
<dcterms:type>SOFTWARE</dcterms:type>
<dcterms:abstract xml:lang="en">Project in OR: The assignment problemA java implementation for the assignment problem first release</dcterms:abstract>
<dcterms:abstract xml:lang="fr">description fr</dcterms:abstract>
<dcterms:created>2015-06-01</dcterms:created>
<dcterms:available>2017-10-19</dcterms:available>
<dcterms:language>en</dcterms:language>
<codemeta:url>origin url</codemeta:url>
<codemeta:softwareVersion>1.0.0</codemeta:softwareVersion>
<codemeta:keywords>key word</codemeta:keywords>
<codemeta:releaseNotes>Comment</codemeta:releaseNotes>
<codemeta:referencePublication>Référence interne </codemeta:referencePublication>
<codemeta:relatedLink>link </codemeta:relatedLink>
<codemeta:funding>Sponsor </codemeta:funding>
<codemeta:operatingSystem>Platform/OS </codemeta:operatingSystem>
<codemeta:softwareRequirements>dependencies </codemeta:softwareRequirements>
<codemeta:developmentStatus>Ended </codemeta:developmentStatus>
<codemeta:license>
<codemeta:name>license</codemeta:name>
<codemeta:url>url spdx</codemeta:url>
</codemeta:license>
<codemeta:codeRepository>http://code.com</codemeta:codeRepository>
<codemeta:programmingLanguage>language 1</codemeta:programmingLanguage>
<codemeta:programmingLanguage>language 2</codemeta:programmingLanguage>
</entry>
Note
----
We aim at harmonizing the metadata from different origins and thus
metadata will be translated to the `CodeMeta
v.2 <https://doi.org/10.5063/SCHEMA/CODEMETA-2.0>`__ vocabulary if
possible.
API Specification :orphan:
=================
This is `Software Heritage <https://www.softwareheritage.org>`__'s This page was moved to: :ref:`deposit-api-specifications`.
`SWORD
2.0 <http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html>`__
Server implementation.
**S.W.O.R.D** (**S**\ imple **W**\ eb-Service **O**\ ffering
**R**\ epository **D**\ eposit) is an interoperability standard for
digital file deposit.
This implementation will permit interaction between a client (a repository) and
a server (SWH repository) to push deposits of software source code archives
with associated metadata.
*Note:*
* In the following document, we will use the ``archive`` or ``software source
code archive`` interchangeably.
* The supported archive formats are:
* zip: common zip archive (no multi-disk zip files).
* tar: tar archive without compression or optionally any of the following
compression algorithm gzip (.tar.gz, .tgz), bzip2 (.tar.bz2) , or lzma
(.tar.lzma)
Collection
----------
SWORD defines a ``collection`` concept. In SWH's case, this collection
refers to a group of deposits. A ``deposit`` is some form of software
source code archive(s) associated with metadata.
By default the client's collection will have the client's name.
Limitations
-----------
* upload limitation of 100 MiB
* no mediation
API overview
------------
API access is over HTTPS.
The API is protected through basic authentication.
Endpoints
---------
The API endpoints are rooted at https://deposit.softwareheritage.org/1/.
Data is sent and received as XML (as specified in the SWORD 2.0
specification).
.. include:: endpoints/service-document.rst
.. include:: endpoints/collection.rst
.. include:: endpoints/update-media.rst
.. include:: endpoints/update-metadata.rst
.. include:: endpoints/status.rst
.. include:: endpoints/content.rst
Possible errors:
----------------
* common errors:
* 401 (unauthenticated) if a client does not provide credentials or provides
wrong ones
* 403 (forbidden) if a client tries to access a collection it does not own
* 404 (not found) if a client tries to access an unknown collection
* 404 (not found) if a client tries to access an unknown deposit
* 415 (unsupported media type) if a wrong media type is provided to the
endpoint
* archive/binary deposit:
* 403 (forbidden) if the length of the archive exceeds the max size
configured
* 412 (precondition failed) if the length or hash provided mismatch the
reality of the archive.
* 415 (unsupported media type) if a wrong media type is provided
* multipart deposit:
* 412 (precondition failed) if the md5 hash provided mismatch the reality of
the archive
* 415 (unsupported media type) if a wrong media type is provided
* Atom entry deposit:
* 400 (bad request) if the request's body is empty (for creation only)
Sources
-------
* `SWORD v2 specification
<http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html>`__
* `arxiv documentation <https://arxiv.org/help/submit_sword>`__
* `Dataverse example <http://guides.dataverse.org/en/4.3/api/sword.html>`__
* `SWORD used on HAL <https://api.archives-ouvertes.fr/docs/sword>`__
* `xml examples for CCSD <https://github.com/CCSDForge/HAL/tree/master/Sword>`__
Use cases :orphan:
---------
This page was moved to: :ref:`deposit-use-cases`
Deposit creation
~~~~~~~~~~~~~~~~
From client's deposit repository server to SWH's repository server:
1. The client requests the server's abilities and its associated collection
(GET query to the *SD/service document uri*)
2. The server answers the client with the service document which gives the
*collection uri* (also known as *COL/collection IRI*).
3. The client sends a deposit (optionally a zip archive, some metadata or both)
through the *collection uri*.
This can be done in:
* one POST request (metadata + archive).
* one POST request (metadata or archive) + other PUT or POST request to the
*update uris* (*edit-media iri* or *edit iri*)
a. Server validates the client's input or returns detailed error if any
b. Server stores information received (metadata or software archive source
code or both)
4. The server notifies the client it acknowledged the client's request. An
``http 201 Created`` response with a deposit receipt in the body response is
sent back. That deposit receipt will hold the necessary information to
eventually complete the deposit later on if it was incomplete (also known as
status ``partial``).
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884278} -->
.. figure:: ../images/deposit-create-chart.png
:alt:
Updating an existing deposit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5. Client updates existing deposit through the *update uris* (one or more POST
or PUT requests to either the *edit-media iri* or *edit iri*).
1. Server validates the client's input or returns detailed error if any
2. Server stores information received (metadata or software archive source
code or both)
This would be the case for example if the client initially posted a
``partial`` deposit (e.g. only metadata with no archive, or an archive
without metadata, or a split archive because the initial one exceeded
the limit size imposed by swh repository deposit)
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884302} -->
.. figure:: ../images/deposit-update-chart.png
:alt:
Deleting deposit (or associated archive, or associated metadata)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6. Deposit deletion is possible as long as the deposit is still in ``partial``
state.
1. Server validates the client's input or returns detailed error if any
2. Server actually deletes information according to the request
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884311} -->
.. figure:: ../images/deposit-delete-chart.png
:alt:
Client asks for operation status
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7. Operation status can be read through a GET query to the *state iri*.
Server: Triggering deposit checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the status ``deposited`` is reached for a deposit, checks for the
associated archive(s) and metadata will be triggered. If those checks
fail, the status is changed to ``rejected`` and nothing more happens
there. Otherwise, the status is changed to ``verified``.
Server: Triggering deposit load
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the status ``verified`` is reached for a deposit, loading the
deposit with its associated metadata will be triggered.
The loading will result in a status update, either ``done`` or ``failed``
(depending on the loading's status).
This is described in the `loading document <./spec-loading.html>`__.
.. _swh-deposit-specs: .. _swh-deposit-specs:
Blueprint Specifications Specifications
========================= ==============
.. toctree:: .. toctree::
:maxdepth: 1 :maxdepth: 1
:caption: Contents: :caption: Contents:
blueprint.rst
spec-loading.rst spec-loading.rst
spec-sparse-deposit.rst protocol-reference.rst
spec-meta-deposit.rst spec-meta-deposit.rst
<?xml version="1.0"?> <?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" <entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0" xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit"> xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<author> <author>
<name>HAL</name> <name>HAL</name>
<email>hal@ccsd.cnrs.fr</email> <email>hal@ccsd.cnrs.fr</email>
</author> </author>
<client>hal</client> <client>hal</client>
<external_identifier>hal-01243573</external_identifier> <codemeta:name>The assignment problem</codemeta:name>
<codemeta:name>The assignment problem</codemeta:name> <codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url>
<codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url> <codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier> <codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory> <codemeta:description>description</codemeta:description>
<codemeta:description>description</codemeta:description> <codemeta:author>
<codemeta:author> <codemeta:name> author1 </codemeta:name>
<codemeta:name> author1 </codemeta:name> <codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> Inria </codemeta:affiliation> <codemeta:affiliation> UPMC </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation> </codemeta:author>
</codemeta:author> <codemeta:author>
<codemeta:author> <codemeta:name> author2 </codemeta:name>
<codemeta:name> author2 </codemeta:name> <codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> Inria </codemeta:affiliation> <codemeta:affiliation> UPMC </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation> </codemeta:author>
</codemeta:author> <swh:deposit>
<swh:deposit> <swh:create_origin>
<swh:bindings> <swh:origin url="http://has.archives-ouvertes.fr/hal-01243573" />
<swh:binding source="path/to/file.txt" destination="aaaaaaaaaaa..."/> </swh:create_origin>
</swh:bindings> </swh:deposit>
</swh:deposit> </entry>
</entry>
.. _deposit-protocol:
Protocol reference
==================
The swh-deposit protocol is an extension of the SWORDv2_ protocol, and the
swh-deposit client and server should work with any other SWORDv2-compliant
implementation which provides some :ref:`mandatory attributes <mandatory-attributes>`.
However, we define some extensions by the means of extra tags in the Atom
entries, that should be used when interacting with the server to use it optimally.
This means the swh-deposit server should work with a generic SWORDv2 client, but
works much better with these extensions.
All these tags are in the ``https://www.softwareheritage.org/schema/2018/deposit``
XML namespace, denoted using the ``swhdeposit`` prefix in this section.
.. _deposit-create_origin:
Origin creation with the ``<swhdeposit:create_origin>`` tag
-----------------------------------------------------------
Motivation
^^^^^^^^^^
This is the main extension we define.
This tag is used after a deposit is completed, to load it in the Software Heritage
archive.
The SWH archive references source code repositories by a URI, called the
:term:`origin` URL.
This URI is clearly defined when SWH pulls source code from such a repository,
but not for the push approach used by SWORD, as SWORD clients do not intrinsically
have a URL.
Usage
^^^^^
Instead, clients are expected to provide the origin URL themselves, by adding
a tag in the Atom entry they submit to the server, like this:
.. code:: xml
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<!-- ... -->
<swh:deposit>
<swh:create_origin>
<swh:origin url="https://example.org/b063bf3a-e98e-40a0-b918-3e42b06011ba" />
</swh:create_origin>
</swh:deposit>
<!-- ... -->
</atom:entry>
This will create an origin in the Software Heritage archive, that will point to
the source code artifacts of this deposit.
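As an illustration, the Atom entry above could be generated on the client side with Python's standard library. This is a minimal sketch, not the swh-deposit client implementation: the function name ``build_create_origin_entry`` is hypothetical, and a real entry would also carry the mandatory metadata elided by ``<!-- ... -->``.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
SWH_NS = "https://www.softwareheritage.org/schema/2018/deposit"

# Use the same prefixes as the example above when serializing.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("swh", SWH_NS)


def build_create_origin_entry(origin_url: str) -> bytes:
    """Return a minimal Atom entry carrying a create_origin tag.

    Hypothetical helper: it only emits the swh:deposit subtree.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    deposit = ET.SubElement(entry, f"{{{SWH_NS}}}deposit")
    create_origin = ET.SubElement(deposit, f"{{{SWH_NS}}}create_origin")
    ET.SubElement(create_origin, f"{{{SWH_NS}}}origin", {"url": origin_url})
    return ET.tostring(entry, encoding="utf-8", xml_declaration=True)
```

Building the tree with namespace-qualified tags, rather than concatenating strings, leaves escaping of the URL attribute to the XML library.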
Semantics of origin URLs
^^^^^^^^^^^^^^^^^^^^^^^^
Origin URLs must be unique to an origin, i.e. to a software project.
The exact definition of a "software project" is left to the clients of the deposit.
Origin URLs should be designed so that future releases of the same software will have
the same origin URL.
As a guideline, consider that every GitHub/GitLab project is an origin,
and every package in Debian/NPM/PyPI is also an origin.
While origin URLs are not required to resolve to a source code artifact,
we recommend they point to a public resource describing the software project,
including a link to download its source code.
This is not a technical requirement, but it improves discoverability.
.. _swh-deposit-provider-url-definition:
Clients may not submit arbitrary URLs; the server checks that each URL they submit
belongs to a "namespace" they own, known as the ``provider_url`` of the client. For
example, if a client has their ``provider_url`` set to ``https://example.org/foo/``, they
will only be able to submit deposits to origins whose URL starts with
``https://example.org/foo/``.
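The rule above amounts to a prefix check. The following sketch only restates it in code; the function name is hypothetical, and the actual server may normalize or validate URLs in ways not described here.

```python
def origin_url_in_namespace(origin_url: str, provider_url: str) -> bool:
    """Return True if origin_url starts with the client's provider_url.

    Hypothetical helper restating the documented rule as a plain
    string-prefix comparison, with no URL normalization.
    """
    return origin_url.startswith(provider_url)
```

With a ``provider_url`` of ``https://example.org/foo/``, the URL ``https://example.org/foo/my-software`` passes this check and ``https://example.org/bar/my-software`` does not.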
Fallbacks
^^^^^^^^^
If ``<swhdeposit:create_origin>`` is not provided (either because the client is a generic
SWORDv2 implementation or an old implementation of an swh-deposit client), the server
falls back to creating an origin whose URL is the concatenation of the ``provider_url``
and the ``Slug`` header (as defined in the AtomPub_ specification).
If the ``Slug`` header is missing, the server generates one randomly.
This fallback is provided for compliance with SWORDv2_ clients, but we do not
recommend relying on it, as it usually creates origin URLs that are not meaningful.
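A rough sketch of this fallback is given below. The function name is hypothetical, and the use of a UUID as the randomly generated slug is an assumption made for illustration; this document does not specify how the server generates it.

```python
import uuid
from typing import Optional


def fallback_origin_url(provider_url: str, slug: Optional[str] = None) -> str:
    """Build an origin URL when no create_origin tag was provided.

    Hypothetical helper: concatenates provider_url and the Slug header.
    Assumption: a missing Slug is replaced by a random UUID.
    """
    if slug is None:
        slug = str(uuid.uuid4())
    return provider_url + slug
```

The random case shows why the fallback is discouraged: the resulting URL says nothing about the software project it identifies.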
.. _deposit-add_to_origin:
Adding releases to an origin, with the ``<swhdeposit:add_to_origin>`` tag
-------------------------------------------------------------------------
When depositing a source code artifact for an origin (i.e. a software project) that
was already deposited before, clients should not use ``<swhdeposit:create_origin>``,
as the origin was already created by the original deposit;
``<swhdeposit:add_to_origin>`` should be used instead.
It is used very similarly to ``<swhdeposit:create_origin>``:
.. code:: xml
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<!-- ... -->
<swh:deposit>
<swh:add_to_origin>
<swh:origin url="https://example.org/~user/repo" />
</swh:add_to_origin>
</swh:deposit>
<!-- ... -->
</atom:entry>
This will create a new :term:`revision` object in the Software Heritage archive,
with the last deposit on this origin as its parent revision,
and reference it from the origin.
If the origin does not exist, the request fails with an error.
Metadata
--------
Format
^^^^^^
While the SWORDv2 specification recommends the use of DublinCore_,
we prefer the CodeMeta_ vocabulary, as we already use it in other components
of Software Heritage.
While CodeMeta is designed for use in JSON-LD, it is easy to reuse its vocabulary
and embed it in an XML document, in five steps:
1. use the `JSON-LD compact representation`_ of the CodeMeta document with
``@context: "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"`` and no other context;
which implies that:
1. Codemeta properties (whether in the ``https://codemeta.github.io/terms/``
or ``http://schema.org/`` namespaces) are unprefixed terms
2. other properties in the ``http://schema.org/`` namespace use `compact IRIs`_
with the ``schema`` prefix
3. other properties are absolute
2. replace ``@context`` declarations with an XMLNS declaration with
``https://doi.org/10.5063/SCHEMA/CODEMETA-2.0`` as namespace
(eg. ``xmlns="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"``
or ``xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"``)
3. if using a non-default namespace, apply its prefix to any unprefixed term
(ie. any term defined in https://doi.org/10.5063/SCHEMA/CODEMETA-2.0 )
4. add XMLNS declarations for any other prefix (eg. ``xmlns:schema="http://schema.org/"``
if any property in that namespace is used)
5. unfold JSON lists to sibling XML subtrees
.. _JSON-LD compact representation: https://www.w3.org/TR/json-ld11/#compacted-document-form
.. _compact IRIs: https://www.w3.org/TR/json-ld11/#compact-iris
Example Codemeta document
"""""""""""""""""""""""""
.. code:: json
{
"@context": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
"name": "My Software",
"author": [
{
"name": "Author 1",
"email": "foo@example.org"
},
{
"name": "Author 2"
}
]
}
becomes this XML document:
.. code:: xml
<?xml version="1.0"?>
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<name>My Software</name>
<author>
<name>Author 1</name>
<email>foo@example.org</email>
</author>
<author>
<name>Author 2</name>
</author>
</atom:entry>
Or, equivalently:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<codemeta:name>My Software</codemeta:name>
<codemeta:author>
<codemeta:name>Author 1</codemeta:name>
<codemeta:email>foo@example.org</codemeta:email>
</codemeta:author>
<codemeta:author>
<codemeta:name>Author 2</codemeta:name>
</codemeta:author>
</entry>
Note that in both these examples, ``codemeta:name`` is used even though
the property is actually ``http://schema.org/name``.
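The conversion steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used by swh-deposit: the function names are hypothetical, it always uses the non-default ``codemeta`` prefix, it supports only unprefixed terms and ``http://schema.org/`` properties, and it rejects JSON-LD keywords other than ``@context``.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
CODEMETA_NS = "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
SCHEMA_NS = "http://schema.org/"

ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("codemeta", CODEMETA_NS)
ET.register_namespace("schema", SCHEMA_NS)


def qualified_name(term: str) -> str:
    """Map a JSON-LD term to an ElementTree '{namespace}local' tag."""
    if term.startswith("@"):
        raise ValueError(f"JSON-LD keyword not handled by this sketch: {term}")
    if term.startswith("schema:"):
        return f"{{{SCHEMA_NS}}}{term[len('schema:'):]}"
    if term.startswith(SCHEMA_NS):
        return f"{{{SCHEMA_NS}}}{term[len(SCHEMA_NS):]}"
    if ":" in term:
        raise ValueError(f"term not handled by this sketch: {term}")
    # Unprefixed terms are the ones defined by the CodeMeta context.
    return f"{{{CODEMETA_NS}}}{term}"


def append_property(parent: ET.Element, term: str, value) -> None:
    """Append one property, unfolding JSON lists to sibling subtrees."""
    if isinstance(value, list):
        for item in value:
            append_property(parent, term, item)
        return
    child = ET.SubElement(parent, qualified_name(term))
    if isinstance(value, dict):
        for sub_term, sub_value in value.items():
            append_property(child, sub_term, sub_value)
    else:
        child.text = str(value)


def codemeta_to_atom(document: dict) -> ET.Element:
    """Embed a compacted CodeMeta document in an Atom entry element."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    for term, value in document.items():
        if term == "@context":
            continue  # replaced by the XMLNS declarations
        append_property(entry, term, value)
    return entry
```

Calling ``ET.tostring(codemeta_to_atom(document))`` on the JSON document above yields an entry with one ``codemeta:name`` element and two sibling ``codemeta:author`` subtrees.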
Example generic JSON-LD document
""""""""""""""""""""""""""""""""
Another example using properties not part of Codemeta:
.. code:: json
{
"@context": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
"name": "My Software",
"schema:sameAs": "http://example.org/my-software"
}
which is equivalent to:
.. code:: json
{
"@context": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
"name": "My Software",
"http://schema.org/sameAs": "http://example.org/my-software"
}
becomes this XML document:
.. code:: xml
<?xml version="1.0"?>
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:schema="http://schema.org/">
<name>My Software</name>
<schema:sameAs>http://example.org/my-software</schema:sameAs>
</atom:entry>
Or, equivalently:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:schema="http://schema.org/">
<codemeta:name>My Software</codemeta:name>
<schema:sameAs>http://example.org/my-software</schema:sameAs>
</entry>
.. _mandatory-attributes:
Mandatory attributes
^^^^^^^^^^^^^^^^^^^^
All deposits must include:
* an ``<atom:author>`` tag with an ``<atom:name>`` and ``<atom:email>``, and
* either ``<atom:name>`` or ``<atom:title>``
We also highly recommend their CodeMeta equivalent, and any other relevant
metadata, but this is not enforced.
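A client may want to check these requirements before submitting. The sketch below is one possible pre-flight check based only on the list above; the function name is hypothetical and the server's own validation may differ.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def missing_mandatory_attributes(entry_xml: str) -> list:
    """Return the mandatory elements missing from an Atom entry.

    Hypothetical client-side check; an empty list means the entry has
    an author with a name and an email, and a name or a title.
    """
    entry = ET.fromstring(entry_xml)
    missing = []
    author = entry.find(f"{ATOM}author")
    if author is None:
        missing.append("atom:author")
    else:
        for field in ("name", "email"):
            element = author.find(f"{ATOM}{field}")
            if element is None or not (element.text or "").strip():
                missing.append(f"atom:author/atom:{field}")
    if entry.find(f"{ATOM}name") is None and entry.find(f"{ATOM}title") is None:
        missing.append("atom:name or atom:title")
    return missing
```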
.. _metadata-only-deposit:
Metadata-only deposit
---------------------
The swh-deposit server can also be used without a source code artifact, only
to provide metadata that describes an arbitrary origin or object in
Software Heritage; this is known as extrinsic metadata.
Unlike regular deposits, there are no restrictions on URL prefixes,
so any client can provide metadata on any origin; and there are no restrictions on which
objects can be described.
This is done by simply omitting the binary file deposit request of
a regular SWORDv2 deposit, and including information on which object the metadata
describes, by adding a ``<swhdeposit:reference>`` tag in the Atom document.
To describe an origin:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<!-- ... -->
<swh:deposit>
<swh:reference>
<swh:origin url='https://example.org/~user/repo'/>
</swh:reference>
</swh:deposit>
<!-- ... -->
</entry>
And to describe an object:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<!-- ... -->
<swh:deposit>
<swh:reference>
<swh:object swhid="swh:1:dir:31b5c8cc985d190b5a7ef4878128ebfdc2358f49" />
</swh:reference>
</swh:deposit>
<!-- ... -->
</entry>
For details on the semantics, see the
:ref:`metadata deposit specification <spec-metadata-deposit>`.
.. _deposit-metadata-provenance:
Metadata provenance
-------------------
To indicate where the metadata is coming from, deposit clients can use a
``<swhdeposit:metadata-provenance>`` element in ``<swhdeposit:deposit>`` whose content is
the object the metadata is coming from,
preferably using the ``http://schema.org/`` namespace.
For example, when the metadata is coming from Wikidata, then the
``<swhdeposit:metadata-provenance>`` should be the page of a Q-entity, such as
``https://www.wikidata.org/wiki/Q16988498`` (not the Q-entity
``http://www.wikidata.org/entity/Q16988498`` itself, as the Q-entity **is** the
object described in the metadata).
Or when the metadata is coming from a curated repository like HAL, then
``<swhdeposit:metadata-provenance>`` should be the HAL project.
In particular, Software Heritage expects the ``<swhdeposit:metadata-provenance>`` object
to have a ``http://schema.org/url`` property, so that it can appropriately link
to the original page.
For example, to deposit metadata on GNU Hello:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit"
xmlns:schema="http://schema.org/">
<!-- ... -->
<swh:deposit>
<swh:metadata-provenance>
<schema:url>https://www.wikidata.org/wiki/Q16988498</schema:url>
</swh:metadata-provenance>
</swh:deposit>
<!-- ... -->
</entry>
Here is a more complete example of a metadata-only deposit on version 2.9 of GNU Hello,
showing the interaction with other fields:
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit"
xmlns:schema="http://schema.org/"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<swh:deposit>
<swh:reference>
<swh:object swhid="swh:1:dir:9b6f93b12a500f560796c8dffa383c7f4470a12f;origin=https://ftp.gnu.org/gnu/hello/;visit=swh:1:snp:1abd6aa1901ba0aa7f5b7db059250230957f8434;anchor=swh:1:rev:3d41fbdb693ba46fdebe098782be4867038503e2" />
</swh:reference>
<swh:metadata-provenance>
<schema:url>https://www.wikidata.org/wiki/Q16988498</schema:url>
</swh:metadata-provenance>
</swh:deposit>
<codemeta:name>GNU Hello</codemeta:name>
<codemeta:id>http://www.wikidata.org/entity/Q16988498</codemeta:id>
<codemeta:url>https://www.gnu.org/software/hello/</codemeta:url>
<!-- is part of the GNU project -->
<codemeta:isPartOf>http://www.wikidata.org/entity/Q7598</codemeta:isPartOf>
</entry>
Schema
------
Here is an XML schema to summarize the syntax described in this document:
https://gitlab.softwareheritage.org/swh/devel/swh-deposit/-/blob/master/swh/deposit/xsd/swh.xsd
.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html
.. _AtomPub: https://tools.ietf.org/html/rfc5023
.. _DublinCore: https://www.dublincore.org/
.. _CodeMeta: https://codemeta.github.io/
The metadata-deposit .. _spec-metadata-deposit:
====================
The metadata-only deposit
=========================
Goal Goal
---- ----
A client wishes to deposit only metadata about an object in the Software
Heritage archive.
The metadata-deposit is a special deposit where no content is A client may wish to deposit only metadata about an origin or object already
present in the Software Heritage archive.
The metadata-only deposit is a special deposit where no content is
provided and the data transferred to Software Heritage is only provided and the data transferred to Software Heritage is only
the metadata about an object or several objects in the archive. the metadata about an object in the archive.
Requirements Requirements
------------ ------------
The scope of the meta-deposit is different than the
sparse-deposit. While a sparse-deposit creates a revision with referenced
directories and content files, the metadata-deposit references one of the
following:
- origin
- snapshot
- revision
- release
1. Create a metadata-only deposit through a :ref:`POST request<API-create-deposit>`
2. It is composed of ONLY one Atom XML document
3. It MUST comply with :ref:`the metadata requirements<metadata-requirements>`
4. It MUST reference an **object** or an **origin** in a deposit tag
5. The reference SHOULD exist in the SWH archive
6. The **object** reference MUST be a SWHID on one of the following artifact types:
- origin
- snapshot
- release
- revision
- directory
- content
7. The SWHID MAY be a :ref:`core identifier <swhids-core>` with or without :ref:`qualifiers <swhids-qualifiers>`
8. The SWHID MUST NOT reference a fragment of code with the qualifier ``lines``
A complete metadata example A complete metadata example
--------------------------- ---------------------------
The reference element is included in the metadata xml atomEntry under the The reference element is included in the metadata xml atomEntry under the
swh namespace: swh namespace:
TODO: publish schema at https://www.softwareheritage.org/schema/2018/deposit
.. code:: xml .. code:: xml
<?xml version="1.0"?> <?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom" <entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0" xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit"> xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<author> <author>
<name>HAL</name> <name>HAL</name>
<email>hal@ccsd.cnrs.fr</email> <email>hal@ccsd.cnrs.fr</email>
</author> </author>
<client>hal</client> <codemeta:name>The assignment problem</codemeta:name>
<external_identifier>hal-01243573</external_identifier> <codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url>
<codemeta:name>The assignment problem</codemeta:name> <codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url> <codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier> <codemeta:description>description</codemeta:description>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory> <codemeta:author>
<codemeta:description>description</codemeta:description> <codemeta:name>Author1</codemeta:name>
<codemeta:author> <codemeta:affiliation>Inria</codemeta:affiliation>
<codemeta:name> author1 </codemeta:name> <codemeta:affiliation>UPMC</codemeta:affiliation>
<codemeta:affiliation> Inria </codemeta:affiliation> </codemeta:author>
<codemeta:affiliation> UPMC </codemeta:affiliation> <codemeta:author>
</codemeta:author> <codemeta:name>Author2</codemeta:name>
<codemeta:author> <codemeta:affiliation>Inria</codemeta:affiliation>
<codemeta:name> author2 </codemeta:name> <codemeta:affiliation>UPMC</codemeta:affiliation>
<codemeta:affiliation> Inria </codemeta:affiliation> </codemeta:author>
<codemeta:affiliation> UPMC </codemeta:affiliation> <swh:deposit>
</codemeta:author> <swh:reference>
<swh:deposit> <swh:origin url='https://github.com/user/repo'/>
<swh:reference> </swh:reference>
<swh:origin url='https://github.com/user/repo'/> </swh:deposit>
</swh:reference> </entry>
</swh:deposit>
</entry> References
----------
Examples by target type
^^^^^^^^^^^^^^^^^^^^^^^ The metadata reference can be either on:
Reference an origin: - an origin
- a graph object (core SWHID with or without qualifiers)
Origins
^^^^^^^
The metadata may be on an origin, identified by the origin's URL:
.. code:: xml .. code:: xml
<swh:deposit> <swh:deposit>
<swh:reference> <swh:reference>
<swh:origin url="https://github.com/user/repo"/> <swh:origin url="https://github.com/user/repo" />
</swh:reference> </swh:reference>
</swh:deposit> </swh:deposit>
Graph objects
^^^^^^^^^^^^^
Reference a snapshot, revision or release: It may also reference an object in the :ref:`SWH graph <data-model>`: contents,
directories, revisions, releases, and snapshots:
.. code:: xml .. code:: xml
With ${type} in {snp (snapshot), rev (revision), rel (release) }:
<swh:deposit> <swh:deposit>
<swh:reference> <swh:reference>
<swh:object id="swh:1:${type}:aaaaaaaaaaaaaa..."/> <swh:object swhid="swh:1:dir:31b5c8cc985d190b5a7ef4878128ebfdc2358f49" />
</swh:reference> </swh:reference>
</swh:deposit> </swh:deposit>
.. code:: xml
<swh:deposit>
<swh:reference>
<swh:object swhid="swh:1:dir:31b5c8cc985d190b5a7ef4878128ebfdc2358f49;origin=https://hal.archives-ouvertes.fr/hal-01243573;visit=swh:1:snp:4fc1e36fca86b2070204bedd51106014a614f321;anchor=swh:1:rev:9c5de20cfb54682370a398fcc733e829903c8cba;path=/moranegg-AffectationRO-df7f68b/" />
</swh:reference>
</swh:deposit>
The value of the ``swhid`` attribute must be a :ref:`SWHID <persistent-identifiers>`,
with any context qualifiers in this list:
* ``origin``
* ``visit``
* ``anchor``
* ``path``
and they should be provided whenever relevant, especially ``origin``.
Other qualifiers are not allowed (for example, ``lines`` is not, because SWH
cannot store metadata at a finer level than entire contents).
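The qualifier rule can be sketched as a small check. The function name is hypothetical, the core identifier itself is not validated, and the sketch assumes that any ``;`` inside a qualifier value is percent-encoded, so that splitting on ``;`` is safe.

```python
ALLOWED_QUALIFIERS = frozenset({"origin", "visit", "anchor", "path"})


def parse_reference_qualifiers(swhid: str) -> dict:
    """Return the qualifiers of a SWHID, rejecting disallowed ones.

    Hypothetical helper: only the qualifier names are checked, not the
    core identifier or the qualifier values.
    """
    _core, *raw_qualifiers = swhid.split(";")
    qualifiers = {}
    for raw in raw_qualifiers:
        key, separator, value = raw.partition("=")
        if not separator or key not in ALLOWED_QUALIFIERS:
            raise ValueError(f"qualifier not allowed in a metadata reference: {raw!r}")
        qualifiers[key] = value
    return qualifiers
```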
Loading procedure Loading procedure
------------------ -----------------
In this case, the metadata-deposit will be injected as a metadata entry at the In this case, the metadata-deposit will be injected as a metadata entry of
appropriate level (origin_metadata, revision_metadata, etc.) with the information the relevant object, with the information about the contributor of the deposit.
about the contributor of the deposit. Contrary to the complete and sparse
deposit, there will be no object creation.
The sparse-deposit
==================
Goal
----
A client wishes to transfer a tarball for which part of the content is
already in the SWH archive.
Requirements
------------
To do so, a list of paths with targets must be provided in the metadata and
the paths to the missing directories/content should not be included
in the tarball. The list will be referred to
as the manifest list using the entry name 'bindings' in the metadata.
+----------------------+-------------------------------------+
| path | swh-id |
+======================+=====================================+
| path/to/file.txt | swh:1:cnt:aaaaaaaaaaaaaaaaaaaaa... |
+----------------------+-------------------------------------+
| path/to/dir/ | swh:1:dir:aaaaaaaaaaaaaaaaaaaaa... |
+----------------------+-------------------------------------+
Note: the *name* of the file or the directory is given by the path and is not
part of the identified object.
TODO: see if a trailing "/" is mandatory for implementation.
A concrete example
------------------
The manifest list is included in the metadata xml atomEntry under the
swh namespace:
TODO: publish schema at https://www.softwareheritage.org/schema/2018/deposit
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<author>
<name>HAL</name>
<email>hal@ccsd.cnrs.fr</email>
</author>
<client>hal</client>
<external_identifier>hal-01243573</external_identifier>
<codemeta:name>The assignment problem</codemeta:name>
<codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:description>description</codemeta:description>
<codemeta:author>
<codemeta:name> author1 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:author>
<codemeta:name> author2 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<swh:deposit>
<swh:bindings>
<swh:binding source="path/to/file.txt" destination="swh:1:cnt:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"/>
<swh:binding source="path/to/second_file.txt" destination="swh:1:cnt:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"/>
<swh:binding source="path/to/dir/" destination="swh:1:dir:ddddddddddddddddddddddddddddddddd"/>
</swh:bindings>
</swh:deposit>
</entry>
Deposit verification
--------------------
After checking the integrity of the deposit content and
metadata, the following checks should be added:
1. validate the manifest list structure with a correct swh-id for each path (syntax check on the swh-id format)
2. verify that the path name corresponds to the object type
3. locate the identifiers in the SWH archive
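The first check (syntax of each swh-id in the manifest list) could look like the sketch below. It is only an illustration: the function name is hypothetical, and it assumes that destinations are content or directory identifiers with a 40-character lowercase hexadecimal hash. The second and third checks are left out, since they depend on the open question about trailing slashes and on access to the archive.

```python
import re

# Assumption: destinations are core identifiers for contents (cnt) or
# directories (dir), with a 40-character lowercase hexadecimal hash.
SWHID_RE = re.compile(r"swh:1:(cnt|dir):[0-9a-f]{40}")


def check_bindings(bindings: dict) -> list:
    """Return one error message per malformed manifest entry.

    Hypothetical helper: bindings maps each path to its swh-id.
    """
    errors = []
    for path, swhid in bindings.items():
        if not path:
            errors.append("empty path")
        elif SWHID_RE.fullmatch(swhid) is None:
            errors.append(f"{path}: malformed swh-id {swhid!r}")
    return errors
```

Returning a list of messages, rather than stopping at the first problem, makes it possible to report every malformed entry along with the rejected deposit.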
Each failing check should return a different error with the deposit
and result in a 'rejected' deposit.
Loading procedure
------------------
The injection procedure should include:
- load the new data from the tarball
- create new objects using the path name and create links from the path to the
SWH object using the identifier
- calculate identifier of the new objects at each level
- return final swh-id of the new revision
Invariant: the same content should yield the same swh-id. This is why a complete
deposit with all the content and a sparse-deposit with the correct links will result
in the same root directory swh-id.
The same is expected of the revision swh-id if the metadata provided is
identical.
<?xml version="1.0" encoding="iso-8859-1"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="deposit">
<xsd:complexType>
<xsd:choice>
<xsd:element name="reference">
<xsd:complexType>
<xsd:choice>
<xsd:element name="object">
<xsd:complexType>
<xsd:attribute type="xsd:string" name="id"/>
</xsd:complexType>
</xsd:element>
<xsd:element name="origin">
<xsd:complexType>
<xsd:attribute type="xsd:string" name="url"/>
</xsd:complexType>
</xsd:element>
</xsd:choice>
</xsd:complexType>
</xsd:element>
<xsd:element name="bindings">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="binding" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType>
<xsd:attribute type="xsd:string" name="source"/>
<xsd:attribute type="xsd:string" name="destination"/>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:choice>
</xsd:complexType>
</xsd:element>
</xsd:schema>
Deployment of the swh-deposit
=============================
As usual, the Debian package is created and uploaded to the swh Debian
repository. Once the package is installed, we need to do a few things
with regard to the database.
Prepare the database setup (existence, connection, etc...).
-----------------------------------------------------------
This is defined through the packaged ``swh.deposit.settings.production``
module and the expected **/etc/softwareheritage/deposit/server.yml**.
As usual, the expected configuration files are deployed through our
puppet manifest (cf. puppet-environment/swh-site,
puppet-environment/swh-role, puppet-environment/swh-profile)
Migrate/bootstrap the db schema
-------------------------------
.. code:: shell
sudo django-admin migrate --settings=swh.deposit.settings.production
Load minimum defaults data
--------------------------
.. code:: shell
sudo django-admin loaddata \
--settings=swh.deposit.settings.production deposit_data
This adds the minimal data:
- deposit request type 'archive' and 'metadata'
- 'hal' collection
Note: swh.deposit.fixtures.deposit\_data is packaged
Add client and collection
-------------------------
.. code:: shell
swh deposit admin \
--config-file /etc/softwareheritage/deposit/server.yml \
--platform production \
user create \
--collection <collection-name> \
--username <client-name> \
--password <to-define>
This adds a user ``<client-name>`` which can access the collection
``<collection-name>``. The password will be used to authenticate
against the deposit API.
Note:
- If the collection does not exist, it is created alongside the user
- The password is provided in plain text but stored encrypted (so yes, for now
we know the user's password)
- For the production platform, you must either set the
``SWH_CONFIG_FILENAME`` environment variable or pass the
``--config-file`` parameter
Reschedule a deposit
---------------------
.. code:: shell
swh deposit admin \
--config-file /etc/softwareheritage/deposit/server.yml \
--platform production \
deposit reschedule \
--deposit-id <deposit-id>
This will:
- check that the deposit's status is something reasonable (failed or done). That
means that the checks passed but something went wrong during the
loading (failed: the loading failed; done: the loading succeeded but, for some
reason such as a bug, we need to reschedule it)
- reset the deposit's status to 'verified' (prior to any loading but after the
checks, which passed) and remove the different archive identifiers
(swh-id, ...)
- trigger the loading task again through the scheduler