import os
import django
os.environ.setdefault("DJANGO_SETTINGS_MODULE",
"swh.deposit.settings.development")
django.setup()
from swh.docs.sphinx.conf import * # NoQA
Hacking on swh-deposit
======================
There are multiple modes to run and test the server locally:
* development-like (automatic reloading when code changes)
* production-like (no reloading)
* integration tests (no side effects)
Apart from the tests, which are mostly free of side effects (database
access aside), these modes need up to two configuration files
to run properly.
Database
--------
swh-deposit uses a database to store the state of a deposit. The default
db is expected to be called swh-deposit-dev.
To simplify setup, the following Makefile targets can be used:
schema
~~~~~~
.. code:: shell
make db-create db-prepare db-migrate
data
~~~~
Once the db is created, you need some data to be injected (request
types, client, collection, etc...):
.. code:: shell
make db-load-data db-load-private-data
The private data are about having a user (``hal``) with a password
(``hal``) who can access a collection (``hal``).
Add the following to ``../private-data.yaml``:
.. code:: yaml
- model: deposit.depositclient
fields:
user_ptr_id: 1
collections:
- 1
- model: auth.User
pk: 1
fields:
first_name: hal
last_name: hal
username: hal
password: "pbkdf2_sha256$30000$8lxjoGc9PiBm$DO22vPUJCTM17zYogBgBg5zr/97lH4pw10Mqwh85yUM="
- model: deposit.depositclient
fields:
user_ptr_id: 1
collections:
- 1
url: https://hal.inria.fr
drop
~~~~
For information, you can drop the db:
.. code:: shell
make db-drop
Development-like environment
----------------------------
The development-like environment needs one configuration file to work
properly.
Configuration
~~~~~~~~~~~~~
**``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/server.yml**:
.. code:: yaml
# dev option for running the server locally
host: 127.0.0.1
port: 5006
# production
authentication:
activated: true
white-list:
GET:
- /
# 20 MiB max size
max_upload_size: 20971520
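The path pattern above names three candidate directories for the configuration file. As a minimal sketch (not the actual swh-deposit loader), the lookup could be written as below. The helper name ``find_config`` and the lookup order are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate configuration directories, taken from the path pattern above.
# The order in which they are tried is an assumption of this sketch.
CONFIG_DIRS = ("/etc/softwareheritage", "~/.config/swh", "~/.swh")


def find_config(relative_path: str = "deposit/server.yml",
                config_dirs: Iterable[str] = CONFIG_DIRS) -> Optional[Path]:
    """Return the first existing configuration file, or None if none exists."""
    for directory in config_dirs:
        candidate = Path(directory).expanduser() / relative_path
        if candidate.is_file():
            return candidate
    return None
```

Calling ``find_config()`` with no arguments looks for ``deposit/server.yml``. Passing ``"deposit/private.yml"`` would look for the private configuration file in the same directories.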
Run
~~~
Run the local server, using the default configuration file:
.. code:: shell
make run-dev
Production-like environment
---------------------------
The production-like environment needs an additional section in the
configuration file to work properly.
This is closer to what actually runs in production.
Configuration
~~~~~~~~~~~~~
This expects the same file described in the previous section, plus an
additional file holding private information that is
not in the source code repository.
**``{/etc/softwareheritage | ~/.config/swh | ~/.swh}``/deposit/private.yml**:
.. code:: yaml
private:
secret_key: production-local
db:
name: swh-deposit-dev
A production configuration file would look like:
.. code:: yaml
private:
secret_key: production-secret-key
db:
name: swh-deposit-dev
host: db
port: 5467
user: user
password: user-password
Run
~~~
.. code:: shell
make run
Note: This expects the gunicorn3 package to be installed on the system.
Tests
-----
To run the tests:
.. code:: shell
make test
As explained above, the tests are mostly side-effect free. Django handles
the database part. The remaining side effects are patched out
in the ``swh/deposit/tests/__init__.py`` module.
Sum up
------
Prepare everything for your user to run:
.. code:: shell
make db-drop db-create db-prepare db-migrate db-load-private-data run-dev
Create deposit
^^^^^^^^^^^^^^^
.. http:post:: /1/<collection-name>/
Create deposit in a collection.
The client sends a deposit request to a specific collection with:
* an archive holding the software source code (binary upload)
* an envelope with metadata describing the deposit (atom
entry deposit)
Also known as: COL-IRI
:param text <name><pass>: the client's credentials
:param text Content-Type: accepted mimetype
:param int Content-Length: tarball size
:param text Content-MD5: md5 checksum hex encoded of the tarball
:param text Content-Disposition: attachment; filename=[filename]; the filename
parameter must be text (ascii)
:param text Content-Disposition: for the metadata file set name parameter
to 'atom'.
:param bool In-Progress: true if not final; false when final request.
:statuscode 201: success for deposit on POST
:statuscode 401: Unauthorized
:statuscode 404: access to an unknown collection
:statuscode 415: unsupported media type
Sample request
~~~~~~~~~~~~~~~
.. code:: shell
curl -i -u hal:<pass> \
-F "file=@../deposit.json;type=application/zip;filename=payload" \
-F "atom=@../atom-entry.xml;type=application/atom+xml;charset=UTF-8" \
-H 'In-Progress: false' \
-H 'Slug: some-external-id' \
-XPOST https://deposit.softwareheritage.org/1/hal/
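The headers in the parameter list above can be derived from the archive itself. The following is a minimal Python sketch, not code from swh-deposit. The helper name ``archive_headers`` is hypothetical, while the header names and the hex-encoded MD5 come from the parameter list.

```python
import hashlib
from pathlib import Path


def archive_headers(archive: Path, content_type: str = "application/zip",
                    in_progress: bool = False) -> dict:
    """Build the HTTP headers for a binary archive upload."""
    # The filename parameter must be ASCII; this raises UnicodeEncodeError otherwise.
    archive.name.encode("ascii")
    payload = archive.read_bytes()
    return {
        "Content-Type": content_type,
        "Content-Length": str(len(payload)),
        # The parameter list asks for a hex-encoded MD5 checksum.
        "Content-MD5": hashlib.md5(payload).hexdigest(),
        "Content-Disposition": f"attachment; filename={archive.name}",
        "In-Progress": "true" if in_progress else "false",
    }
```

The resulting dictionary can be handed to any HTTP client. The sketch reads the whole archive into memory, which is acceptable for small uploads only.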
Sample response
~~~~~~~~~~~~~~~
.. code:: shell
HTTP/1.0 201 Created
Date: Tue, 26 Sep 2017 10:32:35 GMT
Server: WSGIServer/0.2 CPython/3.5.3
Vary: Accept, Cookie
Allow: GET, POST, PUT, DELETE, HEAD, OPTIONS
Location: /1/hal/10/metadata/
X-Frame-Options: SAMEORIGIN
Content-Type: application/xml
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>10</deposit_id>
<deposit_date>Sept. 26, 2017, 10:32 a.m.</deposit_date>
<deposit_archive>None</deposit_archive>
<deposit_status>deposited</deposit_status>
<!-- Edit-IRI -->
<link rel="edit" href="/1/hal/10/metadata/" />
<!-- EM-IRI -->
<link rel="edit-media" href="/1/hal/10/media/"/>
<!-- SE-IRI -->
<link rel="http://purl.org/net/sword/terms/add" href="/1/hal/10/metadata/" />
<!-- State-IRI -->
<link rel="alternate" href="/1/hal/10/status/"/>
<sword:packaging>http://purl.org/net/sword/package/SimpleZip</sword:packaging>
</entry>
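A client usually needs the deposit id, the status and the IRIs from such a receipt. The sketch below shows one way to extract them with the Python standard library. The function name ``parse_receipt`` is hypothetical, and the embedded XML is a shortened copy of the sample response.

```python
import xml.etree.ElementTree as ET

# Unprefixed elements of the receipt inherit the default Atom namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

RECEIPT = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/"
       xmlns:dcterms="http://purl.org/dc/terms/">
  <deposit_id>10</deposit_id>
  <deposit_status>deposited</deposit_status>
  <link rel="edit" href="/1/hal/10/metadata/" />
  <link rel="edit-media" href="/1/hal/10/media/"/>
</entry>"""


def parse_receipt(xml_text: str) -> dict:
    """Extract the deposit id, status and links (keyed by rel) from a receipt."""
    entry = ET.fromstring(xml_text)
    links = {link.get("rel"): link.get("href")
             for link in entry.iter(f"{ATOM}link")}
    return {
        "deposit_id": entry.findtext(f"{ATOM}deposit_id"),
        "deposit_status": entry.findtext(f"{ATOM}deposit_status"),
        "links": links,
    }
```

For instance, ``parse_receipt(RECEIPT)["links"]["edit-media"]`` gives the EM-IRI to which further archives can be sent while the deposit is partial.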
Display content
^^^^^^^^^^^^^^^^
.. http:get:: /1/<collection-name>/<deposit-id>/content/
Display information on the content's representation in the sword
server.
Also known as: CONT-FILE-IRI
:param text <name><pass>: the client's credentials
:statuscode 200: no error
:statuscode 401: Unauthorized
Service document
^^^^^^^^^^^^^^^^^
.. http:get:: /1/servicedocument/
This is the starting endpoint for the client to discover its initial
collection. The answer to this query describes:
* the server's abilities
* connected client's collection information
Also known as: SD-IRI - The Service Document IRI
:param text <name><pass>: the client's credentials
:statuscode 200: no error
:statuscode 401: Unauthorized
Sample response
~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0" ?>
<service xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:sword="http://purl.org/net/sword/terms/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/2007/app">
<sword:version>2.0</sword:version>
<sword:maxUploadSize>20971520</sword:maxUploadSize>
<workspace>
<atom:title>The Software Heritage (SWH) archive</atom:title>
<collection href="https://deposit.softwareheritage.org/1/hal/">
<atom:title>SWH Software Archive</atom:title>
<accept>application/zip</accept>
<accept>application/x-tar</accept>
<sword:collectionPolicy>Collection Policy</sword:collectionPolicy>
<dcterms:abstract>Software Heritage Archive</dcterms:abstract>
<sword:mediation>false</sword:mediation>
<sword:metadataRelevantHeader>false</sword:metadataRelevantHeader>
<sword:treatment>Collect, Preserve, Share</sword:treatment>
<sword:acceptPackaging>http://purl.org/net/sword/package/SimpleZip</sword:acceptPackaging>
<sword:service>https://deposit.softwareheritage.org/1/hal/</sword:service>
</collection>
</workspace>
</service>
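To illustrate how a client might read this document, here is a small sketch using the Python standard library. The function name ``describe_collections`` is hypothetical, and the embedded XML is a trimmed version of the sample response.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as declared in the sample service document.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/terms/",
}

SERVICE_DOCUMENT = """<service xmlns:sword="http://purl.org/net/sword/terms/"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns="http://www.w3.org/2007/app">
  <sword:version>2.0</sword:version>
  <sword:maxUploadSize>20971520</sword:maxUploadSize>
  <workspace>
    <collection href="https://deposit.softwareheritage.org/1/hal/">
      <atom:title>SWH Software Archive</atom:title>
      <accept>application/zip</accept>
      <accept>application/x-tar</accept>
    </collection>
  </workspace>
</service>"""


def describe_collections(xml_text: str):
    """Return (max upload size, list of collection descriptions)."""
    service = ET.fromstring(xml_text)
    max_size = int(service.findtext("sword:maxUploadSize", namespaces=NS))
    collections = []
    for col in service.iterfind("app:workspace/app:collection", NS):
        collections.append({
            "href": col.get("href"),
            "title": col.findtext("atom:title", namespaces=NS),
            "accept": [a.text for a in col.findall("app:accept", NS)],
        })
    return max_size, collections
```

The ``href`` of a collection is the COL-IRI to which deposits are posted, and ``accept`` lists the archive media types the collection takes.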
Retrieve status
^^^^^^^^^^^^^^^^
.. http:get:: /1/<collection-name>/<deposit-id>/
Returns deposit's status.
The different statuses:
- **partial**: multipart deposit is still ongoing
- **deposited**: deposit completed, ready for checks
- **rejected**: deposit failed the checks
- **verified**: content and metadata verified, ready for loading
- **loading**: loading in-progress
- **done**: loading completed successfully
- **failed**: the deposit loading has failed
Also known as STATE-IRI
:param text <name><pass>: the client's credentials
:statuscode 200: with the deposit's status
:statuscode 401: Unauthorized
:statuscode 404: access to an unknown deposit
Rejected deposit
~~~~~~~~~~~~~~~~
A deposit can be rejected. In that case, the
`deposit_status_detail` entry explains which checks failed.
Some possible reasons:
- Deposit without software archive (main goal of the deposit is to
deposit software source code)
- Deposit with malformed software archive (e.g. an archive nested within an archive)
- Deposit with invalid software archive (e.g. a corrupted archive, although
this should be caught during upload rather than during checks)
- Deposit with unsupported archive format
- Deposit with missing metadata
Sample response
~~~~~~~~~~~~~~~
Successful deposit:
.. code:: xml
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>160</deposit_id>
<deposit_status>done</deposit_status>
<deposit_status_detail>The deposit has been successfully loaded into the Software Heritage archive</deposit_status_detail>
<deposit_swh_id>swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9</deposit_swh_id>
<deposit_swh_id_context>swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;origin=https://forge.softwareheritage.org/source/jesuisgpl/</deposit_swh_id_context>
<deposit_swh_anchor_id>swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb</deposit_swh_anchor_id>
<deposit_swh_anchor_id_context>swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;origin=https://forge.softwareheritage.org/source/jesuisgpl/</deposit_swh_anchor_id_context>
</entry>
Rejected deposit:
.. code:: xml
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>148</deposit_id>
<deposit_status>rejected</deposit_status>
<deposit_status_detail>- At least one url field must be compatible with the client&#39;s domain name (codemeta:url)</deposit_status_detail>
</entry>
Update content
^^^^^^^^^^^^^^^
.. http:post:: /1/<collection-name>/<deposit-id>/media/
Add archive(s) to a deposit. Only possible if the deposit's status
is partial.
.. http:put:: /1/<collection-name>/<deposit-id>/media/
Replace all content by submitting a new archive. Only possible if
the deposit's status is partial.
Also known as: *update iri* (EM-IRI)
:param text <name><pass>: the client's credentials
:param text Content-Type: accepted mimetype
:param int Content-Length: tarball size
:param text Content-MD5: md5 checksum hex encoded of the tarball
:param text Content-Disposition: attachment; filename=[filename] ; the filename
parameter must be text (ascii)
:param bool In-Progress: true if not final; false when final request.
:statuscode 204: success without payload on PUT
:statuscode 201: success for deposit on POST
:statuscode 401: Unauthorized
:statuscode 415: unsupported media type
Update metadata
^^^^^^^^^^^^^^^^
.. http:post:: /1/<collection-name>/<deposit-id>/metadata/
Add metadata to a deposit. Only possible if the deposit's status
is partial.
.. http:put:: /1/<collection-name>/<deposit-id>/metadata/
Replace all metadata by submitting a new metadata file. Only possible if
the deposit's status is partial.
Also known as: *update iri* (SE-IRI)
:param text <name><pass>: the client's credentials
:param text Content-Disposition: attachment; filename=[filename] ; the filename
parameter must be text (ascii), with a name parameter set to 'atom'.
:param bool In-Progress: true if not final; false when final request.
:statuscode 204: success without payload on PUT
:statuscode 201: success for deposit on POST
:statuscode 401: Unauthorized
:statuscode 415: unsupported media type
Getting Started
===============
This is a guide for how to prepare and push a software deposit with
the `swh deposit` commands.
The API is rooted at https://deposit.softwareheritage.org/1.
For more details, see the `main documentation <./index.html>`__.
Requirements
------------
You need to be referenced on SWH's client list to have:
* credentials (needed for the basic authentication step)
- in this document we reference ``<name>`` as the client's name and
``<pass>`` as its associated authentication password.
* an associated collection_.
.. _collection: https://bitworking.org/projects/atom/rfc5023#rfc.section.8.3.3
`Contact us for more information.
<https://www.softwareheritage.org/contact/>`__
Prepare a deposit
-----------------
* compress the files in a supported archive format:
- zip: common zip archive (no multi-disk zip files).
- tar: tar archive without compression or optionally any of the
following compression algorithms: gzip (`.tar.gz`, `.tgz`), bzip2
(`.tar.bz2`), or lzma (`.tar.lzma`)
* (Optional) prepare a metadata file (more details :ref:`deposit-metadata`):
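As an illustration of the first step, a gzip-compressed tarball (one of the supported formats listed above) can be produced with the Python standard library. This is only a sketch, and the helper name ``make_tgz`` is hypothetical. Any tool that produces a supported archive works equally well.

```python
import tarfile
from pathlib import Path


def make_tgz(source_dir: Path, archive_path: Path) -> Path:
    """Pack source_dir into a gzip-compressed tarball (.tgz or .tar.gz)."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps the paths inside the archive relative to the project
        # directory instead of embedding the absolute path.
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive_path
```

For example, ``make_tgz(Path("je-suis-gpl"), Path("je-suis-gpl.tgz"))`` would create an archive whose members all start with ``je-suis-gpl/``.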
Push deposit
------------
You can push a deposit with:
* a single deposit (archive + metadata):
The user posts in one query a software
source code archive and associated metadata.
The deposit is directly marked with status ``deposited``.
* a multi-step deposit:
1. Create an incomplete deposit (marked with status ``partial``)
2. Add data to a deposit (in multiple requests if needed)
3. Finalize deposit (the status becomes ``deposited``)
Single deposit
^^^^^^^^^^^^^^
Once the files are ready for deposit, we want to do the actual deposit
in one shot, sending exactly one POST query:
* 1 archive (content-type ``application/zip`` or ``application/x-tar``)
* 1 metadata file in atom xml format (``content-type: application/atom+xml;type=entry``)
For this, we need to provide the:
* arguments: ``--username 'name' --password 'pass'`` as credentials
* archive's path (example: ``--archive path/to/archive-name.tgz``)
* software's name (optional if a metadata filepath is specified and the
artifact's name is included in the metadata file).
* author's name (optional if a metadata filepath is specified and the authors
are included in the metadata file). This can be specified multiple times in
case of multiple authors.
* (optionally) metadata file's path ``--metadata
path/to/file.metadata.xml``.
* (optionally) ``--slug 'your-id'`` argument, a reference to a unique identifier
the client uses for the software object. If not provided, a UUID will be
generated by SWH.
You can do this with the following command:
minimal deposit
.. code:: shell
$ swh deposit upload --username name --password secret \
--author "some@noone" --author "second@noone" \
--name 'je-suis-gpl' \
--archive je-suis-gpl.tgz
with client's external identifier (``slug``)
.. code:: shell
$ swh deposit upload --username name --password secret \
--author "some@noone" \
--name 'je-suis-gpl' \
--archive je-suis-gpl.tgz \
--slug je-suis-gpl
to a specific client's collection
.. code:: shell
$ swh deposit upload --username name --password secret \
--author "some@noone" \
--name 'je-suis-gpl' \
--archive je-suis-gpl.tgz \
--collection 'second-collection'
You just posted a deposit to your collection on Software Heritage.
If everything went well, the successful response will contain the
elements below:
.. code:: shell
{
'deposit_status': 'deposited',
'deposit_id': '7',
'deposit_date': 'Jan. 29, 2018, 12:29 p.m.'
}
Note: As the deposit is in ``deposited`` status, you can no longer
update the deposit after this query. Any further update attempt will be
answered with a 403 Forbidden response.
If something went wrong, an equivalent response will be given with the
`error` and `detail` keys explaining the issue, e.g.:
.. code:: shell
{
'error': 'Unknown collection name xyz',
'detail': None,
'deposit_status': None,
'deposit_status_detail': None,
'deposit_swh_id': None,
'status': 404
}
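A script wrapping the command may want to tell the two kinds of response apart. The sketch below assumes the response is available as a Python dictionary with the keys shown in the two examples above. The function name ``summarize`` is hypothetical.

```python
def summarize(response: dict) -> str:
    """Turn a response dictionary into a one-line summary."""
    if response.get("error"):
        # Error responses carry 'error', 'detail' and the HTTP 'status'.
        detail = response.get("detail") or "no detail"
        return f"failed ({response.get('status')}): {response['error']} [{detail}]"
    return f"deposit {response['deposit_id']} is {response['deposit_status']}"
```

The presence of a non-empty ``error`` key is used as the only failure signal here, which matches the two samples but may not cover every response the client can return.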
Multi-step deposit
^^^^^^^^^^^^^^^^^^
The steps to create a multi-step deposit:
1. Create an incomplete deposit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First use the ``--partial`` argument to declare there is more to come:
.. code:: shell
$ swh deposit upload --username name --password secret \
--archive foo.tar.gz \
--partial
2. Add content or metadata to the deposit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Continue the deposit by using the ``--deposit-id`` argument given as a response
for the first step. You can continue adding content or metadata while you use
the ``--partial`` argument.
To only add one new archive to the deposit:
.. code:: shell
$ swh deposit upload --username name --password secret \
--archive add-foo.tar.gz \
--deposit-id 42 \
--partial
To only add metadata to the deposit:
.. code:: shell
$ swh deposit upload --username name --password secret \
--metadata add-foo.tar.gz.metadata.xml \
--deposit-id 42 \
--partial
or:
.. code:: shell
$ swh deposit upload --username name --password secret \
--name 'add-foo' --author 'someone' \
--deposit-id 42 \
--partial
3. Finalize deposit
~~~~~~~~~~~~~~~~~~~
On your last addition (same command as before), omit the ``--partial``
argument: the deposit will then be considered complete and its status will
change to ``deposited``.
Update deposit
----------------
* replace deposit:
- only possible if the deposit status is ``partial`` and
``--deposit-id <id>`` is provided
- by using the ``--replace`` flag
- ``--metadata-deposit`` replaces associated existing metadata
- ``--archive-deposit`` replaces associated archive(s)
- by default, with no flag or both, you'll replace associated
metadata and archive(s):
.. code:: shell
$ swh deposit upload --username name --password secret \
--deposit-id 11 \
--archive updated-je-suis-gpl.tgz \
--replace
* update a loaded deposit with a new version:
- by using the external-id with the ``--slug`` argument, you will
link the new deposit with its parent deposit:
.. code:: shell
$ swh deposit upload --username name --password secret \
--archive je-suis-gpl-v2.tgz \
--slug 'je-suis-gpl'
Check the deposit's status
--------------------------
You can check the status of the deposit by using the ``--deposit-id`` argument:
.. code:: shell
$ swh deposit status --username name --password secret \
--deposit-id 11
.. code:: python
{
'deposit_id': '11',
'deposit_status': 'deposited',
'deposit_swh_id': None,
'deposit_status_detail': 'Deposit is ready for additional checks \
(tarball ok, metadata, etc...)'
}
The different statuses:
- **partial**: multipart deposit is still ongoing
- **deposited**: deposit completed
- **rejected**: deposit failed the checks
- **verified**: content and metadata verified
- **loading**: loading in-progress
- **done**: loading completed successfully
- **failed**: the deposit loading has failed
When the deposit has been loaded into the archive, the status will be
marked ``done``. The response will then also contain the
<deposit_swh_id>, <deposit_swh_id_context>, <deposit_swh_anchor_id> and
<deposit_swh_anchor_id_context> entries. For example:
.. code:: python
{
'deposit_id': '11',
'deposit_status': 'done',
'deposit_swh_id': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9',
'deposit_swh_id_context': 'swh:1:dir:d83b7dda887dc790f7207608474650d4344b8df9;origin=https://forge.softwareheritage.org/source/jesuisgpl/',
'deposit_swh_anchor_id': 'swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb',
'deposit_swh_anchor_id_context': 'swh:1:rev:e76ea49c9ffbb7f73611087ba6e999b19e5d71eb;origin=https://forge.softwareheritage.org/source/jesuisgpl/',
'deposit_status_detail': 'The deposit has been successfully \
loaded into the Software Heritage archive'
}
.. _swh-deposit:
Software Heritage - Deposit
===========================
Push-based deposit of software source code artifacts to the archive.
.. toctree::
:maxdepth: 2
:caption: Contents:
getting-started
spec-api
metadata
dev-info
sys-info
specs/specs
Reference Documentation
-----------------------
.. toctree::
:maxdepth: 2
/apidoc/swh.deposit
.. _deposit-metadata:
Deposit metadata
================
When making a software deposit into the SWH archive, one can add
information describing the software artifact and the software project.
Metadata requirements
---------------------
- **the schema/vocabulary** used *MUST* be specified with a persistent url
(DublinCore, DOAP, CodeMeta, etc.)
.. code:: xml
<entry xmlns="http://www.w3.org/2005/Atom">
or
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:dcterms="http://purl.org/dc/terms/">
or
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
- **the name** of the software deposit *MUST* be provided [atom:title,
codemeta:name, dcterms:title]
- **the authors** of the software deposit *MUST* be provided
- **the url** representing the location of the source *MAY* be provided under
the url tag. The url will be used for creating an origin object in the
archive.
.. code:: xml
<codemeta:url>www.url-example.com</codemeta:url>
- **the external\_identifier** *MAY* be provided as an identifier
- **the external\_identifier** *SHOULD* match the Slug external-identifier in
the header
- **the description** of the software deposit *SHOULD* be provided
[codemeta:description]: short or long description of the software
- **the license/s** of the software
deposit *SHOULD* be provided [codemeta:license]
- other metadata *MAY* be added with terms defined by the schema in use.
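The two *MUST* requirements (a name and the authors) can be checked before sending a metadata file. The following is a rough sketch and not the validation performed by the server. The function name ``missing_required`` is hypothetical, and the set of author tags is an assumption inferred from the examples in this document, since the requirement itself does not list them.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
    "dcterms": "http://purl.org/dc/terms/",
}

# Tags accepted for the name, as listed in the requirements above.
NAME_TAGS = ("atom:title", "codemeta:name", "dcterms:title")
# Assumption: author tags inferred from the examples, not from the requirements.
AUTHOR_TAGS = ("atom:author", "codemeta:author", "dcterms:creator")


def missing_required(xml_text: str) -> list:
    """Return the names of the required pieces missing from an entry."""
    entry = ET.fromstring(xml_text)
    missing = []
    if not any(entry.find(tag, NS) is not None for tag in NAME_TAGS):
        missing.append("name")
    if not any(entry.find(tag, NS) is not None for tag in AUTHOR_TAGS):
        missing.append("authors")
    return missing
```

An empty list means both required pieces were found. The *SHOULD* and *MAY* items are deliberately left unchecked.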
Examples
--------
Using only Atom
~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
<title>Awesome Compiler</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>1785io25c695</external_identifier>
<updated>2017-10-07T15:17:08Z</updated>
<author>some awesome author</author>
</entry>
Using Atom with CodeMeta
~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<title>Awesome Compiler</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>1785io25c695</external_identifier>
<codemeta:id>1785io25c695</codemeta:id>
<codemeta:url>origin url</codemeta:url>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:description>description</codemeta:description>
<codemeta:keywords>key-word 1</codemeta:keywords>
<codemeta:keywords>key-word 2</codemeta:keywords>
<codemeta:dateCreated>creation date</codemeta:dateCreated>
<codemeta:datePublished>publication date</codemeta:datePublished>
<codemeta:releaseNotes>comment</codemeta:releaseNotes>
<codemeta:referencePublication>
<codemeta:name> article name</codemeta:name>
<codemeta:identifier> article id </codemeta:identifier>
</codemeta:referencePublication>
<codemeta:isPartOf>
<codemeta:type> Collaboration/Projet </codemeta:type>
<codemeta:name> project name</codemeta:name>
<codemeta:identifier> id </codemeta:identifier>
</codemeta:isPartOf>
<codemeta:relatedLink>see also </codemeta:relatedLink>
<codemeta:funding>Sponsor A </codemeta:funding>
<codemeta:funding>Sponsor B</codemeta:funding>
<codemeta:operatingSystem>Platform/OS </codemeta:operatingSystem>
<codemeta:softwareRequirements>dependencies </codemeta:softwareRequirements>
<codemeta:softwareVersion>Version</codemeta:softwareVersion>
<codemeta:developmentStatus>active </codemeta:developmentStatus>
<codemeta:license>
<codemeta:name>license</codemeta:name>
<codemeta:url>url spdx</codemeta:url>
</codemeta:license>
<codemeta:runtimePlatform>.Net Framework 3.0 </codemeta:runtimePlatform>
<codemeta:runtimePlatform>Python2.3</codemeta:runtimePlatform>
<codemeta:author>
<codemeta:name> author1 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:author>
<codemeta:name> author2 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:codeRepository>http://code.com</codemeta:codeRepository>
<codemeta:programmingLanguage>language 1</codemeta:programmingLanguage>
<codemeta:programmingLanguage>language 2</codemeta:programmingLanguage>
<codemeta:issueTracker>http://issuetracker.com</codemeta:issueTracker>
</entry>
Using Atom with DublinCore and CodeMeta (multi-schema entry)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: xml
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0">
<title>Awesome Compiler</title>
<client>hal</client>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<external_identifier>%s</external_identifier>
<dcterms:identifier>hal-01587361</dcterms:identifier>
<dcterms:identifier>doi:10.5281/zenodo.438684</dcterms:identifier>
<dcterms:title xml:lang="en">The assignment problem</dcterms:title>
<dcterms:title xml:lang="fr">AffectationRO</dcterms:title>
<dcterms:creator>author</dcterms:creator>
<dcterms:subject>[INFO] Computer Science [cs]</dcterms:subject>
<dcterms:subject>[INFO.INFO-RO] Computer Science [cs]/Operations Research [cs.RO]</dcterms:subject>
<dcterms:type>SOFTWARE</dcterms:type>
<dcterms:abstract xml:lang="en">Project in OR: The assignment problem. A Java implementation for the assignment problem, first release</dcterms:abstract>
<dcterms:abstract xml:lang="fr">description fr</dcterms:abstract>
<dcterms:created>2015-06-01</dcterms:created>
<dcterms:available>2017-10-19</dcterms:available>
<dcterms:language>en</dcterms:language>
<codemeta:url>origin url</codemeta:url>
<codemeta:softwareVersion>1.0.0</codemeta:softwareVersion>
<codemeta:keywords>key word</codemeta:keywords>
<codemeta:releaseNotes>Comment</codemeta:releaseNotes>
<codemeta:referencePublication>Référence interne </codemeta:referencePublication>
<codemeta:relatedLink>link </codemeta:relatedLink>
<codemeta:funding>Sponsor </codemeta:funding>
<codemeta:operatingSystem>Platform/OS </codemeta:operatingSystem>
<codemeta:softwareRequirements>dependencies </codemeta:softwareRequirements>
<codemeta:developmentStatus>Ended </codemeta:developmentStatus>
<codemeta:license>
<codemeta:name>license</codemeta:name>
<codemeta:url>url spdx</codemeta:url>
</codemeta:license>
<codemeta:codeRepository>http://code.com</codemeta:codeRepository>
<codemeta:programmingLanguage>language 1</codemeta:programmingLanguage>
<codemeta:programmingLanguage>language 2</codemeta:programmingLanguage>
</entry>
Note
----
We aim to harmonize the metadata from different origins; metadata
will therefore be translated to the `CodeMeta
v.2 <https://doi.org/10.5063/SCHEMA/CODEMETA-2.0>`__ vocabulary if
possible.
API Specification
=================
This is `Software Heritage <https://www.softwareheritage.org>`__'s
`SWORD
2.0 <http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html>`__
Server implementation.
**S.W.O.R.D** (**S**\ imple **W**\ eb-Service **O**\ ffering
**R**\ epository **D**\ eposit) is an interoperability standard for
digital file deposit.
This implementation will permit interaction between a client (a repository) and
a server (SWH repository) to push deposits of software source code archives
with associated metadata.
*Note:*
* In this document, we use ``archive`` and ``software source
code archive`` interchangeably.
* The supported archive formats are:
* zip: common zip archive (no multi-disk zip files).
* tar: tar archive without compression or optionally any of the following
compression algorithms: gzip (.tar.gz, .tgz), bzip2 (.tar.bz2), or lzma
(.tar.lzma)
Collection
----------
SWORD defines a ``collection`` concept. In SWH's case, this collection
refers to a group of deposits. A ``deposit`` is some form of software
source code archive(s) associated with metadata.
By default the client's collection will have the client's name.
Limitations
-----------
* upload limitation of 100 MiB
* no mediation
API overview
------------
API access is over HTTPS.
The API is protected through basic authentication.
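As a brief illustration of basic authentication over HTTPS, the sketch below prepares (but does not send) a request for the service document. The function name ``service_document_request`` is hypothetical. The base URL and the ``servicedocument/`` path are the ones given elsewhere in this documentation.

```python
import base64
import urllib.request


def service_document_request(
    name: str,
    password: str,
    base: str = "https://deposit.softwareheritage.org/1/",
) -> urllib.request.Request:
    """Prepare a GET request for the service document with basic auth."""
    # Basic authentication: base64("<name>:<pass>") in the Authorization header.
    token = base64.b64encode(f"{name}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base + "servicedocument/", method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request
```

The prepared request could then be passed to ``urllib.request.urlopen``. Nothing is sent by the function itself, so it can be inspected or tested offline.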
Endpoints
---------
The API endpoints are rooted at https://deposit.softwareheritage.org/1/.
Data is sent and received as XML (as specified in the SWORD 2.0
specification).
.. include:: endpoints/service-document.rst
.. include:: endpoints/collection.rst
.. include:: endpoints/update-media.rst
.. include:: endpoints/update-metadata.rst
.. include:: endpoints/status.rst
.. include:: endpoints/content.rst
Possible errors
---------------
* common errors:
* 401 (unauthenticated) if a client does not provide credentials or provides
wrong ones
* 403 (forbidden) if a client tries to access a collection it does not own
* 404 (not found) if a client tries to access an unknown collection
* 404 (not found) if a client tries to access an unknown deposit
* 415 (unsupported media type) if a wrong media type is provided to the
endpoint
* archive/binary deposit:
* 403 (forbidden) if the length of the archive exceeds the max size
configured
* 412 (precondition failed) if the length or hash provided does not match
that of the archive.
* 415 (unsupported media type) if a wrong media type is provided
* multipart deposit:
* 412 (precondition failed) if the md5 hash provided does not match that of
the archive
* 415 (unsupported media type) if a wrong media type is provided
* Atom entry deposit:
* 400 (bad request) if the request's body is empty (for creation only)
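The archive-related errors above can be anticipated on the client side before uploading. The sketch below is a pre-flight check written for illustration and is not the server's implementation. The function name ``expected_status``, the order of the checks and the list of accepted media types (taken from the sample service document) are assumptions.

```python
import hashlib

# Media types accepted by the collection in the sample service document.
ACCEPTED_TYPES = ("application/zip", "application/x-tar")


def expected_status(payload: bytes, declared_length: int, declared_md5: str,
                    content_type: str, max_upload_size: int) -> int:
    """Guess the status code an archive deposit would receive."""
    if content_type not in ACCEPTED_TYPES:
        return 415  # unsupported media type
    if len(payload) > max_upload_size:
        return 403  # archive exceeds the configured maximum size
    actual_md5 = hashlib.md5(payload).hexdigest()
    if len(payload) != declared_length or actual_md5 != declared_md5.lower():
        return 412  # declared length or hash does not match the archive
    return 201  # nothing wrong detected: creation should succeed
```

Running such a check locally avoids a round trip for the most common mistakes. The server remains the authority on the final status code.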
Sources
-------
* `SWORD v2 specification
<http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html>`__
* `arxiv documentation <https://arxiv.org/help/submit_sword>`__
* `Dataverse example <http://guides.dataverse.org/en/4.3/api/sword.html>`__
* `SWORD used on HAL <https://api.archives-ouvertes.fr/docs/sword>`__
* `xml examples for CCSD <https://github.com/CCSDForge/HAL/tree/master/Sword>`__
Use cases
---------
Deposit creation
~~~~~~~~~~~~~~~~
From client's deposit repository server to SWH's repository server:
1. The client requests the server's abilities and its associated collection
(GET query to the *SD/service document uri*)
2. The server answers the client with the service document which gives the
*collection uri* (also known as *COL/collection IRI*).
3. The client sends a deposit (optionally a zip archive, some metadata or both)
through the *collection uri*.
This can be done in:
* one POST request (metadata + archive).
* one POST request (metadata or archive) + other PUT or POST request to the
*update uris* (*edit-media iri* or *edit iri*)
a. Server validates the client's input or returns detailed error if any
b. Server stores information received (metadata or software archive source
code or both)
4. The server notifies the client it acknowledged the client's request. An
``http 201 Created`` response with a deposit receipt in the body response is
sent back. That deposit receipt will hold the necessary information to
eventually complete the deposit later on if it was incomplete (also known as
status ``partial``).
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884278} -->
.. figure:: ../images/deposit-create-chart.png
:alt:
Updating an existing deposit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5. Client updates existing deposit through the *update uris* (one or more POST
or PUT requests to either the *edit-media iri* or *edit iri*).
1. Server validates the client's input or returns detailed error if any
2. Server stores information received (metadata or software archive source
code or both)
This would be the case, for example, if the client initially posted a
``partial`` deposit (e.g. only metadata with no archive, or an archive
without metadata, or a split archive because the initial one exceeded
the size limit imposed by the swh deposit repository).
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884302} -->
.. figure:: ../images/deposit-update-chart.png
:alt:
Deleting deposit (or associated archive, or associated metadata)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6. Deposit deletion is possible as long as the deposit is still in ``partial``
state.
1. Server validates the client's input or returns detailed error if any
2. Server actually deletes information according to the request
Schema representation
^^^^^^^^^^^^^^^^^^^^^
.. raw:: html
<!-- {F2884311} -->
.. figure:: ../images/deposit-delete-chart.png
:alt:
Client asks for operation status
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7. Operation status can be read through a GET query to the *state iri*.
Server: Triggering deposit checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the status ``deposited`` is reached for a deposit, checks for the
associated archive(s) and metadata will be triggered. If those checks
fail, the status is changed to ``rejected`` and nothing more happens
there. Otherwise, the status is changed to ``verified``.
Server: Triggering deposit load
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once the status ``verified`` is reached for a deposit, loading the
deposit with its associated metadata will be triggered.
The loading will result in a status update, either ``done`` or ``failed``
(depending on the loading's status).
This is described in the `loading document <./spec-loading.html>`__.
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:codemeta="https://doi.org/10.5063/SCHEMA/CODEMETA-2.0"
xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
<author>
<name>HAL</name>
<email>hal@ccsd.cnrs.fr</email>
</author>
<client>hal</client>
<external_identifier>hal-01243573</external_identifier>
<codemeta:name>The assignment problem</codemeta:name>
<codemeta:url>https://hal.archives-ouvertes.fr/hal-01243573</codemeta:url>
<codemeta:identifier>other identifier, DOI, ARK</codemeta:identifier>
<codemeta:applicationCategory>Domain</codemeta:applicationCategory>
<codemeta:description>description</codemeta:description>
<codemeta:author>
<codemeta:name> author1 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<codemeta:author>
<codemeta:name> author2 </codemeta:name>
<codemeta:affiliation> Inria </codemeta:affiliation>
<codemeta:affiliation> UPMC </codemeta:affiliation>
</codemeta:author>
<swh:deposit>
<swh:bindings>
<swh:binding source="path/to/file.txt" destination="aaaaaaaaaaa..."/>
</swh:bindings>
</swh:deposit>
</entry>
Loading specification (draft)
=============================
This part discusses deposit loading on the server side.
Tarball Loading
---------------
The ``swh-loader-tar`` module is already able to inject tarballs in swh
with very limited metadata (mainly the origin).
The loading of the deposit will use the deposit's associated data:
* the metadata
* the archive(s)
We will use the notion of a ``synthetic`` revision.
The metadata will be associated with that revision and included
in the hash computation, thus resulting in a unique identifier.
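To illustrate why including the metadata in the hash yields a distinct identifier, here is a toy sketch. It is not the actual Software Heritage hashing scheme: the function name ``toy_revision_id``, the JSON serialization and the use of SHA-1 over that serialization are assumptions made purely for illustration.

```python
import hashlib
import json


def toy_revision_id(directory_id, metadata):
    """Return a hex digest covering both the directory and the metadata."""
    # Serialize deterministically so that equal inputs give equal digests.
    payload = json.dumps(
        {"directory": directory_id, "metadata": metadata},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


directory = "aaaaaaaa"  # placeholder directory identifier
first = toy_revision_id(directory, {"name": "The assignment problem"})
second = toy_revision_id(directory, {"name": "The assignment problem (v2)"})
# Same directory, different metadata: the two revision ids differ.
```

This mirrors the situation where a metadata-only update produces a new synthetic revision that still points to the same directory.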
Loading mapping
~~~~~~~~~~~~~~~
Some of those metadata will also be included in the ``origin_metadata``
table.
::
origin | https://hal.inria.fr/hal-id |
------------------------------------|----------------------------------------|
origin_visit | 1 :reception_date |
origin_metadata | aggregated metadata |
occurrence & occurrence_history     | branch: client's version n° (e.g hal) |
revision | synthetic_revision (tarball) |
directory | upper level of the uncompressed archive|
Questions raised concerning loading
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- A deposit has one origin, yet an origin can have multiple deposits?
No, an origin can have multiple requests for the same deposit, which
should end up in one single deposit (when the client pushes its final
request saying deposit 'done' through the header In-Progress).
Only an existing 'partial' deposit can be updated; the deposit 'update'
operation is not permitted on a deposit in any other status.
To create a new version of a software (already deposited), the client
must create a new deposit.
Illustration of a first deposit loading:
HAL's deposit 01535619 = SWH's deposit **01535619-1**
::
+ 1 origin with url:https://hal.inria.fr/medihal-01535619
+ 1 synthetic revision
+ 1 directory
HAL's update on deposit 01535619 = SWH's deposit **01535619-2**
(\*with HAL, updates can only affect the metadata; a new version is
required if the content changes)
::
+ 1 origin with url:https://hal.inria.fr/medihal-01535619
+ new synthetic revision (with new metadata)
+ same directory
HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1**
::
+ same origin
+ new revision
+ new directory
Technical details
-----------------
Requirements
~~~~~~~~~~~~
* one dedicated database to store the deposit's state - swh-deposit
* one dedicated temporary objstorage to store archives before loading
* one client to test the communication with SWORD protocol
Deposit reception schema
~~~~~~~~~~~~~~~~~~~~~~~~
* SWORD imposes the use of basic authentication, so we need a way to
authenticate clients. Also, a client can access collections:
**deposit\_client** table:
* id (bigint): Client's identifier
* username (str): Client's username
* password (pass): Client's hashed password
* collections ([id]): List of collections the client can access
* Collections group deposits together:
**deposit\_collection** table:
* id (bigint): Collection's identifier
* name (str): Collection's human readable name
* A deposit is the main object the repository is all about:
**deposit** table:
* id (bigint): deposit's identifier
* reception\_date (date): First deposit's reception date
* complete\_date (date): Date when the deposit is deemed complete and ready
for loading
* collection (id): The collection the deposit belongs to
* external id (text): client's internal identifier (e.g. hal's id, etc.).
* client\_id (id): Client that made the deposit
* swh\_id (str) : swh identifier result once the loading is complete
* status (enum): The deposit's current status
- As mentioned, a deposit can have a status, whose possible values are:
.. code:: text
'partial', -- the deposit is new or partially received since it
-- can be done in multiple requests
'expired', -- deposit has been there too long and is now deemed
-- ready to be garbage collected
'deposited', -- deposit complete, it is ready to be checked to ensure data consistency
'rejected', -- deposit failed the checks
'verified', -- deposit is fully received, checked, and ready for loading
'loading', -- loading is ongoing on swh's side
'done', -- loading is successful
'failed' -- loading is a failure
* A deposit is stateful and can be made in multiple requests:
**deposit\_request** table:
* id (bigint): identifier
* type (id): deposit request's type (possible values: 'archive', 'metadata')
* deposit\_id (id): deposit the request belongs to
* metadata: metadata associated to the request
* date (date): date of the request
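The ``deposit`` and ``deposit_request`` tables above can be mirrored with plain Python dataclasses, as in the sketch below. It is not the project's actual model code: the class names, the Python types and the choice of ``partial`` as the initial status are assumptions. The ``rejected`` status is included because the deposit checks can set it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class DepositStatus(Enum):
    PARTIAL = "partial"
    EXPIRED = "expired"
    DEPOSITED = "deposited"
    REJECTED = "rejected"
    VERIFIED = "verified"
    LOADING = "loading"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Deposit:
    id: int
    reception_date: datetime
    collection: int
    external_id: str
    client_id: int
    # Assumption: a new deposit starts as 'partial'.
    status: DepositStatus = DepositStatus.PARTIAL
    complete_date: Optional[datetime] = None  # unset until finalized
    swh_id: Optional[str] = None  # unset until loading is complete


@dataclass
class DepositRequest:
    id: int
    type: str  # 'archive' or 'metadata'
    deposit_id: int
    date: datetime
    metadata: dict = field(default_factory=dict)


deposit = Deposit(
    id=1,
    reception_date=datetime(2017, 1, 1),
    collection=1,
    external_id="hal-01243573",
    client_id=1,
)
```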
Information sent along with a request is stored in a ``deposit_request`` row.
They can be either of type ``metadata`` (atom entry, multipart's atom entry
part) or of type ``archive`` (binary upload, multipart's binary upload part).
When the deposit is complete (status ``deposited``), those ``metadata`` and
``archive`` deposit requests will be read and aggregated. They will then be
sent as parameters to the loading routine.
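The aggregation step can be sketched as follows. The function name, the dict-based request representation and the merge rule (a later metadata request overrides keys from an earlier one) are assumptions for illustration; the text above does not specify how conflicting metadata are reconciled.

```python
def aggregate_requests(deposit_requests):
    """Split deposit requests into merged metadata and a list of archives.

    ``deposit_requests`` is an iterable of dicts with the keys ``type``,
    ``date`` and ``metadata``, mirroring the ``deposit_request`` columns.
    """
    metadata = {}
    archives = []
    # Process requests in chronological order.
    for request in sorted(deposit_requests, key=lambda r: r["date"]):
        if request["type"] == "metadata":
            # Assumption: a later request overrides earlier keys.
            metadata.update(request["metadata"])
        elif request["type"] == "archive":
            archives.append(request["metadata"])
        else:
            raise ValueError(f"unknown request type: {request['type']!r}")
    return metadata, archives


# Dates are plain integers here for brevity; any sortable value works.
deposit_requests = [
    {"type": "archive", "date": 2, "metadata": {"name": "src.zip"}},
    {"type": "metadata", "date": 1, "metadata": {"name": "v1", "author": "a"}},
    {"type": "metadata", "date": 3, "metadata": {"name": "v2"}},
]
metadata, archives = aggregate_requests(deposit_requests)
```

The two returned values correspond to the parameters handed to the loading routine.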
During loading, some of those metadata are kept in the ``origin_metadata``
table and others are stored in the ``revision`` table (see `metadata
loading <#metadata-loading>`__).
The only update actions occurring on the deposit table concern:
* status changes:
  - ``partial`` -> {``expired``/``deposited``}
  - ``deposited`` -> {``rejected``/``verified``}
  - ``verified`` -> ``loading``
  - ``loading`` -> {``done``/``failed``}
* ``complete_date``, set when the deposit is finalized (when the status is
changed to ``deposited``)
* ``swh-id``, populated once we have the loading result
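The status transitions listed above can be sketched as a small lookup table. The names ``ALLOWED_TRANSITIONS`` and ``change_status`` are hypothetical; only the transitions themselves come from this document.

```python
# Each status maps to the set of statuses it may move to.
ALLOWED_TRANSITIONS = {
    "partial": {"expired", "deposited"},
    "deposited": {"rejected", "verified"},
    "verified": {"loading"},
    "loading": {"done", "failed"},
}


def change_status(current, new):
    """Return ``new`` if the transition is allowed, raise otherwise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current!r} -> {new!r} is not allowed")
    return new


# Walk a deposit through the successful path.
status = "partial"
for step in ("deposited", "verified", "loading", "done"):
    status = change_status(status, step)
```

Statuses without an entry in the table (``expired``, ``rejected``, ``done``, ``failed``) are treated as final in this sketch.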
SWH Identifier returned
^^^^^^^^^^^^^^^^^^^^^^^
::
The synthetic revision id
e.g.: swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e
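A client receiving such an identifier may want to validate it before storing it. The sketch below assumes the shape seen in the example above (``swh``, version ``1``, object type ``rev``, then a 40-character lowercase hexadecimal hash); the pattern and the helper name ``revision_hash`` are assumptions inferred from that single example.

```python
import re

# Assumed pattern, inferred from the example identifier.
SWH_REVISION_ID = re.compile(r"^swh:1:rev:([0-9a-f]{40})$")


def revision_hash(swh_id):
    """Return the hexadecimal hash of a revision identifier, or None."""
    match = SWH_REVISION_ID.match(swh_id)
    return match.group(1) if match else None


example = "swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e"
```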
Scheduling loading
~~~~~~~~~~~~~~~~~~
All ``archive`` and ``metadata`` deposit requests should be aggregated before
loading.
The loading should be scheduled via the scheduler's api.
Only ``verified`` deposits are concerned by the loading.
When the loading is done and successful, the deposit entry is updated:
* ``status`` is updated to ``done``
* ``swh-id`` is populated with the resulting hash (cf. `swh identifier
<#swh-identifier-returned>`__)
* ``complete_date`` is updated to the loading's finished time
When the loading fails, the deposit entry is updated:
* ``status`` is updated to ``failed``
* ``swh-id`` and ``complete_date`` remain as is
*Note:* As a further improvement, we may prefer having a retry policy with
graceful delays for further scheduling.
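The success and failure updates described above can be sketched as one function. The dict-based deposit, the key names and the function name ``apply_loading_result`` are hypothetical; the retry policy mentioned in the note is deliberately left out.

```python
from datetime import datetime, timezone


def apply_loading_result(deposit, success, swh_id=None, finished_at=None):
    """Return a copy of ``deposit`` (a dict) updated after loading."""
    updated = dict(deposit)
    if success:
        updated["status"] = "done"
        updated["swh_id"] = swh_id
        # Fall back to the current time if no finish time is given.
        updated["complete_date"] = finished_at or datetime.now(timezone.utc)
    else:
        # On failure only the status changes.
        updated["status"] = "failed"
    return updated


deposit = {"status": "loading", "swh_id": None, "complete_date": None}
finished = datetime(2017, 10, 1, tzinfo=timezone.utc)
done = apply_loading_result(
    deposit,
    success=True,
    swh_id="swh:1:rev:47dc6b4636c7f6cba0df83e3d5490bf4334d987e",
    finished_at=finished,
)
failed = apply_loading_result(deposit, success=False)
```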
Metadata loading
~~~~~~~~~~~~~~~~
- the metadata received with the deposit should be kept in the
``origin_metadata`` table before translation as part of the loading process,
and an indexing process should be scheduled.
- provider\_id and tool\_id are resolved by the prepare\_metadata method in the
loader-core
- the origin\_metadata entry is sent to storage by the send\_origin\_metadata
method in the loader-core
origin\_metadata table:
::
id bigint PK
origin bigint
discovery_date date
provider_id bigint FK // (from provider table)
tool_id bigint FK // indexer_configuration_id tool used for extraction
metadata jsonb // before translation