Commits on Source (893)
# Changes here will be overwritten by Copier
_commit: v0.3.3
_src_path: https://gitlab.softwareheritage.org/swh/devel/swh-py-template.git
description: Software Heritage deposit server
distribution_name: swh-deposit
have_cli: true
have_workers: true
package_root: swh/deposit
project_name: swh.deposit
python_minimal_version: '3.7'
readme_format: rst
# python: Reformat code with black
f5426d6722826972e2d611d4e7040abbf40c49a1
8a006aeebf7d0cf52abc71b07cd560cbd098349e
7b0fac22d29db6ad27cb650f835cae2f8786ad70
# isort
9c0d0496369828c8fad882d5d676978fb76105f8
*.egg-info/
*.pyc
*.sw?
*~
.coverage
.eggs/
.hypothesis
.mypy_cache
.tox
__pycache__
*.egg-info/
version.txt
/analysis.org
/swh/deposit/fixtures/private_data.yaml
/swh/deposit.json
/test.json
/swh/test
db.sqlite3
build/
dist/
# these are symlinks created by a hook in swh-docs' main sphinx conf.py
docs/README.rst
docs/README.md
# this should be a symlink for people who want to build the sphinx doc
# without using tox, generally created by the swh-env/bin/update script
docs/Makefile.sphinx
exclude: ^swh/deposit/tests/data/atom/.*$
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: check-json
      - id: check-yaml
  - repo: https://github.com/python/black
    rev: 25.1.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 6.0.0
    hooks:
      - id: isort
  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:
      - id: flake8
        additional_dependencies: [flake8-bugbear==24.12.12, flake8-pyproject]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.4.1
    hooks:
      - id: codespell
        name: Check source code spelling
        args: [-L sur]
        stages: [pre-commit]
      - id: codespell
        name: Check commit message spelling
        stages: [commit-msg]
  - repo: local
    hooks:
      - id: mypy
        name: mypy
        entry: env DJANGO_SETTINGS_MODULE=swh.deposit.settings.testing mypy
        args: [swh]
        pass_filenames: false
        language: system
        types: [python]
      - id: twine-check
        name: twine check
        description: call twine check when pushing an annotated release tag
        entry: bash -c "ref=$(git describe) &&
          [[ $ref =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] &&
          (python3 -m build --sdist && twine check $(ls -t dist/* | head -1)) || true"
        pass_filenames: false
        stages: [pre-push]
        language: python
        additional_dependencies: [twine, build]
# Software Heritage Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at `conduct@softwareheritage.org`. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an
incident. Further details of specific enforcement policies may be posted
separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
Ishan Bhanuka
include Makefile
include requirements.txt
include requirements-swh.txt
include version.txt
recursive-include swh/deposit/static *
recursive-include swh/deposit/fixtures *
recursive-include swh/deposit/templates *
FLAKEFLAGS='--exclude=swh/manage.py,swh/deposit/settings.py,swh/deposit/migrations/'
FLAKEFLAGS='--exclude=swh/deposit/manage.py,swh/deposit/settings.py,swh/deposit/migrations/'
MANAGE=python3 -m swh.manage
MANAGE=python3 -m swh.deposit.manage
db-drop:
dropdb swh-deposit-dev || return 0
......@@ -26,5 +26,9 @@ run-dev:
run:
gunicorn3 -b 127.0.0.1:5006 swh.deposit.wsgi
test:
cd swh && python3 -m manage test
# Override default rule to make sure DJANGO env var is properly set. It
# *should* work without any override thanks to the mypy django-stubs plugin,
# but it currently doesn't; see
# https://github.com/typeddjango/django-stubs/issues/166
check-mypy:
DJANGO_SETTINGS_MODULE=swh.deposit.settings.testing $(MYPY) $(MYPYFLAGS) swh
# Develop on swh-deposit
There are multiple modes to run and test the server locally:
- development-like (automatic reloading when code changes)
- production-like (no reloading)
- integration tests (no side effects)
Apart from the tests, which are mostly free of side effects (database
access excepted), these modes need up to two configuration files to
run properly.
## Database
swh-deposit uses a database to store the state of a deposit.
The default db is expected to be called swh-deposit-dev.
To simplify the use, the following makefile targets can be used:
### schema
``` Shell
make db-create db-prepare db-migrate
```
### data
Once the db is created, you need some data to be injected (request
types, client, collection, etc...):
``` Shell
make db-load-data db-load-private-data
```
The private data are about having a user (`hal`) with a password
(`hal`) who can access a collection (`hal`).
Add the following to `../private-data.yaml`:
``` YAML
- model: deposit.depositclient
  fields:
    user_ptr_id: 1
    collections:
      - 1
- model: auth.User
  pk: 1
  fields:
    first_name: hal
    last_name: hal
    username: hal
    password: "pbkdf2_sha256$30000$8lxjoGc9PiBm$DO22vPUJCTM17zYogBgBg5zr/97lH4pw10Mqwh85yUM="
- model: deposit.depositclient
  fields:
    user_ptr_id: 1
    collections:
      - 1
```
### drop
For information, you can drop the db:
``` Shell
make db-drop
```
## Development-like environment
Development-like environment needs one configuration file to work
properly.
### Configuration
**`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/deposit/server.yml**:
``` YAML
# dev option for running the server locally
host: 127.0.0.1
port: 5006
# production
authentication:
  activated: true
  white-list:
    GET:
      - /
# 20 MiB max size
max_upload_size: 20971520
```
### Run
Run the local server, using the default configuration file:
``` Shell
make run-dev
```
## Production-like environment
The production-like environment needs two configuration files to work
properly.
This is closer to what actually runs in production.
### Configuration
This expects the same file described in the previous section, plus an
additional private **private.yml** file containing secret
information that is not in the source code repository.
**`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/deposit/private.yml**:
``` YAML
secret_key: production-local
db:
  name: swh-deposit-dev
A production configuration file would look like:
``` YAML
secret_key: production-secret-key
db:
  name: swh-deposit-dev
  host: db
  port: 5467
  user: user
  password: user-password
```
### Run
``` Shell
make run
```
Note: This expects the gunicorn3 package to be installed on the system.
## Tests
To run the tests:
``` Shell
make test
```
As explained, those tests are mostly side-effect free. The database
part is handled by Django. The remaining side effects are patched in
the `swh/deposit/tests/__init__.py` module.
## Sum up
Prepare everything for your user to run:
``` Shell
make db-drop db-create db-prepare db-migrate db-load-private-data run-dev
```
# Getting Started
This is a getting-started guide demonstrating the deposit API use
cases with a shell client.
The api is rooted at https://deposit.softwareheritage.org.
For more details, see the [main README](./README.md).
## Requirements
You need to be referenced on SWH's client list to have:
- a credential (needed for the basic authentication step).
- an associated collection
[Contact us for more information.](https://www.softwareheritage.org/contact/)
## Demonstration
For the rest of the document, we will:
- reference `<client-name>` as the client and `<pass>` as its
associated authentication password.
- use curl as example on how to request the api.
- present the main deposit use cases.
The use cases are:
- one single deposit step: The user posts in one query (one deposit) a
software source code archive and associated metadata (deposit is
finalized with status `ready`).
This will demonstrate the multipart query.
- a multi-step deposit, shown here in 3 steps (more update requests
can be added):
1. Create an incomplete deposit (status `partial`)
2. Update a deposit (and finalize it, so the status becomes `ready`)
3. Check the deposit's state
This will demonstrate the stateful nature of the sword protocol.
Both use cases share a common first step: requesting the
`service document iri` (internationalized resource identifier) for
information about the collection's location.
### Common part - Start with the service document
First, to determine the *collection iri* to which data is deposited,
the client needs to ask the server where its *collection* is located.
That is the role of the *service document iri*.
For example:
``` Shell
curl -i --user <client-name>:<pass> https://deposit.softwareheritage.org/1/servicedocument/
```
If everything went well, you should have received a response similar
to this:
``` Shell
HTTP/1.0 200 OK
Server: WSGIServer/0.2 CPython/3.5.3
Content-Type: application/xml
<?xml version="1.0" ?>
<service xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:sword="http://purl.org/net/sword/terms/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/2007/app">
<sword:version>2.0</sword:version>
<sword:maxUploadSize>209715200</sword:maxUploadSize>
<workspace>
<atom:title>The Software Heritage (SWH) Archive</atom:title>
<collection href="https://deposit.softwareheritage.org/1/<collection-name>/">
<atom:title><client-name> Software Collection</atom:title>
<accept>application/zip</accept>
<sword:collectionPolicy>Collection Policy</sword:collectionPolicy>
<dcterms:abstract>Software Heritage Archive</dcterms:abstract>
<sword:treatment>Collect, Preserve, Share</sword:treatment>
<sword:mediation>false</sword:mediation>
<sword:acceptPackaging>http://purl.org/net/sword/package/SimpleZip</sword:acceptPackaging>
<sword:service>https://deposit.softwareheritage.org/1/<collection-name>/</sword:service>
</collection>
</workspace>
</service>
```
Explaining the response:
- `HTTP/1.0 200 OK`: the query is successful and returns a body response
- `Content-Type: application/xml`: The body response is in xml format
- `body response`: it is a service document describing that the client
`<client-name>` has a collection named `<collection-name>`. That
collection is available at the *collection iri*
`/1/<collection-name>/` (through POST query).
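As an illustration, a client could read the service document with Python's standard `xml.etree.ElementTree` module. This is only a minimal sketch, not part of swh-deposit: the function name `parse_service_document` is hypothetical, and the sample uses `hal` as a stand-in for `<collection-name>`.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by the service document shown above.
NS = {
    "app": "http://www.w3.org/2007/app",
    "sword": "http://purl.org/net/sword/terms/",
}


def parse_service_document(xml_text):
    """Return the collection IRI, accepted content type and upload limit."""
    root = ET.fromstring(xml_text)
    collection = root.find("app:workspace/app:collection", NS)
    return {
        "collection_iri": collection.get("href"),
        "accept": collection.findtext("app:accept", namespaces=NS),
        "max_upload_size": int(root.findtext("sword:maxUploadSize", namespaces=NS)),
    }


# Hypothetical, shortened sample: "hal" stands in for <collection-name>.
SAMPLE = """<?xml version="1.0" ?>
<service xmlns:sword="http://purl.org/net/sword/terms/"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns="http://www.w3.org/2007/app">
  <sword:version>2.0</sword:version>
  <sword:maxUploadSize>209715200</sword:maxUploadSize>
  <workspace>
    <atom:title>The Software Heritage (SWH) Archive</atom:title>
    <collection href="https://deposit.softwareheritage.org/1/hal/">
      <accept>application/zip</accept>
    </collection>
  </workspace>
</service>"""

info = parse_service_document(SAMPLE)
```

The `collection_iri` value is the address to which the deposit requests in the following sections are posted.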
At this stage, if something went wrong, it is most likely
authentication related, and the response is a `401 Unauthorized`.
Something like:
``` Shell
curl -i https://deposit.softwareheritage.org/1/<collection-name>/
HTTP/1.0 401 Unauthorized
Server: WSGIServer/0.2 CPython/3.5.3
Content-Type: application/xml
WWW-Authenticate: Basic realm=""
X-Frame-Options: SAMEORIGIN
<?xml version="1.0" encoding="utf-8"?>
<sword:error xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/">
<summary>Access to this api needs authentication</summary>
<sword:treatment>processing failed</sword:treatment>
</sword:error>
```
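For clients not using curl, the basic authentication step might be prepared as follows. This is a sketch under assumptions: the helper name `build_service_document_request` and the credentials are hypothetical, and the request is only built, not sent.

```python
import base64
import urllib.request


def build_service_document_request(base_url, username, password):
    """Prepare (but do not send) an authenticated GET for the service document."""
    # HTTP basic authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/1/servicedocument/",
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )


# Hypothetical credentials standing in for <client-name> and <pass>.
request = build_service_document_request(
    "https://deposit.softwareheritage.org", "client", "secret"
)
# Sending it would be: urllib.request.urlopen(request)
```

Without the `Authorization` header, the server is expected to answer with the 401 response shown above.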
### Single deposit
A single deposit translates to a multipart deposit request.
This means, in swh's deposit's terms, sending exactly one POST query
with:
- 1 archive (`content-type application/zip`)
- 1 atom xml content (`content-type: application/atom+xml;type=entry`)
For now, the supported archives are limited to zip files. Those
archives are expected to contain some form of software source
code. The atom entry content is XML defining metadata about that
software.
Example of minimal atom entry file:
``` XML
<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:dcterms="http://purl.org/dc/terms/">
<title>Title</title>
<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
<updated>2005-10-07T17:17:08Z</updated>
<author><name>Contributor</name></author>
<summary type="text">The abstract</summary>
<!-- some embedded metadata -->
<dcterms:abstract>The abstract</dcterms:abstract>
<dcterms:accessRights>Access Rights</dcterms:accessRights>
<dcterms:alternative>Alternative Title</dcterms:alternative>
<dcterms:available>Date Available</dcterms:available>
<dcterms:bibliographicCitation>Bibliographic Citation</dcterms:bibliographicCitation>
<dcterms:contributor>Contributor</dcterms:contributor>
<dcterms:description>Description</dcterms:description>
<dcterms:hasPart>Has Part</dcterms:hasPart>
<dcterms:hasVersion>Has Version</dcterms:hasVersion>
<dcterms:identifier>Identifier</dcterms:identifier>
<dcterms:isPartOf>Is Part Of</dcterms:isPartOf>
<dcterms:publisher>Publisher</dcterms:publisher>
<dcterms:references>References</dcterms:references>
<dcterms:rightsHolder>Rights Holder</dcterms:rightsHolder>
<dcterms:source>Source</dcterms:source>
<dcterms:title>Title</dcterms:title>
<dcterms:type>Type</dcterms:type>
</entry>
```
Once the files are ready for deposit, we want to do the actual deposit
in one shot.
For this, we need to provide:
- the contents and their associated correct content-types
- either the header `In-Progress` set to false (meaning the deposit is
complete after this query) or no such header (the server assumes the
deposit is not in progress when the header is absent).
- Optionally, the `Slug` header, which is a reference to a unique
identifier the client knows about and wants to provide us.
You can do this with the following command:
``` Shell
curl -i --user <client-name>:<pass> \
-F "file=@deposit.zip;type=application/zip;filename=payload" \
-F "atom=@atom-entry.xml;type=application/atom+xml;charset=UTF-8" \
-H 'In-Progress: false' \
-H 'Slug: some-external-id' \
-XPOST https://deposit.softwareheritage.org/1/<collection-name>/
```
You just posted a deposit to the collection `<collection-name>` at
https://deposit.softwareheritage.org/1/<collection-name>/.
If everything went well, you should have received a response similar
to this:
``` Shell
HTTP/1.0 201 Created
Server: WSGIServer/0.2 CPython/3.5.3
Location: /1/<collection-name>/10/metadata/
Content-Type: application/xml
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>9</deposit_id>
<deposit_date>Sept. 26, 2017, 10:11 a.m.</deposit_date>
<deposit_archive>payload</deposit_archive>
<!-- Edit-IRI -->
<link rel="edit" href="/1/<collection-name>/10/metadata/" />
<!-- EM-IRI -->
<link rel="edit-media" href="/1/<collection-name>/10/media/"/>
<!-- SE-IRI -->
<link rel="http://purl.org/net/sword/terms/add" href="/1/<collection-name>/10/metadata/" />
<sword:packaging>http://purl.org/net/sword/package/SimpleZip</sword:packaging>
</entry>
```
Explaining this response:
- `HTTP/1.0 201 Created`: the deposit is successful
- `Location: /1/<collection-name>/10/metadata/`: the EDIT-SE-IRI through which we can
update a deposit
- body response: it is a deposit receipt detailing all endpoints
available to manipulate the deposit (update, replace, delete,
etc.). It also gives the deposit identifier (`deposit_id`), which is
useful for the remaining examples.
Note: As the deposit is in `ready` status (meaning ready to be
injected), you cannot update anything after this query. Any further
update attempt is answered with a `403 Forbidden` response.
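The deposit receipt described above can also be read programmatically. The following is a hedged sketch using only the standard library; `parse_deposit_receipt` is a hypothetical helper, and the sample receipt is shortened and uses `hal` in place of `<collection-name>`.

```python
import xml.etree.ElementTree as ET

# The receipt's elements live in the Atom default namespace.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_deposit_receipt(xml_text):
    """Return the deposit id and a mapping of link relation -> href."""
    entry = ET.fromstring(xml_text)
    links = {link.get("rel"): link.get("href") for link in entry.iter(f"{ATOM}link")}
    return {
        "deposit_id": int(entry.findtext(f"{ATOM}deposit_id")),
        "links": links,
    }


# Hypothetical, shortened receipt.
RECEIPT = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/"
       xmlns:dcterms="http://purl.org/dc/terms/">
  <deposit_id>10</deposit_id>
  <link rel="edit" href="/1/hal/10/metadata/" />
  <link rel="edit-media" href="/1/hal/10/media/"/>
  <link rel="http://purl.org/net/sword/terms/add" href="/1/hal/10/metadata/" />
</entry>"""

receipt = parse_deposit_receipt(RECEIPT)
```

The `edit` and `edit-media` entries give the endpoints a client would use to update the metadata and the archives of a deposit that is still `partial`.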
### Multi-steps deposit
1. Create a deposit
We will use the collection IRI again as the starting point.
We need to explicitly give the server information about:
- the deposit's completeness (through the header `In-Progress` set to
true, as we now want to deposit in multiple steps).
- archive's md5 hash (through header `Content-MD5`)
- upload's type (through the headers `Content-Disposition` and
`Content-Type`)
The following command:
``` Shell
curl -i --user <client-name>:<pass> \
--data-binary @swh/deposit.zip \
-H 'In-Progress: true' \
-H 'Content-MD5: 0faa1ecbf9224b9bf48a7c691b8c2b6f' \
-H 'Content-Disposition: attachment; filename=[deposit.zip]' \
-H 'Slug: some-external-id' \
-H 'Packaging: http://purl.org/net/sword/package/SimpleZip' \
-H 'Content-type: application/zip' \
-XPOST https://deposit.softwareheritage.org/1/<collection-name>/
```
The expected answer is the same as the previous sample.
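The headers of the binary deposit request could be computed in Python as sketched below. The helper `binary_deposit_headers` is hypothetical. The plain `filename=` form of `Content-Disposition` is an assumption (the curl example wraps the name in brackets), and the packaging value is the one advertised in the service document.

```python
import hashlib


def binary_deposit_headers(archive_bytes, filename, slug, in_progress=True):
    """Build the headers for a binary (archive-only) deposit request."""
    return {
        # "true" keeps the deposit in `partial` status for later updates.
        "In-Progress": "true" if in_progress else "false",
        # Hex digest of the archive's md5 hash.
        "Content-MD5": hashlib.md5(archive_bytes).hexdigest(),
        # Assumption: plain filename, without the brackets used with curl.
        "Content-Disposition": f"attachment; filename={filename}",
        "Slug": slug,
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "Content-Type": "application/zip",
    }


# Empty payload used only to keep the example self-contained.
headers = binary_deposit_headers(b"", "deposit.zip", "some-external-id")
```

In real use, `archive_bytes` would be the content of the zip file read in binary mode.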
2. Update deposit's metadata
To update a deposit, we can either add some more archives, some more
metadata or replace existing ones.
As we have not defined any metadata yet (except for the `Slug` header),
we can add some through the `EDIT-SE-IRI` endpoint (/1/<collection-name>/10/metadata/).
That information is extracted from the deposit receipt sample.
We reuse the atom-entry.xml file presented in the previous section.
For example, here is the command to update deposit metadata:
``` Shell
curl -i --user <client-name>:<pass> --data-binary @atom-entry.xml \
-H 'In-Progress: true' \
-H 'Slug: some-external-id' \
-H 'Content-Type: application/atom+xml;type=entry' \
-XPOST https://deposit.softwareheritage.org/1/<collection-name>/10/metadata/
HTTP/1.0 201 Created
Server: WSGIServer/0.2 CPython/3.5.3
Location: /1/<collection-name>/10/metadata/
Content-Type: application/xml
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>10</deposit_id>
<deposit_date>Sept. 26, 2017, 10:32 a.m.</deposit_date>
<deposit_archive>None</deposit_archive>
<!-- Edit-IRI -->
<link rel="edit" href="/1/<collection-name>/10/metadata/" />
<!-- EM-IRI -->
<link rel="edit-media" href="/1/<collection-name>/10/media/"/>
<!-- SE-IRI -->
<link rel="http://purl.org/net/sword/terms/add" href="/1/<collection-name>/10/metadata/" />
<sword:packaging>http://purl.org/net/sword/package/SimpleZip</sword:packaging>
</entry>
```
3. Check the deposit's state
You need to check the STATE-IRI endpoint (/1/<collection-name>/10/status/).
``` Shell
curl -i --user <client-name>:<pass> https://deposit.softwareheritage.org/1/<collection-name>/10/status/
HTTP/1.0 200 OK
Date: Wed, 27 Sep 2017 08:25:53 GMT
Content-Type: application/xml
```
Response:
``` XML
<entry xmlns="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/"
xmlns:dcterms="http://purl.org/dc/terms/">
<deposit_id>9</deposit_id>
<status>ready</status>
<detail>deposit is fully received and ready for injection</detail>
</entry>
```
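A client may want to poll the status endpoint until the deposit reaches a given status. The sketch below assumes a caller-supplied `fetch` callable returning the status XML as text (for example, a wrapper around an authenticated GET on the STATE-IRI); `read_status` and `wait_for_status` are hypothetical names.

```python
import time
import xml.etree.ElementTree as ET

# The status document's elements live in the Atom default namespace.
ATOM = "{http://www.w3.org/2005/Atom}"


def read_status(xml_text):
    """Extract the <status> value from a status response body."""
    return ET.fromstring(xml_text).findtext(f"{ATOM}status")


def wait_for_status(fetch, target="ready", attempts=5, delay=0.0):
    """Call fetch() until the target status is seen or attempts run out.

    Returns the last status read (which may differ from the target).
    """
    status = None
    for _ in range(attempts):
        status = read_status(fetch())
        if status == target:
            break
        time.sleep(delay)
    return status
```

Passing the fetching function as a parameter keeps the polling logic independent from the HTTP client in use.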
# Injection specification (draft)
This part discusses the deposit injection part on the server side.
## Tarball Injection
The `swh-loader-tar` module is already able to inject tarballs in swh
with very limited metadata (mainly the origin).
The injection of the deposit will use the deposit's associated data:
- the metadata
- the archive(s)
We will use the `synthetic` revision notion.
The metadata will be associated with that revision and included in
the hash computation, thus resulting in a unique identifier.
### Injection mapping
Some of those metadata will also be included in the `origin_metadata`
table.
origin                            | https://hal.inria.fr/hal-id
----------------------------------|----------------------------------------
origin_visit                      | 1 :reception_date
occurrence & occurrence_history   | branch: client's version n° (e.g hal)
revision                          | synthetic_revision (tarball)
directory                         | upper level of the uncompressed archive
### Questions raised concerning injection
- A deposit has one origin, yet can an origin have multiple deposits?
No, an origin can have multiple requests for the same deposit,
which should end up in one single deposit (when the client pushes its final
request saying the deposit is 'done' through the header In-Progress).
Only updates of an existing 'partial' deposit are permitted; any other
deposit 'update' operation is not.
To create a new version of a software (already deposited), the client
must first create a new deposit.
Illustration of a first deposit injection:
HAL's deposit 01535619 = SWH's deposit **01535619-1**
+ 1 origin with url:https://hal.inria.fr/medihal-01535619
+ 1 synthetic revision
+ 1 directory
HAL's update on deposit 01535619 = SWH's deposit **01535619-2**
(*with HAL, updates can only concern the metadata; a new version is required
if the content changes)
+ 1 origin with url:https://hal.inria.fr/medihal-01535619
+ new synthetic revision (with new metadata)
+ same directory
HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1**
+ same origin
+ new revision
+ new directory
## Technical details
### Requirements
- one dedicated database to store the deposit's state - swh-deposit
- one dedicated temporary objstorage to store archives before
injection
- one client to test the communication with SWORD protocol
### Deposit reception schema
- SWORD imposes the use of basic authentication, so we need a way to
authenticate clients. Also, a client can access collections:
**deposit_client** table:
- id (bigint): Client's identifier
- username (str): Client's username
- password (pass): Client's crypted password
- collections ([id]): List of collections the client can access
- Collections group deposits together:
**deposit_collection** table:
- id (bigint): Collection's identifier
- name (str): Collection's human readable name
- A deposit is the main object the repository is all about:
**deposit** table:
- id (bigint): deposit's identifier
- reception_date (date): First deposit's reception date
- complete_date (date): Date when the deposit is deemed complete and ready for injection
- collection (id): The collection the deposit belongs to
- external id (text): client's internal identifier (e.g hal's id, etc...).
- client_id (id) : Client which did the deposit
- swh_id (str) : swh identifier result once the injection is complete
- status (enum): The deposit's current status
- As mentioned, a deposit can have a status, whose possible values
are:
``` text
'partial', -- the deposit is new or partially received since it
-- can be done in multiple requests
'expired', -- deposit has been there too long and is now deemed
-- ready to be garbage collected
'ready', -- deposit is fully received and ready for injection
'injecting', -- injection is ongoing on swh's side
'success', -- injection is successful
'failure' -- injection is a failure
```
A deposit is stateful and can be made in multiple requests:
**deposit_request** table:
- id (bigint): identifier
- type (id): deposit request's type (possible values: 'archive', 'metadata')
- deposit_id (id): deposit to which the request belongs
- metadata: metadata associated to the request
- date (date): date of the requests
Information sent along with a request is stored in a `deposit_request`
row.
They can be either of type `metadata` (atom entry, multipart's atom
entry part) or of type `archive` (binary upload, multipart's binary
upload part).
When the deposit is complete (status `ready`), those `metadata` and
`archive` deposit requests will be read and aggregated. They will then
be sent as parameters to the injection routine.
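The aggregation step described above might look like the following sketch. It is not the actual implementation: requests are modelled as plain dictionaries with hypothetical keys (`type`, `date`, `metadata`) mirroring the `deposit_request` columns, and `aggregate_requests` is an illustrative name.

```python
def aggregate_requests(deposit_status, requests):
    """Group the requests of a `ready` deposit by type, oldest first."""
    if deposit_status != "ready":
        raise ValueError(f"deposit is not ready for injection: {deposit_status}")
    aggregated = {"archive": [], "metadata": []}
    # ISO 8601 date strings sort chronologically when compared as text.
    for request in sorted(requests, key=lambda r: r["date"]):
        aggregated[request["type"]].append(request["metadata"])
    return aggregated


# Hypothetical requests belonging to one deposit.
requests = [
    {"type": "metadata", "date": "2017-09-26T10:32:00", "metadata": {"title": "Title"}},
    {"type": "archive", "date": "2017-09-26T10:11:00", "metadata": {"name": "deposit.zip"}},
]
parameters = aggregate_requests("ready", requests)
```

The resulting dictionary is what would be handed over to the injection routine in this simplified model.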
During injection, some of those metadata are kept in the
`origin_metadata` table and others are stored in the `revision`
table (see [metadata injection](#metadata-injection)).
The only update actions occurring on the deposit table concern:
- status changing:
- `partial` -> {`expired`/`ready`},
- `ready` -> `injecting`,
- `injecting` -> {`success`/`failure`}
- `complete_date` when the deposit is finalized (when the status is
changed to ready)
- `swh-id` is populated once we have the injection result
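The status transitions listed above can be captured as a small state machine. This is a sketch for illustration only; the names `ALLOWED_TRANSITIONS` and `change_status` are hypothetical and do not come from the swh-deposit code base.

```python
# Allowed status transitions, as listed above. Statuses with an empty
# set are final: no further change is permitted.
ALLOWED_TRANSITIONS = {
    "partial": {"expired", "ready"},
    "ready": {"injecting"},
    "injecting": {"success", "failure"},
    "expired": set(),
    "success": set(),
    "failure": set(),
}


def change_status(current, new):
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid status transition: {current} -> {new}")
    return new


status = change_status("partial", "ready")
status = change_status(status, "injecting")
```

Centralizing the transitions in one table makes it easy to reject, for instance, an update request on a deposit that is already `ready`.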
#### SWH Identifier returned
swh-<client-name>-<synthetic-revision-id>
e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e
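Building such an identifier is a simple string composition, sketched below. The function name `make_swh_id` is hypothetical, and the check that the revision identifier is a 40-character lowercase hexadecimal string is an assumption based on the example above.

```python
import re


def make_swh_id(client_name, revision_id):
    """Compose the identifier returned to the client after injection."""
    # Assumption: the synthetic revision id is a 40-character hex digest.
    if not re.fullmatch(r"[0-9a-f]{40}", revision_id):
        raise ValueError(f"unexpected revision id: {revision_id!r}")
    return f"swh-{client_name}-{revision_id}"


swh_id = make_swh_id("hal", "47dc6b4636c7f6cba0df83e3d5490bf4334d987e")
```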
### Scheduling injection
All `archive` and `metadata` deposit requests should be aggregated
before injection.
The injection should be scheduled via the scheduler's api.
Only `ready` deposits are eligible for injection.
When the injection is done and successful, the deposit entry is
updated:
- `status` is updated to `success`
- `swh-id` is populated with the resulting hash
(cf. [swh identifier](#swh-identifier-returned))
- `complete_date` is updated to the injection's finished time
When the injection fails, the deposit entry is updated:
- `status` is updated to `failure`
- `swh-id` and `complete_date` remain as is
*Note:* As a further improvement, we may prefer having a retry policy
with graceful delays for further scheduling.
### Metadata injection
- the metadata received with the deposit should be kept in the
`origin_metadata` table before translation as part of the injection
process and an indexation process should be scheduled.
origin_metadata table:
```
origin bigint PK FK
discovery_date date PK FK
translation_date date PK FK
provenance_type text // (enum: 'publisher', 'lister' needs to be completed)
raw_metadata jsonb // before translation
indexer_configuration_id bigint FK // tool used for translation
translated_metadata jsonb // with codemeta schema and terms
```
# Bootstrap swh-deposit on production
As usual, the Debian package is created and uploaded to the swh
Debian repository. Once the package is installed, we need to do a few
things regarding the database.
## Prepare the database setup (existence, connection, etc...).
This is defined through the packaged `swh.deposit.settings.production`
module and the expected **/etc/softwareheritage/deposit/private.yml**.
As usual, the expected configuration files are deployed through our
puppet manifest (cf. puppet-environment/swh-site,
puppet-environment/swh-role, puppet-environment/swh-profile)
## Migrate/bootstrap the db schema
``` Shell
sudo django-admin migrate --settings=swh.deposit.settings.production
```
## Load minimum defaults data
``` Shell
sudo django-admin loaddata --settings=swh.deposit.settings.production deposit_data
```
This adds the minimal:
- deposit request type 'archive' and 'metadata'
- 'hal' collection
Note: swh.deposit.fixtures.deposit_data is packaged
## Add client and collection
``` Shell
python3 -m swh.deposit.create_user --platform production \
--collection <collection-name> \
--username <client-name> \
--password <to-define>
```
This adds a user `<client-name>` which can access the collection
`<collection-name>`. The password will be used for the authentication
access to the deposit api.
Note: This creation procedure needs to be improved.
Software Heritage - Deposit
===========================
Simple Web-Service Offering Repository Deposit (S.W.O.R.D) is an interoperability
standard for digital file deposit.
This repository is both the `SWORD v2`_ Server and a deposit command-line client
implementations.
This implementation allows interaction between a client (a repository) and a server (SWH
repository) to deposit software source code archives and associated metadata.
Description
-----------
Most of the software source code artifacts present in the SWH Archive are gathered by
means of `loader`_ workers run by the SWH project from source code
origins identified by `lister`_ workers. This is a pull mechanism: it's
the responsibility of the SWH project to gather and collect source code artifacts that
way.
Alternatively, SWH allows its partners to push source code artifacts and metadata
directly into the Archive with a push-based mechanism. By using this possibility,
different actors holding software artifacts or metadata can preserve their assets
without having to pass through an intermediate collaborative development platform
that is already harvested by SWH (e.g. GitHub, GitLab, etc.).
This mechanism is the ``deposit``.
The main idea is that the deposit offers authenticated access to an API allowing the
user to provide source code artifacts -- with metadata -- to be ingested in the SWH
Archive. The result of that is a `SWHID`_ that can be used to uniquely
and persistently identify that very piece of source code.
This unique identifier can then be used to `reference the source code
<https://hal.archives-ouvertes.fr/hal-02446202>`_ (e.g. in a `scientific paper
<https://www.softwareheritage.org/2020/05/26/citing-software-with-style/>`_) and
retrieve it using the `vault`_ feature of the SWH Archive platform.
The differences between a piece of code uploaded using the deposit rather than simply
asking SWH to archive a repository using the `save code now`_ feature
are:
- a deposited artifact is provided from one of the SWH partners which is regarded as a
trusted authority,
- a deposited artifact requires metadata properties describing the source code artifact,
- a deposited artifact has a codemeta_ metadata entry attached to it,
- a deposited artifact has the same visibility on the SWH Archive as a collected
repository,
- a deposited artifact can be searched with its provided url property on the SWH
Archive,
- the deposit API uses the `SWORD v2`_ API, thus requires some tooling to send deposits
to SWH. These tools are provided with this repository.
See the `User Manual`_ page for more details on how to use the deposit client
command line tools to push a deposit in the SWH Archive.
See the `API Documentation`_ reference pages of the SWORDv2 API implementation
in ``swh.deposit`` if you want to upload deposits using HTTP requests.
Read the `Deposit metadata`_ chapter to get more details on what metadata
are supported when doing a deposit.
See `Running swh-deposit locally`_ if you want to hack the code of the ``swh.deposit`` module.
See `Production deployment`_ if you want to deploy your own copy of the
``swh.deposit`` stack.
.. _codemeta: https://codemeta.github.io/
.. _SWORD v2: http://swordapp.org/sword-v2/
.. _loader: https://docs.softwareheritage.org/devel/glossary.html#term-loader
.. _lister: https://docs.softwareheritage.org/devel/glossary.html#term-lister
.. _SWHID: https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#persistent-identifiers
.. _vault: https://docs.softwareheritage.org/devel/swh-vault/index.html#swh-vault
.. _save code now: https://archive.softwareheritage.org/save/
.. _User Manual: https://docs.softwareheritage.org/devel/swh-deposit/api/user-manual.html#deposit-user-manual
.. _API Documentation: https://docs.softwareheritage.org/devel/swh-deposit/api/api-documentation.html#deposit-api-specifications
.. _Deposit metadata: https://docs.softwareheritage.org/devel/swh-deposit/api/metadata.html#deposit-metadata
.. _Running swh-deposit locally: https://docs.softwareheritage.org/devel/swh-deposit/internals/dev-environment.html#swh-deposit-dev-env
.. _Production deployment: https://docs.softwareheritage.org/devel/swh-deposit/internals/prod-environment.html#swh-deposit-prod-env
\ No newline at end of file
DEPOSIT_ID=1
ARCHIVE=../../swh-deposit.zip
ARCHIVE2=../../swh-model.zip
STATUS=false
PARTIAL_STATUS=true
UPDATE_STATUS='success'
STATUS=--no-partial
PARTIAL_STATUS=--partial
UPDATE_STATUS='done'
ATOM_ENTRY=../../atom-entry.xml
EXTERNAL_ID='external-id'
create-archives:
7z a $(ARCHIVE) $(FOLDER)
......@@ -12,11 +14,14 @@ create-archives:
new:
./create_deposit.sh $(ARCHIVE) $(STATUS)
new-complete:
./create_deposit_with_metadata.sh $(ARCHIVE) $(ATOM_ENTRY) $(STATUS) $(EXTERNAL_ID)
new-partial:
make new STATUS=$(PARTIAL_STATUS) ARCHIVE=$(ARCHIVE)
update:
./update-deposit-with-another-archive.sh $(DEPOSIT_ID) $(ARCHIVE_2) $(STATUS)
./update-deposit-with-another-archive.sh $(DEPOSIT_ID) $(ARCHIVE2) $(STATUS)
update-partial:
make update DEPOSIT_ID=$(DEPOSIT_ID) ARCHIVE2=$(ARCHIVE2) STATUS=$(PARTIAL_STATUS)
......
#!/usr/bin/env bash
. ./default-setup
DEPOSIT_ID=${1-1}
curl -i -u "${CREDS}" ${SERVER}/1/${COLLECTION}/${DEPOSIT_ID}/content/
......@@ -4,18 +4,13 @@
ARCHIVE=${1-'../../deposit.zip'}
NAME=$(basename ${ARCHIVE})
MD5=$(md5sum ${ARCHIVE} | cut -f 1 -d' ')
STATUS=${2-'--no-partial'}
PROGRESS=${2-'false'}
curl -i -u "$CREDS" \
-X POST \
--data-binary @${ARCHIVE} \
-H "In-Progress: ${PROGRESS}" \
-H "Content-MD5: ${MD5}" \
-H "Content-Disposition: attachment; filename=${NAME}" \
-H 'Slug: external-id' \
-H 'Packaging: http://purl.org/net/sword/package/SimpleZip' \
-H 'Content-type: application/zip' \
${SERVER}/1/${COLLECTION}/
./swh-deposit \
--username ${USER} \
--password ${PASSWORD} \
--collection ${COLLECTION} \
--archive-deposit \
--archive ${ARCHIVE} \
${STATUS} \
--url ${SERVER}/1
......@@ -2,17 +2,20 @@
. ./default-setup
ARCHIVE=${1-'../../deposit.zip'}
ARCHIVE=${1-'../../swh-deposit.zip'}
ATOM_ENTRY=${2-'../../atom-entry.xml'}
NAME=$(basename $ARCHIVE)
MD5=$(md5sum $ARCHIVE | cut -f 1 -d' ')
STATUS=${3-'--no-partial'}
EXTERNAL_ID=${4-'external-id'}
PROGRESS=${3-'false'}
curl -i --user "${CREDS}" \
-F "file=@${ARCHIVE};type=application/zip;filename=payload" \
-F "atom=@${ATOM_ENTRY};type=application/atom+xml;charset=UTF-8" \
-H "In-Progress: ${PROGRESS}" \
-H 'Slug: some-external-id' \
-XPOST ${SERVER}/1/${COLLECTION}/
./swh-deposit \
--username ${USER} \
--password ${PASSWORD} \
--collection ${COLLECTION} \
--archive-deposit \
--archive ${ARCHIVE} \
--metadata-deposit \
--metadata ${ATOM_ENTRY} \
--slug ${EXTERNAL_ID} \
${STATUS} \
--url ${SERVER}/1
SERVER=http://127.0.0.1:5006
CREDS='hal:hal'
USER='hal'
PASSWORD='hal'
COLLECTION=hal
CREDS="$USER:$PASSWORD"
......@@ -4,4 +4,10 @@
DEPOSIT_ID=${1-1}
curl -i -u "${CREDS}" ${SERVER}/1/${COLLECTION}/${DEPOSIT_ID}/status/
./swh-deposit \
--username ${USER} \
--password ${PASSWORD} \
--collection ${COLLECTION} \
--status \
--deposit-id ${DEPOSIT_ID} \
--url ${SERVER}/1