diff --git a/Makefile.local b/Makefile.local index 1f9e09bedf5661c414a06c55d623d575b029fafd..a976450f935558fd1eb7f1d88f7672a77aa63297 100644 --- a/Makefile.local +++ b/Makefile.local @@ -1,5 +1,24 @@ +FLAKEFLAGS='--exclude=swh/manage.py,swh/deposit/settings.py,swh/deposit/migrations/' + +MANAGE=python3 -m swh.manage + +db-drop: + dropdb swh-deposit-dev + +db-create: + createdb swh-deposit-dev + +db-prepare: + $(MANAGE) makemigrations + +db-migrate: + $(MANAGE) migrate + +db-load-data: + $(MANAGE) loaddata deposit_data + run-dev: - python3 -m swh.deposit.server ./resources/deposit/server.yml + $(MANAGE) runserver run: - python3 -m swh.deposit.server ~/.config/swh/deposit/server.yml + gunicorn3 swh.deposit.wsgi diff --git a/PKG-INFO b/PKG-INFO index e9b0bbe86584577adaeb326174f4a8b3e5caf3ba..2d559c0343c5e24816889aa1a5a7449a451a8537 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,6 +1,6 @@ Metadata-Version: 1.0 Name: swh.deposit -Version: 0.0.2 +Version: 0.0.3 Summary: Software Heritage Deposit Server Home-page: https://forge.softwareheritage.org/source/swh-deposit/ Author: Software Heritage developers diff --git a/README b/README index a584224681ef7dfc1b44da216eb9d6ee0a05d89b..5c86fcb5a2863e7f044620d1c3ba7f3ba86fc49b 100644 --- a/README +++ b/README @@ -1,5 +1,604 @@ -swh-deposit -=========== += swh-deposit (draft) = -SWH's SWORD Deposit Server +This is SWH's SWORD Server implementation. +SWORD (Simple Web-Service Offering Repository Deposit) is an +interoperability standard for digital file deposit. + +This protocol will be used to interact between a client (a repository) +and a server (swh repository) to permit the deposit of software +tarballs. + +In this document, we will refer to a client (e.g. HAL server) and a +server (SWH's). + + +== Use cases == + +=== First deposit === + +From client's deposit repository server to SWH's repository server +(aka deposit). + +1. The client requests for the server's abilities. +(GET query to the *service document uri*) + +2. 
The server answers the client with the service document + +3. The client sends the deposit (an archive -> .zip, .tar.gz) +through the deposit *creation uri*. +(one or more POST requests since the archive and metadata can be sent +in multiple requests) + + +4. The server notifies the client it acknowledged the +client's request. ('http 201 Created' with a deposit receipt id in +the Location header of the response) + + +=== Updating an existing archive === + +5. Client updates existing archive through the deposit *update uri* +(one or more PUT requests, in effect chunking the artifact to deposit) + +=== Deleting an existing archive === + +6. Document deletion will not be implemented, +cf. limitation paragraph for detail + +=== Client asks for operation status and repository id === + +NOTE: add specifications about operation status and injection + +== API overview == + +API access is over HTTPS. + +Service document accessible at: +https://archive.softwareheritage.org/api/1/servicedocument/ + +API endpoints: + + - without a specific collection, are rooted at + https://archive.softwareheritage.org/api/1/deposit/. + + - with a specific and unique collection dubbed 'software', are rooted at + https://archive.softwareheritage.org/api/1/software/. + + +IMPORTANT: Determine which one of those solutions to use according to SWORD possibilities +(cf. 'unclear points' chapter below) + +== Limitations == + +Applying the SWORD protocol procedure will result in deliberate implementation +shortcomings during the first iteration: + +- upload limitation of 200 MiB +- only tarballs (.zip, .tar.gz) will be accepted +- no removal (implementation-wise, this will possibly be a means + to hide the origin). 
+- no mediation (we do not know the other system's users) +- basic http authentication enforced at the application layer + on a per client basis (authentication: + http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#authenticationmediateddeposit) + +== Unclear points == + +- SWORD defines a 'collection' concept. Should we apply the 'collection' concept + even though SWH is a software archive having one 'software' collection? + - option A: + The collection refers to a group of documents to which the document sent + (aka deposit) belongs. In this process with HAL, HAL is the collection, + maybe tomorrow we will do the same with MIT and MIT could be + the collection (the logic of the answer above is a result of this + link: https://hal.inria.fr/USPC for the USPC collection) + + **result**: 1 client is equivalent to 1 collection in this case. + The client pushes us software in 'their' one collection. + The collection name could show up in the uri endpoint. + + - option B: + Define none? (is it possible? I don't think it is due to the service + document part listing the collection to act upon...) + + **result**: the deposited software has no other entry point via + collection name + + +== <a name="scenarios"> Scenarios == +=== 1. Client request for Service Document === + +This is the endpoint permitting the client to ask for the server's abilities. + + +==== API endpoint ==== + +GET api/1/servicedocument/ + +Answer: +> 200, Content-Type: application/atomserv+xml: OK, with the body + described below + +==== Sample request: ==== + +```lang=shell +GET https://archive.softwareheritage.org/api/1/servicedocument HTTP/1.1 +Host: archive.softwareheritage.org +``` + +=== 2. 
Server response for Service Document === + +The server returns its abilities with the service document in xml format: +- protocol sword version v2 +- accepted mime types: application/zip, application/gzip +- upload max size accepted, beyond that, it's expected the client + chunk the tarball into multiple ones +- the collections the client can act upon (swh supports only one software collection) +- mediation not supported + +==== Sample answer: ==== +``` lang=xml +<?xml version="1.0" ?> +<service xmlns:dcterms="http://purl.org/dc/terms/" + xmlns:sword="http://purl.org/net/sword/terms/" + xmlns:atom="http://www.w3.org/2005/Atom" + xmlns="http://www.w3.org/2007/app"> + + <sword:version>2.0</sword:version> + <sword:maxUploadSize>${max_upload_size}</sword:maxUploadSize> + + <workspace> + <atom:title>The SWH archive</atom:title> + + <collection href="https://archive.softwareheritage.org/api/1/deposit/"> + <atom:title>SWH Collection</atom:title> + <accept>application/gzip</accept> + <accept alternate="multipart-related">application/gzip</accept> + <dcterms:abstract>Software Heritage Archive Deposit</dcterms:abstract> + <sword:mediation>false</sword:mediation> + <sword:acceptPackaging>http://purl.org/net/sword/package/SimpleZip</sword:acceptPackaging> + </collection> + </workspace> +</service> +``` + + +== Deposit Creation: client point of view == + +Process of deposit creation: + +-> [3] client request + + - [3.1] server validation + - [3.2] server temporary upload + - [3.3] server injects deposit into archive* + +<- [4] server returns deposit receipt id + + +NOTE: [3.3] Asynchronously, the server will inject the archive uploaded and the +associated metadata. The operation status mentioned +earlier is a reference to that injection operation. + +The image below represents only the communication and creation of +a deposit: +{F2403754} + +=== [3] client request === + +The client can send a deposit through one request deposit or multiple requests deposit. 
+ +The deposit can contain: +- an archive holding the software source code, +- an envelope with metadata describing information regarding a deposit, +- or both (Multipart deposit). + +The client can deposit a binary file, supplying the following headers: +- Content-Type (text): accepted mimetype +- Content-Length (int): tarball size +- Content-MD5 (text): hex-encoded md5 checksum of the tarball +- Content-Disposition (text): attachment; filename=[filename] ; the filename + parameter must be text (ascii) +- Packaging (IRI): http://purl.org/net/sword/package/SimpleZip +- In-Progress (bool): true to specify it's not the last request, false + to specify it's a final request and the server can go on with + processing the request's information. + +WARNING: if In-Progress is not present, the server MUST assume that it is false + +==== API endpoint ==== + +POST /api/1/deposit/ + +==== One request deposit ==== + +The one request deposit is a single request containing both the metadata (body) +and the archive (attachment). + +A Multipart deposit is a request containing an archive along with metadata about +that archive (can be applied in a one request deposit or multiple requests). + +Client provides: +- Content-Disposition (text): header of type 'attachment' on the Entry + Part with a name parameter set to 'atom' +- Content-Disposition (text): header of type 'attachment' on the Media + Part with a name parameter set to payload and a filename parameter + (the filename will be expressed in ASCII). +- Content-MD5 (text): hex-encoded md5 checksum of the tarball +- Packaging (text): http://purl.org/net/sword/package/SimpleZip + (packaging format used on the Media Part) +- In-Progress (bool): true|false; true means partial upload and we can expect + other requests in the future, false means the deposit is done. 
+- add metadata formats or foreign markup to the atom:entry element + + +==== sample request for multipart deposit: ==== + +``` lang=xml +POST deposit HTTP/1.1 +Host: archive.softwareheritage.org +Content-Length: [content length] +Content-Type: multipart/related; + boundary="===============1605871705=="; + type="application/atom+xml" +In-Progress: false +MIME-Version: 1.0 + +Media Post +--===============1605871705== +Content-Type: application/atom+xml; charset="utf-8" +Content-Disposition: attachment; name="atom" +MIME-Version: 1.0 + +<?xml version="1.0"?> +<entry xmlns="http://www.w3.org/2005/Atom" + xmlns:dcterms="http://purl.org/dc/terms/"> + <title>Title</title> + <id>hal-or-other-archive-id</id> + <updated>2005-10-07T17:17:08Z</updated> + <author><name>Contributor</name></author> + + <!-- some embedded metadata TO BE DEFINED --> + +</entry> +--===============1605871705== +Content-Type: application/zip +Content-Disposition: attachment; name=payload; filename=[filename] +Packaging: http://purl.org/net/sword/package/SimpleZip +Content-MD5: [md5-digest] +MIME-Version: 1.0 + +[...binary package data...] +--===============1605871705==-- +``` + +== Deposit Creation - server point of view == + +The server receives the request and: + +=== [3.1] Validation of the header and body request === + + +=== [3.2] Server uploads the content in a temporary location === +(deposit table in a separate DB). +- saves the archive in a temporary location +- computes the md5 checksum of that archive and checks it against the + corresponding header information +- adds a deposit entry and retrieves the associated id + + +=== [4] Server answers the client === +The server answers with an 'http 201 Created', with a deposit receipt id in the Location header of +the response. + +The server's possible answers are: +- OK: '201 created' + one header 'Location' holding the deposit receipt + id +- KO: with the error status code and associated message + (cf. [possible errors paragraph](#errors)). 
+ + +=== [5] Deposit Update === + +The client previously uploaded an archive and wants to add either new +metadata information or a new version for that previous deposit +(possibly in multiple steps as well). The important thing to note +here is that for swh, this will result in a new version of the +previous deposit in any case. + +Providing the identifier of the previous version deposit received from +the status URI, the client executes a PUT request on the same URI as +the deposit one. + +After validation of the body request, the server: +- uploads such content in a temporary location (to be defined). + +- answers the client with an 'http 204 (No Content)'. In the Location + header of the response lies a deposit receipt id permitting the + client to check back the operation status later on. + +- Asynchronously, the server will inject the archive uploaded and the + associated metadata. The operation status mentioned earlier is a + reference to that injection operation. The fact that the version is + a new one is dealt with at the injection level. + + URL: PUT /1/deposit/<previous-swh-id> + +=== [6] Deposit Removal === + +As explained in the Limitations paragraph, removal won't +be implemented. Nothing is removed from the SWH archive. + +The server answers with a '405 Method Not Allowed' error. + +=== Operation Status === + +Providing a deposit receipt id, the client asks for the operation status +of a prior upload. 
+ + URL: GET /1/collection/{deposit_receipt} + +or + + GET /1/deposit/{deposit_receipt} + +NOTE: depends on the decision taken about collections + +## <a name="errors"> Possible errors + +### sword:ErrorContent + +IRI: http://purl.org/net/sword/error/ErrorContent + +The supplied format is not the same as that identified in the +Packaging header and/or that supported by the server. + +Associated HTTP Status: 415 (Unsupported Media Type) or 406 (Not Acceptable) + +### sword:ErrorChecksumMismatch + +IRI: http://purl.org/net/sword/error/ErrorChecksumMismatch + +Checksum sent does not match the calculated checksum. The server MUST +also return a status code of 412 Precondition Failed. + +### sword:ErrorBadRequest + +IRI: http://purl.org/net/sword/error/ErrorBadRequest + +Some parameters sent with the POST/PUT were not understood. The server +MUST also return a status code of 400 Bad Request. + +### sword:MediationNotAllowed + +IRI: http://purl.org/net/sword/error/MediationNotAllowed + +Used where a client has attempted a mediated deposit, but this is not +supported by the server. The server MUST also return a status code of +412 Precondition Failed. + +### sword:MethodNotAllowed + +IRI: http://purl.org/net/sword/error/MethodNotAllowed + +Used when the client has attempted one of the HTTP update verbs (POST, +PUT, DELETE) but the server has decided not to respond to such +requests on the specified resource at that time. 
The server MUST also +return a status code of 405 Method Not Allowed + +### sword:MaxUploadSizeExceeded + +IRI: http://purl.org/net/sword/error/MaxUploadSizeExceeded + +Used when the client has attempted to supply to the server a file +which exceeds the server's maximum upload size limit + +Associated HTTP Status: 413 (Request Entity Too Large) + +--------------- + +== Tarball Injection == + +Providing we use indeed synthetic revision to represent a version of a +tarball injected through the sword use case, this needs to be improved +so that the synthetic revision is created with a parent revision (the +previous known one for the same 'origin'). + + +=== Injection mapping === +| origin | https://hal.inria.fr/hal-id | +|-------------------------------------|---------------------------------------| +| origin_visit | 1 :reception_date | +| occurrence & occurrence_history | branch: client's version n° (e.g hal) | +| revision | synthetic_revision (tarball) | +| directory | upper level of the uncompressed archive| + + +=== Questions raised concerning injection: === +- A deposit has one origin, yet an origin can have multiple deposits ? + +No, an origin can have multiple requests for the same deposit, +which should end up in one single deposit (when the client pushes its final +request saying deposit 'done' through the header In-Progress). + +When an update of a deposit is requested, +the new version is identified with the external_id. 
+ +Illustration First deposit injection: + +HAL's deposit 01535619 = SWH's deposit **01535619-1** + + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + 1 synthetic revision + + + 1 directory + +HAL's update on deposit 01535619 = SWH's deposit **01535619-2** + +(*with HAL updates can only be on the metadata and a new version is required +if the content changes) + + 1 origin with url:https://hal.inria.fr/medihal-01535619 + + + new synthetic revision (with new metadata) + + + same directory + +HAL's deposit 01535619-v2 = SWH's deposit **01535619-v2-1** + + + same origin + + + new revision + + + new directory + + + +== Technical details == + +We will need: +- one dedicated db to store state - swh-deposit + +- one dedicated temporary storage to store archives before injection + +- one client to test the communication with SWORD protocol + +=== Deposit reception schema === + +- **deposit** table: + - id (bigint): deposit receipt id + + - external id (text): client's internal identifier (e.g hal's id, etc...). 
+ + - origin id : null before injection + - swh_id : swh identifier result once the injection is complete + + - reception_date: first deposit date + + - complete_date: reception date of the last deposit which makes the deposit + complete + + - status (enum): +``` + 'partial', -- the deposit is new or partially received since it + -- can be done in multiple requests + 'expired', -- deposit has been there too long and is now deemed + -- ready to be garbage collected + 'ready', -- deposit is fully received and ready for injection + 'scheduled', -- injection is scheduled on swh's side + 'success', -- injection successful + 'failure' -- injection failure +``` +- **deposit_request** table: + - id (bigint): identifier + - deposit_id: deposit concerned by the request + - metadata: metadata associated to the request + +- **client** table: + - id (bigint): identifier + - name (text): client's name (e.g HAL) + - credentials + + +All metadata (declared metadata) are stored in deposit_request (with the +request they were sent with). +When the deposit is complete metadata fields are aggregated and sent +to injection. During injection the metadata is kept in the +origin_metadata table (see [metadata injection](#metadata-injection)). + +The only update actions occurring on the deposit table are in regards of: + - status changing + - partial -> {expired/ready}, + - ready -> scheduled, + - scheduled -> {success/failure} + - complete_date when the deposit is finalized + (when the status is changed to ready) + - swh-id being populated once we have the result of the injection + +==== SWH Identifier returned? ==== + + swh-<client-name>-<synthetic-revision-id> + + e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e + + We could have a specific dedicated 'client' table to reference client + identifier. + +=== Scheduling injection === +All data and metadata separated with multiple requests should be aggregated +before injection. 
+ +TODO: injection modeling + +=== Metadata injection === +- the metadata received with the deposit should be kept in the origin_metadata +table before translation as part of the injection process and an indexing +process should be scheduled. + +origin_metadata table: +``` +origin bigint PK FK +discovery_date date PK FK +translation_date date PK FK +provenance_type text // (enum: 'publisher', 'lister' needs to be completed) +raw_metadata jsonb // before translation +indexer_configuration_id bigint FK // tool used for translation +translated_metadata jsonb // with codemeta schema and terms +``` + +== Nomenclature == + +SWORD uses IRIs (Internationalized Resource Identifiers). In +this chapter, we will describe SWH's IRIs. + +=== SD-IRI - The Service Document IRI === + +This is the IRI from which the root service document can be +located. + +=== Col-IRI - The Collection IRI === + +Only one collection of software is used in this repository. + +NOTE: +This is the IRI to which the initial deposit will take place, and +which is listed in the Service Document. +Discuss to check if we want to implement this or not. + +=== Cont-IRI - The Content IRI === + +This is the IRI from which the client will be able to retrieve +representations of the object as it resides in the SWORD server. + +=== EM-IRI - The Atom Edit Media IRI === + +To simplify, this is the same as the Cont-IRI. + +=== Edit-IRI - The Atom Entry Edit IRI === + +This is the IRI of the Atom Entry of the object, and therefore also of +the container within the SWORD server. + +=== SE-IRI - The SWORD Edit IRI === + +This is the IRI to which clients may POST additional content to an +Atom Entry Resource. This MAY be the same as the Edit-IRI, but is +defined separately as it supports HTTP POST explicitly while the +Edit-IRI is defined by [AtomPub] as limited to GET, PUT and DELETE +operations. 
+ +=== State-IRI - The SWORD Statement IRI === + +This is the one of the IRIs which can be used to retrieve a +description of the object from the sword server, including the +structure of the object and its state. This will be used as the +operation status endpoint. + +== Sources == + +- [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) +- [arxiv documentation](https://arxiv.org/help/submit_sword) +- [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) +- [SWORD used on HAL](https://api.archives-ouvertes.fr/docs/sword) +- [xml examples for CCSD](https://github.com/CCSDForge/HAL/tree/master/Sword) diff --git a/debian/control b/debian/control index 70dc4641456a5ac90842a13743140e1da28cacf0..0a6377122522c44bb4f852dc46fb7e34f3063208 100644 --- a/debian/control +++ b/debian/control @@ -6,9 +6,8 @@ Build-Depends: debhelper (>= 9), dh-python, python3-setuptools, python3-all, - python3-aiohttp, python3-swh.core (>= 0.0.14~), - python3-jinja2, + python3-django, python3-click, python3-vcversioner Standards-Version: 3.9.6 diff --git a/doc/specs.md b/doc/specs.md deleted file mode 100644 index 8c7ef8524cc66adc9c6ca436d7def10658c227ea..0000000000000000000000000000000000000000 --- a/doc/specs.md +++ /dev/null @@ -1,615 +0,0 @@ -swh-deposit (draft) -=================== - -This is SWH's SWORD Server implementation. - -SWORD (Simple Web-Service Offering Repository Deposit) is an -interoperability standard for digital file deposit. - -This protocol will be used to interact between a client (a repository) -and a server (swh repository) to permit the deposit of software -tarballs. - -In this document, we will refer to a client (e.g. HAL server) and a -server (SWH's). - -Table of contents ---------------------- -1. [use cases](#uc) -2. [api overview](#api) -3. [limitations](#limitations) -4. [scenarios](#scenarios) -5. [errors](#errors) -6. [tarball injection](#tarball) -7. [technical](#technical) -8. 
[sources](#sources) - -# <a name="uc"> Use cases - -## First deposit - -From client's deposit repository server to SWH's repository server (aka deposit). - --[\[1\]](#1) The client requests for the server's abilities. - (GET query to the *service document uri*) - -- [\[2\]](#2)The server answers the client with the service document - -- [\[3\]](#3) The client sends the deposit (an archive -> .zip, .tar.gz) through the deposit - *creation uri*. - (one or more POST requests since the archive and metadata can be sent in multiple times) - - -- [\[4\]](#4) The server notifies the client it acknowledged the client's request. - ('http 201 Created' with a deposit receipt id in the Location header of the response) - - -## Updating an existing archive - --[\[5\]](#5) Client updates existing archive through the deposit *update uri* - (one or more PUT requests, in effect chunking the artifact to deposit) - -## Deleting an existing archive - -- [\[6\]](#6) Document deletion will not be implemented, cf. limitation paragraph for - detail - -## Client asks for operation status and repository id - -I'm not sure yet as to how this goes in the sword protocol. -I speak of operation status but i've yet to find a reference to this in the sword spec. - -- [\[7\]](#7)TODO: Detail this when clear - -# <a name="api"> API overview - -API access is over HTTPS. - -service document accessible at: https://archive.softwareheritage.org/api/1/servicedocument/ - -API endpoints: - - - without a specific collection, are rooted at https://archive.softwareheritage.org/api/1/deposit/. - - - with a specific and unique collection dubbed 'software', are rooted at https://archive.softwareheritage.org/api/1/software/. - - -TODO: Determine which one of those solutions according to sword possibilities (cf. 
'unclear points' chapter below) - -# <a name="limitations"> Limitations - -With this SWORD protocol procedure there will be some voluntary implementation shortcomings: - -- no removal -- no mediation (we do not know the other system's users) -- upload limitation of 200Mib -- only tarballs (.zip, .tar.gz) will be accepted -- no authentication enforced at the application layer -- basic authentication at the server layer - -## unclear points - -- SWORD defines a 'collection' notion. But, as SWH is a software archive, we have only one 'software' collection. - -I think the collection refers to a group of documents to which the document sent (aka deposit) is part of -in this process with HAL, HAL is the collection, maybe tomorrow we will do the same with MIT and MIT could be the collection -(the logic of the anwser above is a result of this link: https://hal.inria.fr/USPC the USPC collection) - -that makes sense. -Still, i don't think we want to do this. -Or, objectively, i don't see how to implement this correctly. - -Specifically, I think, the client can push directly the documents to us. -If for some reasons, we want to list the 'documents', we could distinguish then -(as this could help in reducing the length of documents per client, 1 client being equivalent as 1 collection in this case). - -What should we do with this? - - Define one? - - Define none? (is it possible? i don't think it is due to the service document part listing the collection to act upon...) - - -# <a name="scenarios"> Scenarios -## <a name="1">[1] Client request for Service Document - -This is the endpoint permitting the client to ask the server's abilities. 
- - -### API endpoint - -GET api/1/servicedocument/ - -Answer: -- 200, Content-Type: application/atomserv+xml: OK, with the body - described below - -### Sample request: - -``` shell -GET https://archive.softwareheritage.org/api/1/servicedocument HTTP/1.1 -Host: archive.softwareheritage.org -``` - -## <a name="2"> [2] Sever respond for Service Document - -The server returns its abilities with the service document in xml format: -- protocol sword version v2 -- accepted mime types: application/zip, application/gzip -- upload max size accepted, beyond that, it's expected the client - chunk the tarball into multiple ones -- the collections the client can act upon (swh supports only one software collection) -- mediation not supported - -### Sample answer: -``` xml -<?xml version="1.0" ?> -<service xmlns:dcterms="http://purl.org/dc/terms/" - xmlns:sword="http://purl.org/net/sword/terms/" - xmlns:atom="http://www.w3.org/2005/Atom" - xmlns="http://www.w3.org/2007/app"> - - <sword:version>2.0</sword:version> - <sword:maxUploadSize>${max_upload_size}</sword:maxUploadSize> - - <workspace> - <atom:title>The SWH archive</atom:title> - - <collection href="https://archive.softwareherigage.org/api/1/deposit/"> - <atom:title>SWH Collection</atom:title> - <accept>application/gzip</accept> - <accept alternate="multipart-related">application/gzip</accept> - <dcterms:abstract>Software Heritage Archive Deposit</dcterms:abstract> - <sword:mediation>false</sword:mediation> - <sword:acceptPackaging>http://purl.org/net/sword/package/SimpleZip</sword:acceptPackaging> - </collection> - </workspace> -</service> -``` - - -## Deposit Creation: client point of view - -Process of deposit creation: - -> [3] client request -> - - ( [3.1] server validation -> [3.2] server temporary upload ) -> [3.3] server injects deposit into archive - - <- [4] server returns deposit receipt id - - -- [3.3] Asynchronously, the server will inject the archive uploaded and the - associated metadata. 
The operation status mentioned - earlier is a reference to that injection operation. - -## <a name="3"></a> [[3] client request - -The client can send a deposit through one request deposit or multiple requests deposit. - -The deposit can contain: -- an archive holding the software source code, -- an envelop with metadata describing information regarding a deposit, -- or both (Multipart deposit). - -the client can deposit a binary file, supplying the following headers: -- Content-Type (text): accepted mimetype -- Content-Length (int): -- Content-MD5 (text): md5 checksum hex encoded of the tarball (we may need to check for the possibility to support a more secure hash) -- Content-Disposition (text): attachment; filename=[filename] ; the filename - parameter must be text (ascii) -- Packaging (IRI): http://purl.org/net/sword/package/SimpleZip -- In-Progress (bool): true to specify it's not the last request, false - to specify it's a final request and the server can go on with - processing the request's information - -TODO: required fields (MUST, SHOULD) - -I think the optional one is In-Progress, which if not there should be considered done (I'll check the spec for this). - -### API endpoint - -POST /api/1/deposit/ - -### One request deposit - -The one request deposit is a single request containing both the metadata (body) and the archive (attachment). - -A Multipart deposit is a request of an archive along with metadata about -that archive (can be applied in a one request deposit or multiple requests). - -Client provides: -- Content-Disposition (text): header of type 'attachment' on the Entry - Part with a name parameter set to 'atom' -- Content-Disposition (text): header of type 'attachment' on the Media - Part with a name parameter set to payload and a filename parameter - (the filename will be expressed in ASCII). 
-- Content-MD5 (text): md5 checksum hex encoded of the tarball -- Packaging (text): http://purl.org/net/sword/package/SimpleZip - (packaging format used on the Media Part) -- In-Progress (bool): true|false; true means partial upload and we can expect - other requests in the future, false means the deposit is done. -- add metadata formats or foreign markup to the atom:entry element - - -### sample request for multipart deposit: - -``` xml -POST deposit HTTP/1.1 -Host: archive.softwareheritage.org -Content-Length: [content length] -Content-Type: multipart/related; - boundary="===============1605871705=="; - type="application/atom+xml" -In-Progress: false -MIME-Version: 1.0 - -Media Post ---===============1605871705== -Content-Type: application/atom+xml; charset="utf-8" -Content-Disposition: attachment; name="atom" -MIME-Version: 1.0 - -<?xml version="1.0"?> -<entry xmlns="http://www.w3.org/2005/Atom" - xmlns:dcterms="http://purl.org/dc/terms/"> - <title>Title</title> - <id>hal-or-other-archive-id</id> - <updated>2005-10-07T17:17:08Z</updated> - <author><name>Contributor</name></author> - - <!-- some embedded metadata TO BE DEFINED --> - -</entry> ---===============1605871705== -Content-Type: application/zip -Content-Disposition: attachment; name=payload; filename=[filename] -Packaging: http://purl.org/net/sword/package/SimpleZip -Content-MD5: [md5-digest] -MIME-Version: 1.0 - -[...binary package data...] ---===============1605871705==-- -``` - -## Deposit Creation - server point of view - -The server receives the request and: - -### [3.1] Validation of the header and body request - - -### [3.2] Server uploads such content in a temporary location (deposit table in a separated DB). 
-- saves the archives in a temporary location -- executes a md5 checksum on that archive and check it against the - same header information -- adds a deposit entry and retrieves the associated id - - -## <a name="4"></a> [[4] Servers answers the client an 'http 201 Created' with a deposit receipt id in the Location header of - the response. - -The server possible answers are: -- OK: '201 created' + one header 'Location' holding the deposit receipt - id -- KO: with the error status code and associated message - (cf. [possible errors paragraph](#possible errors)). - - -## <a name="5"></a> [5] Deposit Update - -The client previously uploaded an archive and wants to add either new -metadata information or a new version for that previous deposit -(possibly in multiple steps as well). The important thing to note -here is that for swh, this will result in a new version of the -previous deposit in any case. - -Providing the identifier of the previous version deposit received from -the status URI, the client executes a PUT request on the same URI as -the deposit one. - -After validation of the body request, the server: -- uploads such content in a temporary location (to be defined). - -- answers the client an 'http 204 (No content)'. In the Location - header of the response lies a deposit receipt id permitting the - client to check back the operation status later on. - -- Asynchronously, the server will inject the archive uploaded and the - associated metadata. The operation status mentioned earlier is a - reference to that injection operation. The fact that the version is - a new one is dealt with at the injection level. - -URL: PUT /1/deposit/<previous-swh-id> - -## <a name="6"></a> [6] Deposit Removal - -[#limitation](As explained in the limitation paragraph), removal won't -be implemented. Nothing is removed from the SWH archive. - -The server answers a '405 Method not allowed' error. 
- - -## <a name="7"></a>[7] Operation Status - -Providing a deposit receipt id, the client asks the operation status -of a prior upload. - -URL: GET /1/software/{deposit_receipt} - -# <a name="errors"> Possible errors - -## sword:ErrorContent - -IRI: http://purl.org/net/sword/error/ErrorContent - -The supplied format is not the same as that identified in the -Packaging header and/or that supported by the server Associated HTTP - -Status: 415 (Unsupported Media Type) or 406 (Not Acceptable) - -## sword:ErrorChecksumMismatch - -IRI: http://purl.org/net/sword/error/ErrorChecksumMismatch - -Checksum sent does not match the calculated checksum. The server MUST -also return a status code of 412 Precondition Failed - -## sword:ErrorBadRequest - -IRI: http://purl.org/net/sword/error/ErrorBadRequest - -Some parameters sent with the POST/PUT were not understood. The server -MUST also return a status code of 400 Bad Request. - -## sword:MediationNotAllowed - -IRI: http://purl.org/net/sword/error/MediationNotAllowed - -Used where a client has attempted a mediated deposit, but this is not -supported by the server. The server MUST also return a status code of -412 Precondition Failed. - -## sword:MethodNotAllowed - -IRI: http://purl.org/net/sword/error/MethodNotAllowed - -Used when the client has attempted one of the HTTP update verbs (POST, -PUT, DELETE) but the server has decided not to respond to such -requests on the specified resource at that time. 
The server MUST also -return a status code of 405 Method Not Allowed - -## sword:MaxUploadSizeExceeded - -IRI: http://purl.org/net/sword/error/MaxUploadSizeExceeded - -Used when the client has attempted to supply to the server a file -which exceeds the server's maximum upload size limit - -Associated HTTP Status: 413 (Request Entity Too Large) - -# <a name="tarball"> Tarball Injection - -Providing we use indeed synthetic revision to represent a version of a -tarball injected through the sword use case, this needs to be improved -so that the synthetic revision is created with a parent revision (the -previous known one for the same 'origin'). - - -Note: -- origin may no longer be the right term (we may need a new 'at the - same level' notion, maybe 'deposit'?) - - * deposit is used for the information - - we agreed that for now origin seems fine enough - - -- As there are no authentication, everyone can push a new version for - the same origin so we might need to use the synthetic revision's - author (or committer?) date to discriminate which is the last known - version for the same 'origin'. - Note: - We'll do something simple, the last version is the last one injected. - The order should be enforced by the scheduling part of the injection, respecting the reception date. - We may need another date, the one when the deposit is considered complete and use that date. - - -## Injection path - - origin --> origin_visit --> occurrence & occurrence_history --> revision --> directory (upper level of the uncompressed archive) - ok for me - https://hal.inria.fr/hal-01327170 --> 1 :reception_date --> branch: client's version n° (e.g hal) --> synthetic_revision (tarball) - - -Questions: - - can an update be on a version without having a new version? 
- No, if something is pushed for the same origin via PUT (update), it will result in a new version (well when the deposit will be complete, injection triggered and done that is) - - For example, depositing only new metadata for the same hal deposit version without providing a new archive can result in a new version targetting the same previous archive. - And in that case, we won't need the archive again since the targetted directory tree won't have changed, we can simply reuse it. - That is, we'll create a new synthetic revision targetting the same tree whose parent revision is the last know revision for that origin. - Is it clear? :D - so we keep raw metadata in the synthetic revision, yes (we need those to have different hash on revision, the revision metadata column is used to compute its hash). - - That makes me think that for the creation (POST). - Once the client has said, deposit done for an origin. - Any further request for that origin should be refused (since they should pass by the PUT endpoint as update). - - Shortcoming: - what about concurrent deposit for the same origin? - How do we distinguish them? - - A: The client should identify each package sent if it belongs to a chuncked deposit or a new request for same deposit - - On SWH, we should treat each request separately as a new deposit ??? i think yes (I'm answering myself) because the date of reception should be new - - and the depposit receipt id should be new as well - - -Actions possible on HAL after deposit is public: - - modify metadata - - add file - - deposit new version - - link ressource - - share property - - use as model - - - - A deposit has one origin, yet an origin can have multiple deposits ? - No, not multiple deposits, multiple requests for the same origin, but in the end, this should end up in one single deposit - (when the client pushes its final request saying deposit 'done' through the header In-Progress). 
- When I say multiple deposits, I mean multiple versions/ updates on a deposit identified with external_id ok - you are talking about multiple requests in the sense of chuncked deposits yes - - - HAL's deposit 01535619 = SWH's deposit 01535619-1 - - + 1 origin with url:https://hal.inria.fr/medihal-01535619 - - + 1 revision - - + 1 directory - - deposit 01535619-v2 = SWH's deposit 01535619-2 - - + same origin - - + new revision - - + new directory - - - -## <a name="technical"> Technical - -We will need: -- one dedicated db to store state - swh-deposit - -- one dedicated temporary storage to store archives before injection - -- 'deposit' table: - - id (bigint): deposit receipt id - - external id (text): client's internal identifier (e.g hal's id, etc...). - - origin id : null before injection - - revision id : null before full injection I don't think we should store this as this will move at each new version... - - reception_date: first deposit date - - complete_date: reception date of the last deposit which makes the deposit complete - - metadata: jsonb (raw format before translation) - - status (enum): - -'partially-received', -- when only a part of the deposit was received (through multiple requests) - - -'received', -- deposit is fully received (last request arrived) - - -'injecting', -- injection is ongoing on swh's side - - -'injected', -- injection is successfully done - - - 'failed' -- injection failed due to some error - -- the metadata received with the deposit should be kept in the origin_metadata table - - after translation as part of the injection process - - - what's the origin_metadata table? - - This is the new table we talked with Zack about - - yes, but i wanted some more details - - it's in swh db? 
- - nothing about metadata is implemented yet - - but it should be in the main db - - right - - still, the nice thing about what we are doing can be untangled yes it's nice - - That is we could run in production the simple deposit stuff (which does not do anything about the deposit injection yet) - - we accept query and store deposits (since we need the scheduling one-shot task as well... which can be worrisome about the delay) - - - - i remember zack and you spoke about it during the 'tech meeting' but i did not follow everything at that time. - - origin bigint PK FK - - visit bigint PK FK // ? - - date date - - provenance_type text // (enum: 'publisher', 'external_catalog' needs to be completed) - - location url // only needed if there are use cases where this differs from origin for external_catalogs - - raw_metadata jsonb // before translation - - indexer_configuration_id bigint FK // tool used for translation - - translated_metadata jsonb // with codemeta schema and terms - - -# SWH Identifier returned? - - swh-<client-name>-<synthetic-revision-id> - - e.g: swh-hal-47dc6b4636c7f6cba0df83e3d5490bf4334d987e - - We could have a specific dedicated client table. - -# <a name="nomenclature"> Nomenclature - -SWORD uses IRI. This means Internationalized Resource Identifier. In -this chapter, we will describe SWH's IRI. - -## SD-IRI - The Service Document IRI - -This is the IRI from which the root service document can be -located. - -## Col-IRI - The Collection IRI - -Only one collection of software is used in this repository. - -Note: -This is the IRI to which the initial deposit will take place, and -which are listed in the Service Document. -Discuss to check if we want to implement this or not. - -## Cont-IRI - The Content IRI - -This is the IRI from which the client will be able to retrieve -representations of the object as it resides in the SWORD server. - -## EM-IRI - The Atom Edit Media IRI - -To simplify, this is the same as the Cont-IRI. 
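The identifier format proposed in the 'SWH Identifier returned?' section above can be written down as a short helper. This is only a sketch: `make_swh_id` is a hypothetical name, and the check on 40 hexadecimal characters assumes that the synthetic revision id is a hex digest of that length, as the example value suggests.

```python
import re

# Assumption: the synthetic revision id is a 40-character hexadecimal
# digest, as in the example 'swh-hal-47dc6b46...' given in the spec.
_REVISION_ID = re.compile(r'^[0-9a-f]{40}$')


def make_swh_id(client_name, revision_id):
    """Build 'swh-<client-name>-<synthetic-revision-id>' (hypothetical helper)."""
    if not _REVISION_ID.match(revision_id):
        raise ValueError('unexpected revision id: %r' % revision_id)
    return 'swh-%s-%s' % (client_name, revision_id)
```

Called with the values from the spec, `make_swh_id('hal', '47dc6b4636c7f6cba0df83e3d5490bf4334d987e')` gives back the example identifier quoted above.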
- -## Edit-IRI - The Atom Entry Edit IRI - -This is the IRI of the Atom Entry of the object, and therefore also of -the container within the SWORD server. - -## SE-IRI - The SWORD Edit IRI - -This is the IRI to which clients may POST additional content to an -Atom Entry Resource. This MAY be the same as the Edit-IRI, but is -defined separately as it supports HTTP POST explicitly while the -Edit-IRI is defined by [AtomPub] as limited to GET, PUT and DELETE -operations. - -## State-IRI - The SWORD Statement IRI - -This is the one of the IRIs which can be used to retrieve a -description of the object from the sword server, including the -structure of the object and its state. This will be used as the -operation status endpoint. - -# <a name="sources"> sources - -- [SWORD v2 specification](http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html) -- [arxiv documentation](https://arxiv.org/help/submit_sword) -- [Dataverse example](http://guides.dataverse.org/en/4.3/api/sword.html) -- [SWORD used on HAL]https://api.archives-ouvertes.fr/docs/sword -- [xml examples for CCSD] https://github.com/CCSDForge/HAL/tree/master/Sword diff --git a/requirements.txt b/requirements.txt index 004061b83fa923f0d47481bf3658fbc7635ed8ec..aef7b9e5feabf2345e209388c19ec6e3a34ed686 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,5 @@ vcversioner -aiohttp -jinja retrying click +django diff --git a/resources/deposit/server.yml b/resources/deposit/server.yml index c8f33f3337d15ebfaa4b8f0ead2617b0ff0b9d64..013012d59509e972b59b768d38a62ab796f07ad1 100644 --- a/resources/deposit/server.yml +++ b/resources/deposit/server.yml @@ -1,5 +1,4 @@ -port: 5012 -host: localhost - # 200 Mib max size max_upload_size: 209715200 +verbose: False +debug: False diff --git a/setup.py b/setup.py index d1409c8d99f9f6e5a9dfc7f658a3844404dd636d..63d17209256e995600bcfa8ac637e75c945c09c3 100644 --- a/setup.py +++ b/setup.py @@ -13,15 +13,17 @@ def parse_requirements(): return requirements -# Edit this 
part to match your module -# full sample: https://forge.softwareheritage.org/diffusion/DCORE/browse/master/setup.py setup( name='swh.deposit', description='Software Heritage Deposit Server', author='Software Heritage developers', author_email='swh-devel@inria.fr', url='https://forge.softwareheritage.org/source/swh-deposit/', - packages=['swh.deposit'], + packages=['swh.deposit', + 'swh.deposit.fixtures', + 'swh.deposit.migrations', + 'swh.deposit.templates', + 'swh.deposit.templates.deposit'], scripts=[], # scripts to package install_requires=parse_requirements(), setup_requires=['vcversioner'], diff --git a/sql/swh-schema.sql b/sql/swh-schema.sql index fe9dd64ebadda8087beaf04357f38bbdf27a14f6..e6c87d6ddb530c5f0c3e26355c790949204afb48 100644 --- a/sql/swh-schema.sql +++ b/sql/swh-schema.sql @@ -48,7 +48,7 @@ comment on column deposit_type.name is 'Human readable name for the deposit type create table deposit( id bigserial primary key, reception_date timestamptz not null, - complete_date timestamptz not null, + complete_date timestamptz, type serial not null references deposit_type(id), external_id text not null, status deposit_status not null, diff --git a/swh.deposit.egg-info/PKG-INFO b/swh.deposit.egg-info/PKG-INFO index e9b0bbe86584577adaeb326174f4a8b3e5caf3ba..2d559c0343c5e24816889aa1a5a7449a451a8537 100644 --- a/swh.deposit.egg-info/PKG-INFO +++ b/swh.deposit.egg-info/PKG-INFO @@ -1,6 +1,6 @@ Metadata-Version: 1.0 Name: swh.deposit -Version: 0.0.2 +Version: 0.0.3 Summary: Software Heritage Deposit Server Home-page: https://forge.softwareheritage.org/source/swh-deposit/ Author: Software Heritage developers diff --git a/swh.deposit.egg-info/SOURCES.txt b/swh.deposit.egg-info/SOURCES.txt index 3040b821a37ac52016d400e960061b278eb4eb0a..d8254a89239eb500ebed37f32b258801105bb0fa 100644 --- a/swh.deposit.egg-info/SOURCES.txt +++ b/swh.deposit.egg-info/SOURCES.txt @@ -15,14 +15,27 @@ debian/control debian/copyright debian/rules debian/source/format -doc/specs.md 
resources/deposit/server.yml sql/swh-schema.sql +swh/manage.py swh.deposit.egg-info/PKG-INFO swh.deposit.egg-info/SOURCES.txt swh.deposit.egg-info/dependency_links.txt swh.deposit.egg-info/requires.txt swh.deposit.egg-info/top_level.txt -swh/deposit/backend.py -swh/deposit/server.py -swh/deposit/templates/service_document.xml \ No newline at end of file +swh/deposit/__init__.py +swh/deposit/admin.py +swh/deposit/apps.py +swh/deposit/models.py +swh/deposit/settings.py +swh/deposit/tests.py +swh/deposit/urls.py +swh/deposit/views.py +swh/deposit/wsgi.py +swh/deposit/fixtures/__init__.py +swh/deposit/fixtures/deposit_data.yaml +swh/deposit/migrations/0001_initial.py +swh/deposit/migrations/__init__.py +swh/deposit/templates/__init__.py +swh/deposit/templates/deposit/__init__.py +swh/deposit/templates/deposit/service_document.xml \ No newline at end of file diff --git a/swh.deposit.egg-info/requires.txt b/swh.deposit.egg-info/requires.txt index e5c13e67142350f8e4984840085ea3e22d9b2c01..94e1f402e9cf576031c28a9ea1ddc84a8e6eed5b 100644 --- a/swh.deposit.egg-info/requires.txt +++ b/swh.deposit.egg-info/requires.txt @@ -1,6 +1,5 @@ -aiohttp click -jinja +django retrying swh.core>=0.0.14 vcversioner diff --git a/swh/deposit/__init__.py b/swh/deposit/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/swh/deposit/admin.py b/swh/deposit/admin.py new file mode 100644 index 0000000000000000000000000000000000000000..c77709b9d8bb4c5583faa0071f1ff062ee49a004 --- /dev/null +++ b/swh/deposit/admin.py @@ -0,0 +1,8 @@ +from django.contrib import admin + +from .models import Client, DepositType, Deposit, DepositRequest + +admin.site.register(Client) +admin.site.register(DepositType) +admin.site.register(Deposit) +admin.site.register(DepositRequest) diff --git a/swh/deposit/apps.py b/swh/deposit/apps.py new file mode 100644 index 
0000000000000000000000000000000000000000..a0c0db70fe30f5682ecd0afc1ee5ed4ee6ecb4f5 --- /dev/null +++ b/swh/deposit/apps.py @@ -0,0 +1,5 @@ +from django.apps import AppConfig + + +class DepositConfig(AppConfig): + name = 'swh.deposit' diff --git a/swh/deposit/backend.py b/swh/deposit/backend.py deleted file mode 100644 index 68a1e739f0246961e9d5d18cd4deb6d432ab46fa..0000000000000000000000000000000000000000 --- a/swh/deposit/backend.py +++ /dev/null @@ -1,152 +0,0 @@ -# Copyright (C) 2017 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -from functools import wraps - -import psycopg2 -import psycopg2.extras - - -psycopg2.extensions.register_adapter(dict, psycopg2.extras.Json) - - -def autocommit(fn): - @wraps(fn) - def wrapped(self, *args, **kwargs): - autocommit = False - if 'cursor' not in kwargs or not kwargs['cursor']: - autocommit = True - kwargs['cursor'] = self.cursor() - - try: - ret = fn(self, *args, **kwargs) - except: - if autocommit: - self.rollback() - raise - - if autocommit: - self.commit() - - return ret - - return wrapped - - -class DepositBackend(): - """Backend for the Software Heritage deposit database. 
- - """ - - def __init__(self, dbconn): - self.db = None - self.dbconn = dbconn - self.reconnect() - - def reconnect(self): - if not self.db or self.db.closed: - self.db = psycopg2.connect( - dsn=self.dbconn, - cursor_factory=psycopg2.extras.RealDictCursor, - ) - - def cursor(self): - """Return a fresh cursor on the database, with auto-reconnection in - case of failure - - """ - cur = None - - # Get a fresh cursor and reconnect at most three times - tries = 0 - while True: - tries += 1 - try: - cur = self.db.cursor() - cur.execute('select 1') - break - except psycopg2.OperationalError: - if tries < 3: - self.reconnect() - else: - raise - - return cur - - def commit(self): - """Commit a transaction""" - self.db.commit() - - def rollback(self): - """Rollback a transaction""" - self.db.rollback() - - deposit_keys = [ - 'reception_date', 'complete_date', 'type', 'external_id', - 'status', 'client_id', - ] - - def _format_query(self, query, keys): - """Format a query with the given keys""" - - query_keys = ', '.join(keys) - placeholders = ', '.join(['%s'] * len(keys)) - - return query.format(keys=query_keys, placeholders=placeholders) - - @autocommit - def deposit_add(self, deposit, cursor=None): - """Create a new deposit. 
- - A deposit is a dictionary with the following keys: - type (str): an identifier for the deposit type - reception_date (date): deposit's reception date - complete_date (date): deposit's date when the deposit is - deemed complete - external_id (str): the external identifier in the client's - information system - status (str): deposit status - client_id (integer): client's identifier - - """ - query = self._format_query( - """insert into deposit ({keys}) values ({placeholders})""", - self.deposit_keys, - ) - cursor.execute(query, [deposit[key] for key in self.deposi_keys]) - - @autocommit - def deposit_get(self, id, cursor=None): - """Retrieve the task type with id - - """ - query = self._format_query( - "select {keys} from deposit where type=%s", - self.deposit_keys, - ) - cursor.execute(query, (id,)) - ret = cursor.fetchone() - return ret - - @autocommit - def request_add(self, request, cursor=None): - pass - - @autocommit - def request_get(self, deposit_id, cursor=None): - pass - - @autocommit - def client_list(self, cursor=None): - cursor.execute('select id, name from client') - - return {row['name']: row['id'] for row in cursor.fetchall()} - - @autocommit - def client_get(self, id, cursor=None): - cursor.execute('select id, name, credential from client where id=%s', - (id, )) - - return cursor.fetchone() diff --git a/swh/deposit/fixtures/__init__.py b/swh/deposit/fixtures/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/swh/deposit/fixtures/deposit_data.yaml b/swh/deposit/fixtures/deposit_data.yaml new file mode 100644 index 0000000000000000000000000000000000000000..95ebdae9e40a6ef7e20743b6d0422826cf0de1b4 --- /dev/null +++ b/swh/deposit/fixtures/deposit_data.yaml @@ -0,0 +1,14 @@ +- model: deposit.dbversion + pk: 1 + fields: + release: '2017-07-31 08:10:00.303000+00:00' + description: Work in Progress +- model: deposit.client + pk: 1 + fields: + name: hal + credential: null 
+- model: deposit.deposittype + pk: 1 + fields: + name: hal diff --git a/swh/deposit/migrations/0001_initial.py b/swh/deposit/migrations/0001_initial.py new file mode 100644 index 0000000000000000000000000000000000000000..d75f93f2f589c9ff6a36ee47ce14f5566f52ba6a --- /dev/null +++ b/swh/deposit/migrations/0001_initial.py @@ -0,0 +1,82 @@ +# -*- coding: utf-8 -*- +# Generated by Django 1.10.7 on 2017-07-31 09:43 +from __future__ import unicode_literals + +import django.contrib.postgres.fields.jsonb +from django.db import migrations, models +import django.db.models.deletion +import django.utils.timezone + + +class Migration(migrations.Migration): + + initial = True + + dependencies = [ + ] + + operations = [ + migrations.CreateModel( + name='Client', + fields=[ + ('id', models.BigAutoField(primary_key=True, serialize=False)), + ('name', models.TextField()), + ('credential', models.BinaryField(blank=True, null=True)), + ], + options={ + 'db_table': 'client', + }, + ), + migrations.CreateModel( + name='Dbversion', + fields=[ + ('version', models.IntegerField(primary_key=True, serialize=False)), + ('release', models.DateTimeField(default=django.utils.timezone.now, null=True)), + ('description', models.TextField(blank=True, null=True)), + ], + options={ + 'db_table': 'dbversion', + }, + ), + migrations.CreateModel( + name='Deposit', + fields=[ + ('id', models.BigAutoField(primary_key=True, serialize=False)), + ('reception_date', models.DateTimeField()), + ('complete_date', models.DateTimeField(null=True)), + ('external_id', models.TextField()), + ('client_id', models.BigIntegerField()), + ('swh_id', models.TextField(blank=True, null=True)), + ('status', models.TextField(choices=[('partial', 'partial'), ('expired', 'expired'), ('ready', 'ready'), ('injecting', 'injecting'), ('success', 'success'), ('failure', 'failure')], default='partial')), + ], + options={ + 'db_table': 'deposit', + }, + ), + migrations.CreateModel( + name='DepositRequest', + fields=[ + ('id', 
models.BigAutoField(primary_key=True, serialize=False)), + ('metadata', django.contrib.postgres.fields.jsonb.JSONField(null=True)), + ('deposit', models.ForeignKey(on_delete=django.db.models.deletion.DO_NOTHING, to='deposit.Deposit')), + ], + options={ + 'db_table': 'deposit_request', + }, + ), + migrations.CreateModel( + name='DepositType', + fields=[ + ('id', models.BigAutoField(primary_key=True, serialize=False)), + ('name', models.TextField()), + ], + options={ + 'db_table': 'deposit_type', + }, + ), + migrations.AddField( + model_name='deposit', + name='type', + field=models.ForeignKey(db_column='type', on_delete=django.db.models.deletion.DO_NOTHING, to='deposit.DepositType'), + ), + ] diff --git a/swh/deposit/migrations/__init__.py b/swh/deposit/migrations/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/swh/deposit/models.py b/swh/deposit/models.py new file mode 100644 index 0000000000000000000000000000000000000000..886a805dd9e7291d7c22c92083fc5849814a6fdc --- /dev/null +++ b/swh/deposit/models.py @@ -0,0 +1,129 @@ +# Generated from: +# cd swh_deposit && \ +# python3 -m manage inspectdb + + +from django.contrib.postgres.fields import JSONField +from django.db import models +from django.utils.timezone import now + + +class Dbversion(models.Model): + """Db version + + """ + version = models.IntegerField(primary_key=True) + release = models.DateTimeField(default=now, null=True) + description = models.TextField(blank=True, null=True) + + class Meta: + db_table = 'dbversion' + + def __str__(self): + return str({ + 'version': self.version, + 'release': self.release, + 'description': self.description + }) + + +class Client(models.Model): + """Deposit's client references. + + """ + id = models.BigAutoField(primary_key=True) + # Human readable name for the client e.g hal, arXiv, etc... 
+ name = models.TextField() + credential = models.BinaryField(blank=True, null=True) + + class Meta: + db_table = 'client' + + def __str__(self): + return str({'id': self.id, 'name': self.name}) + + +DEPOSIT_STATUS = [ + ('partial', 'partial'), # the deposit is new or partially received + # since it can be done in multiple requests + ('expired', 'expired'), # deposit has been there too long and is now + # deemed ready to be garbage collected + ('ready', 'ready'), # deposit is fully received and ready for + # injection + ('injecting', 'injecting'), # injection is ongoing on swh's side + ('success', 'success'), # injection successful + ('failure', 'failure'), # injection failure +] + + +class Deposit(models.Model): + """Deposit reception table + + """ + id = models.BigAutoField(primary_key=True) + + # First deposit reception date + reception_date = models.DateTimeField() + # Date when the deposit is deemed complete and ready for injection + complete_date = models.DateTimeField(null=True) + # Deposit reception source type + type = models.ForeignKey( + 'DepositType', models.DO_NOTHING, db_column='type') + # Deposit's unique external identifier + external_id = models.TextField() + # Deposit client + client_id = models.BigIntegerField() + # SWH's injection result identifier + swh_id = models.TextField(blank=True, null=True) + # Deposit's status regarding injection + status = models.TextField( + choices=DEPOSIT_STATUS, + default='partial') + + class Meta: + db_table = 'deposit' + + def __str__(self): + return str({ + 'id': self.id, + 'reception_date': self.reception_date, + 'type': self.type, + 'external_id': self.external_id, + 'client_id': self.client_id, + 'status': self.status + }) + + +class DepositRequest(models.Model): + """Deposit request made by clients + + """ + + id = models.BigAutoField(primary_key=True) + # Deposit concerned by the request + deposit = models.ForeignKey(Deposit, models.DO_NOTHING) + # Deposit request information on the data to inject + metadata
= JSONField(null=True) + + class Meta: + db_table = 'deposit_request' + + def __str__(self): + from json import dumps + return str({ + 'id': self.id, + 'deposit': self.deposit, + 'metadata': dumps(self.metadata), + }) + + +class DepositType(models.Model): + id = models.BigAutoField(primary_key=True) + # Human readable name for the deposit type e.g HAL, arXiv, etc... + name = models.TextField() + + class Meta: + db_table = 'deposit_type' + + def __str__(self): + return str({'id': self.id, 'name': self.name}) diff --git a/swh/deposit/server.py b/swh/deposit/server.py deleted file mode 100644 index c26341619f833f493532018ab1f6e39bb177cf6f..0000000000000000000000000000000000000000 --- a/swh/deposit/server.py +++ /dev/null @@ -1,133 +0,0 @@ -# Copyright (C) 2017 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -import asyncio -import aiohttp.web -import click -import jinja2 -import json - -from swh.core import config -from swh.core.config import SWHConfig -from swh.core.api_async import SWHRemoteAPI -from swh.deposit.backend import DepositBackend - -DEFAULT_CONFIG_PATH = 'deposit/server' -DEFAULT_CONFIG = { - 'host': ('str', '0.0.0.0'), - 'port': ('int', 5006), -} - - -def encode_data(data, template_name=None, **kwargs): - return aiohttp.web.Response( - body=data, - headers={'Content-Type': 'application/xml'}, - **kwargs - ) - - -class DepositWebServer(SWHConfig): - """Base class to define endpoints route. 
- - """ - - CONFIG_BASE_FILENAME = DEFAULT_CONFIG_PATH - - DEFAULT_CONFIG = { - 'max_upload_size': ('int', 209715200), - 'dbconn': ('str', 'dbname=softwareheritage-deposit-dev'), - } - - def __init__(self, config=None): - if config: - self.config = config - else: - self.config = self.parse_config_file() - template_loader = jinja2.FileSystemLoader( - searchpath=["swh/deposit/templates"]) - self.template_env = jinja2.Environment(loader=template_loader) - self.backend = DepositBackend(self.config['dbconn']) - - @asyncio.coroutine - def index(self, request): - return aiohttp.web.Response(text='SWH Deposit Server') - - @asyncio.coroutine - def service_document(self, request): - tpl = self.template_env.get_template('service_document.xml') - output = tpl.render( - noop=True, verbose=False, - max_upload_size=self.config['max_upload_size']) - return encode_data(data=output) - - @asyncio.coroutine - def create_document(self, request): - pass - - @asyncio.coroutine - def update_document(self, request): - pass - - @asyncio.coroutine - def status_operation(self, request): - pass - - @asyncio.coroutine - def delete_document(self, request): - raise ValueError('Not implemented') - - @asyncio.coroutine - def client_get(self, request): - clients = self.backend.client_list() - return aiohttp.web.Response( - body=json.dumps(clients), - headers={'Content-Type': 'application/json'}) - - -def make_app(config, **kwargs): - """Initialize server application. - - Returns: - Application ready for running and serving api endpoints. 
- - """ - app = SWHRemoteAPI(**kwargs) - server = DepositWebServer() - app.router.add_route('GET', '/', server.index) - app.router.add_route('GET', '/api/1/deposit/', server.service_document) - app.router.add_route('GET', '/api/1/status/', server.status_operation) - app.router.add_route('POST', '/api/1/deposit/', server.create_document) - app.router.add_route('PUT', '/api/1/deposit/', server.update_document) - app.router.add_route('DELETE', '/api/1/deposit/', server.delete_document) - app.router.add_route('GET', '/api/1/client/', server.client_get) - app.update(config) - return app - - -def make_app_from_configfile(config_path=DEFAULT_CONFIG_PATH, **kwargs): - """Initialize server application from configuration file. - - Returns: - Application ready for running and serving api endpoints. - - """ - return make_app(config.read(config_path, DEFAULT_CONFIG), **kwargs) - - -@click.command() -@click.argument('config-path', required=1) -@click.option('--host', default='0.0.0.0', help="Host to run the server") -@click.option('--port', default=5006, type=click.INT, - help="Binding port of the server") -@click.option('--debug/--nodebug', default=True, - help="Indicates if the server should run in debug mode") -def launch(config_path, host, port, debug): - app = make_app_from_configfile(config_path, debug=bool(debug)) - aiohttp.web.run_app(app, host=host, port=port) - - -if __name__ == '__main__': - launch() diff --git a/swh/deposit/settings.py b/swh/deposit/settings.py new file mode 100644 index 0000000000000000000000000000000000000000..95b7e660083723a2aaf7098c056802e1c7f392af --- /dev/null +++ b/swh/deposit/settings.py @@ -0,0 +1,119 @@ +""" +Django settings for swh project. + +Generated by 'django-admin startproject' using Django 1.10.7. 
+ +For more information on this file, see +https://docs.djangoproject.com/en/1.10/topics/settings/ + +For the full list of settings and their values, see +https://docs.djangoproject.com/en/1.10/ref/settings/ +""" + +import os + +# Build paths inside the project like this: os.path.join(BASE_DIR, ...) +BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + + +# Quick-start development settings - unsuitable for production +# See https://docs.djangoproject.com/en/1.10/howto/deployment/checklist/ + +# SECURITY WARNING: keep the secret key used in production secret! +SECRET_KEY = 'x^jrrm7f-45y!*cqh_vru^f=3w)#k-uj!chf@5(i1*vc(6!ltw' + +# SECURITY WARNING: don't run with debug turned on in production! +DEBUG = True + +ALLOWED_HOSTS = [] + + +# Application definition + +INSTALLED_APPS = [ + 'swh.deposit.apps.DepositConfig', + 'django.contrib.auth', + 'django.contrib.contenttypes', + 'django.contrib.staticfiles', + 'django.contrib.postgres', # for JSONField +] + +MIDDLEWARE = [ + 'django.middleware.security.SecurityMiddleware', + 'django.contrib.sessions.middleware.SessionMiddleware', + 'django.middleware.common.CommonMiddleware', + 'django.middleware.csrf.CsrfViewMiddleware', + 'django.contrib.auth.middleware.AuthenticationMiddleware', + 'django.contrib.messages.middleware.MessageMiddleware', + 'django.middleware.clickjacking.XFrameOptionsMiddleware', +] + +ROOT_URLCONF = 'swh.deposit.urls' + +TEMPLATES = [ + { + 'BACKEND': 'django.template.backends.django.DjangoTemplates', + 'DIRS': [], + 'APP_DIRS': True, + 'OPTIONS': { + 'context_processors': [ + 'django.template.context_processors.debug', + 'django.template.context_processors.request', + 'django.contrib.auth.context_processors.auth', + 'django.contrib.messages.context_processors.messages', + ], + }, + }, +] + +WSGI_APPLICATION = 'swh.deposit.wsgi.application' + + +# Database +# https://docs.djangoproject.com/en/1.10/ref/settings/#databases + +DATABASES = { + 'default': { + 'ENGINE': 
'django.db.backends.postgresql', + 'NAME': 'swh-deposit-dev', + } +} + + +# Password validation +# https://docs.djangoproject.com/en/1.10/ref/settings/#auth-password-validators + +AUTH_PASSWORD_VALIDATORS = [ + { + 'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator', + }, + { + 'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator', + }, + { + 'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator', + }, + { + 'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator', + }, +] + + +# Internationalization +# https://docs.djangoproject.com/en/1.10/topics/i18n/ + +LANGUAGE_CODE = 'en-us' + +TIME_ZONE = 'UTC' + +USE_I18N = True + +USE_L10N = True + +USE_TZ = True + + +# Static files (CSS, JavaScript, Images) +# https://docs.djangoproject.com/en/1.10/howto/static-files/ + +STATIC_URL = '/static/' diff --git a/swh/deposit/templates/__init__.py b/swh/deposit/templates/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/swh/deposit/templates/deposit/__init__.py b/swh/deposit/templates/deposit/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/swh/deposit/templates/service_document.xml b/swh/deposit/templates/deposit/service_document.xml similarity index 100% rename from swh/deposit/templates/service_document.xml rename to swh/deposit/templates/deposit/service_document.xml diff --git a/swh/deposit/tests.py b/swh/deposit/tests.py new file mode 100644 index 0000000000000000000000000000000000000000..a79ca8be565f44aacce95bad20c1ee34d175ed20 --- /dev/null +++ b/swh/deposit/tests.py @@ -0,0 +1,3 @@ +# from django.test import TestCase + +# Create your tests here. 
diff --git a/swh/deposit/urls.py b/swh/deposit/urls.py new file mode 100644 index 0000000000000000000000000000000000000000..71f5df9d29249d45161e0f1febbdc72d711aab01 --- /dev/null +++ b/swh/deposit/urls.py @@ -0,0 +1,27 @@ +"""swh URL Configuration + +The `urlpatterns` list routes URLs to views. For more information please see: + https://docs.djangoproject.com/en/1.10/topics/http/urls/ +Examples: +Function views + 1. Add an import: from my_app import views + 2. Add a URL to urlpatterns: url(r'^$', views.home, name='home') +Class-based views + 1. Add an import: from other_app.views import Home + 2. Add a URL to urlpatterns: url(r'^$', Home.as_view(), name='home') +Including another URLconf + 1. Import the include() function: from django.conf.urls import url, include + 2. Add a URL to urlpatterns: url(r'^blog/', include('blog.urls')) +""" +from django.conf.urls import url +from django.contrib import admin + +from swh.deposit.views import index, clients, client, SWHDepositAPI + +urlpatterns = [ + url(r'^admin', admin.site.urls), + url(r'^deposit[/]+$', index), + url(r'^deposit/clients[/]+$', clients), + url(r'^deposit/client/(?P<client_id>[0-9]+)', client), + url(r'^deposit/sd', SWHDepositAPI().service_document) +] diff --git a/swh/deposit/views.py b/swh/deposit/views.py new file mode 100644 index 0000000000000000000000000000000000000000..6a283b8e797f50a6bfbdd935dce104432622d83b --- /dev/null +++ b/swh/deposit/views.py @@ -0,0 +1,50 @@ +from django.http import HttpResponse +from django.shortcuts import render, get_object_or_404 + +from swh.core.config import SWHConfig + +from .models import Client + + +def index(request): + return HttpResponse('SWH Deposit API - WIP') + + +def clients(request): + """List existing clients. + + """ + cs = Client.objects.all() + + return HttpResponse('Clients: %s' % ','.join((str(c) for c in cs))) + + +def client(request, client_id): + """List information about one client. 
+ + """ + c = get_object_or_404(Client, pk=client_id) + return HttpResponse('Client {id: %s, name: %s}' % (c.id, c.name)) + + +class SWHDepositAPI(SWHConfig): + CONFIG_BASE_FILENAME = 'deposit/server' + + DEFAULT_CONFIG = { + 'max_upload_size': ('int', 209715200), + 'verbose': ('bool', False), + 'noop': ('bool', False), + } + + def __init__(self, **config): + self.config = self.parse_config_file() + self.config.update(config) + + def service_document(self, request): + context = { + 'max_upload_size': self.config['max_upload_size'], + 'verbose': self.config['verbose'], + 'noop': self.config['noop'], + } + return render(request, 'deposit/service_document.xml', + context, content_type='application/xml') diff --git a/swh/deposit/wsgi.py b/swh/deposit/wsgi.py new file mode 100644 index 0000000000000000000000000000000000000000..6f61e8a66a75840f145a445de17ffb70cba110bf --- /dev/null +++ b/swh/deposit/wsgi.py @@ -0,0 +1,18 @@ +""" +WSGI config for swh project. + +It exposes the WSGI callable as a module-level variable named ``application``. + +For more information on this file, see +https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/ +""" + +import os +import sys + +from django.core.wsgi import get_wsgi_application + +sys.path.append('/etc/softwareheritage') +os.environ.setdefault("DJANGO_SETTINGS_MODULE", "deposit.settings") + +application = get_wsgi_application() diff --git a/swh/manage.py b/swh/manage.py new file mode 100755 index 0000000000000000000000000000000000000000..8a402f0008750bb57716664e40b030ac35390d11 --- /dev/null +++ b/swh/manage.py @@ -0,0 +1,22 @@ +#!/usr/bin/env python3 +import os +import sys + +if __name__ == "__main__": + os.environ.setdefault("DJANGO_SETTINGS_MODULE", "swh.deposit.settings") + try: + from django.core.management import execute_from_command_line + except ImportError: + # The above import may fail for some other reason. Ensure that the + # issue is really that Django is missing to avoid masking other + # exceptions on Python 2. 
+ try: + import django + except ImportError: + raise ImportError( + "Couldn't import Django. Are you sure it's installed and " + "available on your PYTHONPATH environment variable? Did you " + "forget to activate a virtual environment?" + ) + raise + execute_from_command_line(sys.argv) diff --git a/version.txt b/version.txt index e73fbdbb8322513f0fbba163d00cd9f83283a5c1..7d8209ac8cd53dfe43773b85528e94b7a48a70db 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.2-0-gb8a3e9c \ No newline at end of file +v0.0.3-0-g1fca331 \ No newline at end of file