Commits · 1fe5bf19136068f5df999601cf90cc4e3746a05a · Antoine R. Dumont / swh-deposit

Mar 08, 2022
- Warn about missing <swh:metadata-provenance> even on unrelated errors · 1fe5bf19
  vlorentz authored 3 years ago
  
  This behavior was (accidentally) removed in 74d5567b.
Feb 28, 2022
- Move check_url_match_provider to api.checks instead of utils · dc1ea97a
  Nicolas Dandrimont authored 3 years ago
  
  This function is only used by server-side API checks. Having it defined in the main utils module makes the deposit client transitively depend on Django (via swh.deposit.errors), which does not seem necessary.
- Use xmlschema to validate dates, instead of custom code. · 74d5567b
  vlorentz authored 3 years ago
  
  For now this increases code complexity, but it will make it easier to add other checks.
- Add schema validation of <swh:deposit> using swh.xsd · e0879560
  vlorentz authored 3 years ago
  
Feb 22, 2022

server: Use xml.etree.ElementTree instead of nested dicts internally · 55ae87b1

vlorentz authored 3 years ago

This commit does not touch the external API, though; i.e. `metadata_dict`
is still present in the JSON API, and the equivalent `jsonb` field remains
in the database. They will probably be removed in a future commit
because they are not very useful.

Rationale:

I find xmltodict's approach of translating an XML tree to native structures
to be intrinsically flawed for non-trivial handling of XML, because the
data structure:

* is implementation-defined (by xmltodict, which is Python-only) and may
  change across versions
* does not intrinsically store namespaces, and relies on an internal
  prefix map (though it isn't much of an issue right now, as we do not need
  composability and all the changed APIs are private)
* is not stable; for example, `<a><b>foo</b></a>` and `<a><b>foo</b><b>bar</b></a>`
  are encoded completely differently (the former is a `Dict[str, str]`,
  the latter is a `Dict[str, list]`).

And every operation manipulating this data structure needs to check
presence, number *and* type on every access. Consider this part of this
commit for example:

```
-    swh_deposit = metadata.get("swh:deposit")
-    if not swh_deposit:
-        return None
-
-    swh_reference = swh_deposit.get("swh:reference")
-    if not swh_reference:
-        return None
-
-    swh_origin = swh_reference.get("swh:origin")
-    if swh_origin:
-        url = swh_origin.get("@url")
-        if url:
-            return url
+    ref_origin = metadata.find(
+        "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES
+    )
+    if ref_origin is not None:
+        return ref_origin.attrib["url"]
```

The use of XPath makes it considerably shorter, and the original version
did not even check number/type (i.e. it would crash if an element was
duplicated).
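
As a rough, self-contained sketch of the XPath-style lookup shown in the diff above: the function name `get_reference_origin_url`, the sample document, and the namespace URIs in `NAMESPACES` are assumptions made for illustration, not taken from the commit.

```python
import xml.etree.ElementTree as ET

# Assumed prefix map; the URIs are illustrative and not copied from the commit.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "swh": "https://www.softwareheritage.org/schema/2018/deposit",
}


def get_reference_origin_url(metadata):
    """Return the url attribute of swh:reference/swh:origin, or None (hypothetical helper)."""
    ref_origin = metadata.find(
        "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES
    )
    if ref_origin is not None:
        return ref_origin.attrib["url"]
    return None


ENTRY = """<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">
  <swh:deposit>
    <swh:reference>
      <swh:origin url="https://example.org/repo"/>
    </swh:reference>
  </swh:deposit>
</atom:entry>"""

url = get_reference_origin_url(ET.fromstring(ENTRY))

# findall() returns a list whether the element occurs once or twice,
# so callers do not need to branch on the shape of the result.
one = ET.fromstring("<a><b>foo</b></a>").findall("b")
two = ET.fromstring("<a><b>foo</b><b>bar</b></a>").findall("b")
```

Because `find()` returns `None` for any missing step of the path, the single `is not None` test stands in for the chain of `.get()` checks in the removed code.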


Feb 21, 2022

api.checks: Warn when suggested fields are missing from metadata · 339f7dd3

Antoine R. Dumont authored 3 years ago

This introduces a new check about the metadata provenance. While it's a suggested field,
it's definitely something that we want deposit clients to send us, so warn when it is
missing. The deposit is not rejected, but the warning is worth keeping in the
backend.

Related to T3677
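
A minimal sketch of what a "warn but do not reject" check could look like, assuming the metadata is available as an ElementTree element. The function name `check_suggested_fields`, the warning text, the location of the element under `swh:deposit`, and the namespace URI are all assumptions for illustration; the actual check in `api.checks` may be structured differently.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI, for illustration only.
NAMESPACES = {"swh": "https://www.softwareheritage.org/schema/2018/deposit"}


def check_suggested_fields(metadata):
    """Return a list of warnings; a missing suggested field never rejects (hypothetical)."""
    warnings = []
    provenance = metadata.find(
        "swh:deposit/swh:metadata-provenance", namespaces=NAMESPACES
    )
    if provenance is None:
        warnings.append("missing suggested field: swh:metadata-provenance")
    return warnings


WITH_PROVENANCE = ET.fromstring(
    '<entry xmlns:swh="https://www.softwareheritage.org/schema/2018/deposit">'
    "<swh:deposit><swh:metadata-provenance/></swh:deposit></entry>"
)
WITHOUT_PROVENANCE = ET.fromstring("<entry/>")

ok_warnings = check_suggested_fields(WITH_PROVENANCE)
missing_warnings = check_suggested_fields(WITHOUT_PROVENANCE)
```

Returning warnings as data, rather than raising, lets the caller store them alongside an otherwise accepted deposit.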

Verified


Dec 21, 2020

Catch invalid dates before marking a deposit as verified. · 24ec1889

vlorentz authored 4 years ago

Otherwise, querying /1/private/<deposit_id>/meta/ will crash because it fails
to parse the date.

Resolves T2906.
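
The idea of rejecting unparseable dates up front can be sketched as follows. The commit does not say which parser is used, so this sketch swaps in the standard library's `date.fromisoformat`; the helper name `invalid_dates` is hypothetical.

```python
from datetime import date


def invalid_dates(values):
    """Return the subset of values that cannot be parsed as ISO dates (hypothetical helper)."""
    bad = []
    for value in values:
        try:
            date.fromisoformat(value)
        except ValueError:
            # Collect the failure instead of letting a later read crash on it.
            bad.append(value)
    return bad


rejected = invalid_dates(["2020-12-21", "2020-13-45", "not a date"])
```

Running such a check before marking the deposit as verified means the failure surfaces at validation time rather than when the metadata is read back.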


Dec 10, 2020

Use string equality instead of substring search to check for mandatory fields. · c436adcf
vlorentz authored 4 years ago

E.g. 'atom:authorblahblah' should not be accepted when we expect 'atom:author'.

v0.7.1
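
To illustrate the difference between the two matching strategies, here is a small sketch; both function names are hypothetical and the real check operates on deposit metadata rather than a plain list of tag names.

```python
def has_field_substring(tags, expected):
    """Old behaviour (sketch): accept any tag that merely contains the expected name."""
    return any(expected in tag for tag in tags)


def has_field_exact(tags, expected):
    """New behaviour (sketch): accept only a tag equal to the expected name."""
    return any(tag == expected for tag in tags)


tags = ["atom:authorblahblah", "codemeta:name"]

substring_hit = has_field_substring(tags, "atom:author")  # matches by accident
exact_hit = has_field_exact(tags, "atom:author")  # correctly finds nothing
```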


Accept <codemeta:name> and <codemeta:author> as alternatives to... · 00795b41

vlorentz authored 4 years ago

Accept <codemeta:name> and <codemeta:author> as alternatives to <atom:name>/<atom:title> and <atom:author>.

This was broken by a8e86a92: check_metadata() checks whether there is
a tag that *contains* the expected name, so checking for 'author'
(the old name of 'atom:author') accidentally matched 'codemeta:author'
as well.

This resulted in the right behavior in the majority of cases
(accepting 'codemeta:author'), but for the wrong reason, and explicitly
renaming to atom:author broke this.

Ditto for name.

A future commit will drop the substring matching to eliminate
false positives (e.g. 'atom:authorblahblah' should not be accepted
as 'atom:author').


Nov 20, 2020

Explicitly use the atom: prefix internally. · a8e86a92

vlorentz authored 4 years ago

This mostly does not change the protocol used (except in the error messages),
it's just an internal change for the server and for the client.

The only change in the protocol is that unprefixed tags (e.g. `<entry>...</entry>`)
are no longer assumed to be in the Atom namespace (like
`<entry xmlns="http://www.w3.org/2005/Atom">...</entry>`), but they
never should have been in the first place.

Default namespaces / unprefixed tags are a footgun because it's too
easy to add tags in the default namespace without noticing,
or use the wrong namespace.
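
A short sketch of why unprefixed tags are easy to get wrong, using only the standard library. The two sample documents are made up for illustration; they look almost identical but resolve to different element names.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
NS = {"atom": ATOM}

# No namespace declared: <entry> and <title> are in no namespace at all.
bare = ET.fromstring("<entry><title>x</title></entry>")

# A default namespace silently puts every unprefixed tag in Atom.
defaulted = ET.fromstring(f'<entry xmlns="{ATOM}"><title>x</title></entry>')

bare_tag = bare.tag  # "entry"
defaulted_tag = defaulted.tag  # "{http://www.w3.org/2005/Atom}entry"

# A lookup with an explicit prefix only matches the namespaced document.
bare_title = bare.find("atom:title", NS)
defaulted_title = defaulted.find("atom:title", NS)
```

Requiring an explicit `atom:` prefix makes the intended namespace visible on every tag, so a tag placed in the wrong namespace is easier to spot.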


Sep 28, 2020
- deposit.api.checks: Add docstring to module · a2830a9b
  Antoine R. Dumont authored 4 years ago
  
  Verified
  
- Add functional metadata checks prior to updating them · 92f99745
  Antoine R. Dumont authored 4 years ago
  
  This refactors the common code executed by the checker so that we functionally check everything the same way.
  Verified
  