Skip to content

server: Use xml.etree.ElementTree instead of nested dicts internally

This commit does not touch the external API though; ie. metadata_dict is still present in the JSON API, and the equivalent jsonb field remains in the database. They will probably be removed in a future commit because they are not very useful, though.

Rationale:

I find xmltodict's approach of translating XML tree to native structures to be intrinsically flawed for non-trivial handling of XML, because the data structure is:

  • implementation-defined (by xmltodict, which is python-only) and it may change across versions
  • does not intrinsically store namespaces, and relies on an internal prefix map (though it isn't much of an issue right now, as we do not need composability and all the changed APIs are private)
  • not stable; for example, <a><b>foo</b></a> and <a><b>foo</b><b>bar</b></a> are encoded completely differently (the former is a Dict[str, str], the latter is Dict[str, list].

And every operation manipulating this data structure needs to check presence, number and type on every access. Consider this part of this commit for example:

-    swh_deposit = metadata.get("swh:deposit")
-    if not swh_deposit:
-        return None
-
-    swh_reference = swh_deposit.get("swh:reference")
-    if not swh_reference:
-        return None
-
-    swh_origin = swh_reference.get("swh:origin")
-    if swh_origin:
-        url = swh_origin.get("@url")
-        if url:
-            return url
+    ref_origin = metadata.find(
+        "swh:deposit/swh:reference/swh:origin[@url]", namespaces=NAMESPACES
+    )
+    if ref_origin is not None:
+        return ref_origin.attrib["url"]

the use of XPath makes it considerably shorter; and the original version did not even check number/type (ie. it would crash if an element was duplicated).


Migrated from D7215 (view on Phabricator)

Merge request reports