Support graph export for the Cassandra backend

We want to support unsupervised export of the archive content (as captured by swh-storage) to its graph structure (as required as input by swh-graph). This is currently not easy to do with the Postgres backend (due to the huge join imposed by the directory entry layer), but it should be doable with the Cassandra backend.

The required export output is a pair of compressed textual files:

swh.nodes.csv.zst: one node per line, either a Merkle DAG node represented as a SWH PID or an origin (also represented as a SWH PID, using the "ori" qualifier); this file should be sorted alphabetically
swh.edges.csv.zst: one edge per line, as a pair of Merkle DAG nodes (or origins) represented as SWH PIDs and separated by a space. The first element of the pair is the edge's "from" node, the second is its "to" node.
bonus point: also export swh.{nodes,edges}.count files, containing the total number of nodes and edges, respectively

For Merkle DAG nodes, the edges match the Merkle structure; for origin nodes, outgoing edges point to the known snapshots of the given origin.
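
To make the expected output concrete, here is a minimal sketch in Python of how the two files and their counts relate to each other. It is not taken from swh-storage or swh-graph: the function name write_export, its parameters, and the identifiers in the example are all hypothetical. It keeps every node in memory, which will not scale to the full archive (a real export would need an external sort), and it writes plain text, leaving zstd compression to a later step.

```python
import io
from typing import Iterable, TextIO, Tuple

Edge = Tuple[str, str]  # (PID of the "from" node, PID of the "to" node)


def write_export(
    edges: Iterable[Edge],
    isolated_nodes: Iterable[str],
    nodes_out: TextIO,
    edges_out: TextIO,
) -> Tuple[int, int]:
    """Write the edges and sorted nodes files; return (node count, edge count).

    Hypothetical helper: `isolated_nodes` holds nodes without any edge,
    which would otherwise be missing from the nodes file.
    """
    nodes = set(isolated_nodes)
    edge_count = 0
    for src, dst in edges:
        # One edge per line: "from" PID, a space, "to" PID.
        edges_out.write(f"{src} {dst}\n")
        nodes.add(src)
        nodes.add(dst)
        edge_count += 1
    # One node per line, sorted alphabetically, each node listed once.
    for pid in sorted(nodes):
        nodes_out.write(pid + "\n")
    return len(nodes), edge_count


# Made-up identifiers, for illustration only.
ORI = "swh:1:ori:" + "a" * 40
SNP = "swh:1:snp:" + "b" * 40
REV = "swh:1:rev:" + "c" * 40
DIR = "swh:1:dir:" + "d" * 40

nodes_buf, edges_buf = io.StringIO(), io.StringIO()
node_count, edge_count = write_export(
    edges=[(ORI, SNP), (SNP, REV), (REV, DIR)],  # origin -> snapshot -> ...
    isolated_nodes=[],
    nodes_out=nodes_buf,
    edges_out=edges_buf,
)
```

The two returned counts are what the swh.{nodes,edges}.count files would contain. The origin appears both as a line in the nodes file and as the "from" side of an edge pointing to its snapshot.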

Examples

The most recent nodes/edges export can be found here: https://annex.softwareheritage.org/public/dataset/graph/latest/edges/ (files all.{nodes,edges}.{csv.gz,count})
SQL queries to export from Postgres to the above format can be found in snippets (warning: they do not work on the full Postgres DB, so don't try that; they are also incomplete and do not export some edges, as noted in the comments in the SQL)

Migrated from T2053 (view on Phabricator)

Edited Jan 08, 2023 by Phabricator Migration user
