From 8a44a63f8f1bb95d2589e0d1c37318ee3edcf249 Mon Sep 17 00:00:00 2001
From: Antoine Pietri <antoine.pietri1@gmail.com>
Date: Thu, 28 Apr 2022 16:19:46 +0200
Subject: [PATCH] docs: document graph dataset export

---
 docs/export.rst | 56 +++++++++++++++++++++++++++++++++++++++++++++++++
 docs/index.rst  |  1 +
 2 files changed, 57 insertions(+)
 create mode 100644 docs/export.rst

diff --git a/docs/export.rst b/docs/export.rst
new file mode 100644
index 0000000..56a17d3
--- /dev/null
+++ b/docs/export.rst
@@ -0,0 +1,56 @@
+===================
+Exporting a dataset
+===================
+
+This repository is intended to host pipelines that generate datasets of
+Software Heritage data, so that they can be used internally or by external
+researchers.
+
+Graph dataset
+=============
+
+Right now, the only supported export pipeline is the *Graph Dataset*, a set of
+relational tables representing the Software Heritage Graph, as documented in
+:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
+command.
+
+This dataset can be exported in two different formats: ``orc`` and ``edges``.
+To export a graph, you need to provide a comma-separated list of formats to
+export with the ``--formats`` option. You also need to pass an export ID with
+``--export-id``: a unique identifier used by the Kafka server to store the
+current progress of the export.
+
+Here is an example command to start a graph dataset export::
+
+    swh dataset -C graph_export_config.yml graph export \
+        --formats orc \
+        --export-id seirl-2022-04-25 \
+        -p 64 \
+        /srv/softwareheritage/hdd/graph/2022-04-25
+
+A full export usually takes more than a week, so it is advisable to run this
+command in a service or a tmux session.
+
+The configuration file should contain the configuration for the swh-journal
+clients, as well as various configuration options for the exporters. Here is an
+example configuration file::
+
+    journal:
+        brokers:
+            - kafka1.internal.softwareheritage.org:9094
+            - kafka2.internal.softwareheritage.org:9094
+            - kafka3.internal.softwareheritage.org:9094
+            - kafka4.internal.softwareheritage.org:9094
+        security.protocol: SASL_SSL
+        sasl.mechanisms: SCRAM-SHA-512
+        max.poll.interval.ms: 1000000
+
+    remove_pull_requests: true
+
+
+The following configuration options can be used for the export:
+
+- ``remove_pull_requests``: remove all edges from origin to snapshot matching
+  ``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
+  all the pull requests that are present in Software Heritage (archived with
+  ``git clone --mirror``).
diff --git a/docs/index.rst b/docs/index.rst
index 5703747..dbc4026 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -12,3 +12,4 @@
    :titlesonly:
 
    graph/index
+   export
-- 
GitLab