Commit db331f27 authored by Antoine Pietri

docs: Document how to export subdatasets and document/publish datasets

Reviewers: #reviewers, olasd

Reviewed By: #reviewers, olasd

Subscribers: olasd

Differential Revision: https://forge.softwareheritage.org/D7711
parent 8a44a63f
.. _swh-graph-export:
===================
Exporting a dataset
===================
@@ -9,6 +11,9 @@ researchers.
Graph dataset
=============
Exporting the full dataset
--------------------------
Right now, the only supported export pipeline is the *Graph Dataset*, a set of
relational tables representing the Software Heritage Graph, as documented in
:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
@@ -20,11 +25,14 @@ export with the ``--formats`` option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.
**Note**: exporting in the ``edges`` format is discouraged, as it is
redundant and can easily be generated directly from the ORC format.
Here is an example command to start a graph dataset export::

    swh dataset -C graph_export_config.yml graph export \
        --formats orc \
        --export-id 2022-04-25 \
        -p 64 \
        /srv/softwareheritage/hdd/graph/2022-04-25
@@ -54,3 +62,37 @@ The following configuration options can be used for the export:
``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
all the pull requests that are present in Software Heritage (archived with
``git clone --mirror``).
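
For reference, here is a minimal sketch of what ``graph_export_config.yml``
could look like, assuming a standard ``swh.journal`` client configuration;
the broker address and the name of the pull-request filtering option are
assumptions, to be checked against the configuration reference::

    journal:
        brokers:
            - kafka1.internal.softwareheritage.org:9092
    # Hypothetical key for the refs/* filtering described above:
    remove_pull_requests: true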
Uploading to S3 & to the annex
------------------------------
The dataset should then be made publicly available by uploading it to S3 and
to the public annex.
For S3::

    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
For the annex::

    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
    ssh saam.internal.softwareheritage.org
    cd /srv/softwareheritage/annex/public/dataset/graph
    git annex add 2022-04-25
    git annex sync --content
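
Before publishing, it may be worth checking that the upload is complete, for
instance by comparing file counts between the local copy and S3 (a quick
sanity check, not part of the official procedure)::

    find /srv/softwareheritage/hdd/graph/2022-04-25/orc -type f | wc -l
    aws s3 ls --recursive s3://softwareheritage/graph/2022-04-25/orc | wc -l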
Documenting the new dataset
---------------------------
In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
to document the availability of the new dataset. You should usually mention
the following (a template sketch is given after this list):
- the name of the dataset version (e.g., 2022-04-25)
- the number of nodes
- the number of edges
- the available formats (notably whether the graph is also available in its
  compressed representation)
- the total on-disk size of the dataset
- the buckets/URIs to obtain the graph from S3 and from the annex
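
As an illustration, a new entry in ``docs/graph/dataset.rst`` could follow
this template (all figures are placeholders to be filled in, and the annex
URL is an assumption; mirror the wording of the existing entries)::

    2022-04-25
        A full export of the graph dated 2022-04-25, available in ORC format.
        <N> nodes, <M> edges, <size> TiB on disk.

        * S3: ``s3://softwareheritage/graph/2022-04-25/``
        * Annex: ``https://annex.softwareheritage.org/public/dataset/graph/2022-04-25/``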
.. _swh-graph-export-subdataset:
======================
Exporting a subdataset
======================
.. highlight:: bash
Because the entire graph is often too big to be practical for many research use
cases, notably for prototyping, it is generally useful to publish "subdatasets"
which only contain a subset of the entire graph.
An example of a very useful subdataset is the graph containing only the top
1000 most popular GitHub repositories (sorted by number of stars).
This page details the various steps required to export a graph subdataset using
swh-graph and Amazon Athena.
Step 1. Obtain the list of origins
----------------------------------
You first need to obtain a list of origins that you want to include in the
subdataset. Depending on the type of subdataset you want to create, this can be
done in various ways, either manual or automated. The following is an example
of how to get the list of the 1000 most popular GitHub repositories in the
Python language, sorted by number of stars::

    for i in $( seq 1 10 ); do \
        curl -G https://api.github.com/search/repositories \
            -d "page=$i" \
            -d "sort=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
            jq --raw-output '.items[].html_url'; \
        sleep 6; \
    done > origins.txt
Step 2. Build the list of SWHIDs
--------------------------------
To generate a subdataset from an existing dataset, you need to build the
list of all the SWHIDs to include in the subdataset. The best way to achieve
that is to perform a full visit of the compressed graph starting from the
origin nodes, and to collect all the SWHIDs reachable from these origins.
Unfortunately, there is currently no endpoint in the HTTP API to start a
traversal from multiple nodes. The current best way to achieve this is
therefore to visit the graph starting from each origin, one by one, and then to
merge all the resulting lists of SWHIDs into a single sorted list of unique
SWHIDs.
If you use the internal graph API, you might need to convert the origin URLs
to the Extended SWHID format (``swh:1:ori:<sha1(url)>``) to query the API.
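
A minimal sketch of this step, assuming a local swh-graph HTTP server
listening on port 5009 (the endpoint path and port are assumptions; adapt
them to your setup)::

    while read -r url; do
        # Origin SWHIDs are derived from the sha1 of the origin URL.
        ori="swh:1:ori:$(printf '%s' "$url" | sha1sum | cut -d ' ' -f 1)"
        # Full visit starting from this origin, returning reachable SWHIDs.
        curl -s "http://localhost:5009/graph/visit/nodes/$ori"
    done < origins.txt | sort -u > swhids.csv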
Step 3. Generate the subdataset on Athena
-----------------------------------------
Once you have obtained a text file containing all the SWHIDs to be included in
the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs
with the tables of an existing dataset, and write the output as a new ORC
dataset.
First, make sure that your base dataset containing the entire graph is
available as a database on AWS Athena, which can be set up by
following the steps described in :ref:`swh-graph-athena`.
The subdataset can then be generated with the ``swh dataset athena
gensubdataset`` command::

    swh dataset athena gensubdataset \
        --swhids swhids.csv \
        --database swh_20210323 \
        --subdataset-database swh_20210323_popular3kpython \
        --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/
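
Once the command completes, a quick sanity check is to list the ORC files of
the new dataset::

    aws s3 ls --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/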
Step 4. Upload and document the newly generated subdataset
----------------------------------------------------------
Once the previous step has completed, a new dataset will be available at the
S3 path given as the ``--subdataset-location`` parameter.
You can upload, publish and document this new subdataset by following the
procedure described in :ref:`swh-graph-export`.
.. _swh-graph-athena:
Setup on Amazon Athena
======================
@@ -13,3 +13,4 @@
graph/index
export
generate_subdataset