Commit db331f27 authored by Antoine Pietri

docs: Document how to export subdatasets and document/publish datasets

Reviewers: #reviewers, olasd

Reviewed By: #reviewers, olasd

Subscribers: olasd

Differential Revision: https://forge.softwareheritage.org/D7711
parent 8a44a63f
.. _swh-graph-export:
===================
Exporting a dataset
===================
@@ -9,6 +11,9 @@ researchers.
Graph dataset
=============
Exporting the full dataset
--------------------------
Right now, the only supported export pipeline is the *Graph Dataset*, a set of
relational tables representing the Software Heritage Graph, as documented in
:ref:`swh-graph-dataset`. It can be run using the ``swh dataset graph export``
@@ -20,11 +25,14 @@ export with the ``--formats`` option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.
**Note**: exporting in the ``edges`` format is discouraged, as it is
redundant and can easily be generated directly from the ORC format.
Here is an example command to start a graph dataset export::

    swh dataset -C graph_export_config.yml graph export \
        --formats orc \
        --export-id 2022-04-25 \
        -p 64 \
        /srv/softwareheritage/hdd/graph/2022-04-25
@@ -54,3 +62,37 @@ The following configuration options can be used for the export:
``refs/*`` but not matching ``refs/heads/*`` or ``refs/tags/*``. This removes
all the pull requests that are present in Software Heritage (archived with
``git clone --mirror``).
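
For reference, here is a minimal sketch of what ``graph_export_config.yml``
could look like, assuming a standard ``swh.journal`` client configuration;
the broker address and the name of the pull-request filtering option are
assumptions, to be checked against the configuration reference::

    journal:
        brokers:
            - kafka1.internal.softwareheritage.org:9092
    # Hypothetical key for the refs/* filtering described above:
    remove_pull_requests: true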
Uploading to S3 & to the annex
------------------------------
The dataset should then be made publicly available by uploading it to S3 and
to the public annex.
For S3::

    aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
For the annex::

    scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
    ssh saam.internal.softwareheritage.org
    cd /srv/softwareheritage/annex/public/dataset/graph
    git annex add 2022-04-25
    git annex sync --content
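
Before publishing, it may be worth checking that the upload is complete, for
instance by comparing file counts between the local copy and S3 (a quick
sanity check, not part of the official procedure)::

    find /srv/softwareheritage/hdd/graph/2022-04-25/orc -type f | wc -l
    aws s3 ls --recursive s3://softwareheritage/graph/2022-04-25/orc | wc -l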
Documenting the new dataset
---------------------------
In the ``swh-dataset`` repository, edit the file ``docs/graph/dataset.rst``
to document the availability of the new dataset. You should usually mention
the following (a template sketch is given after this list):
- the name of the dataset version (e.g., 2022-04-25)
- the number of nodes
- the number of edges
- the available formats (notably whether the graph is also available in its
  compressed representation)
- the total on-disk size of the dataset
- the buckets/URIs to obtain the graph from S3 and from the annex
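
As an illustration, a new entry in ``docs/graph/dataset.rst`` could follow
this template (all figures are placeholders to be filled in, and the annex
URL is an assumption; mirror the wording of the existing entries)::

    2022-04-25
        A full export of the graph dated 2022-04-25, available in ORC format.
        <N> nodes, <M> edges, <size> TiB on disk.

        * S3: ``s3://softwareheritage/graph/2022-04-25/``
        * Annex: ``https://annex.softwareheritage.org/public/dataset/graph/2022-04-25/``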
.. _swh-graph-export-subdataset:
======================
Exporting a subdataset
======================
.. highlight:: bash
Because the entire graph is often too big to be practical for many research use
cases, notably for prototyping, it is generally useful to publish "subdatasets"
which only contain a subset of the entire graph.
An example of a very useful subdataset is the graph containing only the top
1000 most popular GitHub repositories (sorted by number of stars).
This page details the various steps required to export a graph subdataset using
swh-graph and Amazon Athena.
Step 1. Obtain the list of origins
----------------------------------
You first need to obtain a list of origins that you want to include in the
subdataset. Depending on the type of subdataset you want to create, this can be
done in various ways, either manual or automated. The following is an example
of how to get the list of the 1000 most popular GitHub repositories in the
Python language, sorted by number of stars::

    for i in $( seq 1 10 ); do \
        curl -G https://api.github.com/search/repositories \
            -d "page=$i" \
            -d "sort=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
            jq --raw-output '.items[].html_url'; \
        sleep 6; \
    done > origins.txt
Step 2. Build the list of SWHIDs
--------------------------------
To generate a subdataset from an existing dataset, you need to build the
list of all the SWHIDs to include in the subdataset. The best way to achieve
that is to perform a full visit of the compressed graph starting from the
origin nodes, and to collect all the SWHIDs reachable from these origins.
Unfortunately, there is currently no endpoint in the HTTP API to start a
traversal from multiple nodes. The current best way to achieve this is
therefore to visit the graph starting from each origin, one by one, and then to
merge all the resulting lists of SWHIDs into a single sorted list of unique
SWHIDs.
If you use the internal graph API, you might need to convert the origin URLs
to the Extended SWHID format (``swh:1:ori:<sha1(url)>``) to query the API.
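
A minimal sketch of this step, assuming a local swh-graph HTTP server
listening on port 5009 (the endpoint path and port are assumptions; adapt
them to your setup)::

    while read -r url; do
        # Origin SWHIDs are derived from the sha1 of the origin URL.
        ori="swh:1:ori:$(printf '%s' "$url" | sha1sum | cut -d ' ' -f 1)"
        # Full visit starting from this origin, returning reachable SWHIDs.
        curl -s "http://localhost:5009/graph/visit/nodes/$ori"
    done < origins.txt | sort -u > swhids.csv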
Step 3. Generate the subdataset on Athena
-----------------------------------------
Once you have obtained a text file containing all the SWHIDs to be included in
the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs
with the tables of an existing dataset, and write the output as a new ORC
dataset.
First, make sure that your base dataset containing the entire graph is
available as a database on AWS Athena, which can be set up by
following the steps described in :ref:`swh-graph-athena`.
The subdataset can then be generated with the ``swh dataset athena
gensubdataset`` command::

    swh dataset athena gensubdataset \
        --swhids swhids.csv \
        --database swh_20210323 \
        --subdataset-database swh_20210323_popular3kpython \
        --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/
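
Once the command completes, a quick sanity check is to list the ORC files of
the new dataset::

    aws s3 ls --recursive s3://softwareheritage/graph/2021-03-23-popular-3k-python/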
Step 4. Upload and document the newly generated subdataset
----------------------------------------------------------
Once the previous step has completed, a new dataset will be available at the
S3 path given as the ``--subdataset-location`` parameter.
You can upload, publish and document this new subdataset by following the
procedure described in :ref:`swh-graph-export`.
.. _swh-graph-athena:
Setup on Amazon Athena
======================
@@ -13,3 +13,4 @@
graph/index
export
generate_subdataset