From c5d3734746d7f59ffdb87e25db46dbbd44fe1b92 Mon Sep 17 00:00:00 2001
From: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue, 22 Nov 2022 12:02:35 +0100
Subject: [PATCH] Add a guide for scientists to get started with SWH data

---
 docs/user/index.rst            |   1 +
 docs/user/using_data/index.rst | 145 +++++++++++++++++++++++++++++++++
 user/index.rst                 |   1 +
 3 files changed, 147 insertions(+)
 create mode 100644 docs/user/using_data/index.rst

diff --git a/docs/user/index.rst b/docs/user/index.rst
index 4b58bcc7..6f304d79 100644
--- a/docs/user/index.rst
+++ b/docs/user/index.rst
@@ -11,6 +11,7 @@ Software Heritage - User Documentation
    listers/index
    loaders/index
    save_code_now/webhooks/index
+   using_data/index
 
 .. only:: user_doc
 
diff --git a/docs/user/using_data/index.rst b/docs/user/using_data/index.rst
new file mode 100644
index 00000000..88f1d85c
--- /dev/null
+++ b/docs/user/using_data/index.rst
@@ -0,0 +1,145 @@
+.. _using-swh-data:
+
+Using Software Heritage data
+============================
+
+This page documents the ways Software Heritage provides programmatic
+access to data in the archive, with pointers on how to use each of them.
+
+First, please familiarize yourself with:
+
+* the :ref:`data model <data-model>`,
+* the `content policy`_,
+* your local data protection legislation, and
+* if relevant, your employer's/university's
+  guidelines regarding research data.
+
+.. _content policy: https://www.softwareheritage.org/legal/content-policy/
+
+Data sources
+------------
+
+Software Heritage provides several ways to access the archive, with different
+tradeoffs suitable for different access patterns.
+
+REST API
+^^^^^^^^
+
+The `REST API`_ allows non-bulk read access to the whole archive,
+as well as requesting archival of specific repositories or forges,
+and downloading tarballs of individual repositories.
+
+It is available anonymously, but we recommend `authenticating
+<https://archive.softwareheritage.org/api/#authentication>`__ in order to
+benefit from higher rate limits, and request access to beta features.
+
+This API provides non-pseudonymized access to archive data; but some
+content may be taken down, or author names may be amended, according to
+the content policy.
+
+.. _REST API: https://archive.softwareheritage.org/api/
+
+Compressed graph
+^^^^^^^^^^^^^^^^
+
+:ref:`swh-graph <swh-graph>` provides three APIs to perform large traversals
+of the archive's graph,
+including in the direction opposite to the edges of the data model's DAG.
+
+It also has limited capabilities to read or filter on node/edge labels
+(i.e. directory and file names, commit messages, ...) and does not
+include file content.
+
+For example, it allows getting a list of origins containing a specific
+file or directory.
+
+The APIs are:
+
+* an :ref:`HTTP RPC API <swh-graph-api>`, which is available at
+  https://archive.softwareheritage.org/api/1/graph/ on request.
+  `Contact us`_ and tell us about your use case; we are interested to know
+  what you plan to do with it.
+* a :ref:`gRPC API <swh-graph-grpc-api>`, for language-agnostic access
+  to more advanced features.
+* a :ref:`Java API <swh-graph-java-api>` for full access to its features.
+
+The latter two are currently not hosted publicly.
+However, you can run your own instance on your own computers, using the same data we have,
+by downloading the "Compressed graph" files from the :ref:`swh-graph-dataset`.
+
+Beware that this is resource-intensive, as the full dataset takes about 150GB
+of disk and RAM for each of the two graphs (forward and backward edges);
+and swapping severely affects its performance.
+
+Producing this dataset is computationally intensive, and is not yet automated;
+so it is currently published only once a year.
+
+Author/committer name and email are pseudonymized.
+
+.. _contact us: https://www.softwareheritage.org/community/scientists/
+
+Dataset export
+^^^^^^^^^^^^^^
+
+The :ref:`swh-graph-dataset` also includes a raw export of all of
+the archive's database tables (as ORC files) and graph structure (as compressed CSV).
+It does not include file content.
+
+The ORC dataset takes about 11TB on disk.
+
+Producing this dataset is not yet automated; so it is currently published
+only once a year.
+
+Author/committer name and email are pseudonymized.
+
+Contents on S3
+^^^^^^^^^^^^^^
+
+Finally, to complement the compressed graph and dataset export, we provide
+public access to file content via an S3 bucket, accessible at
+``s3://softwareheritage/content/<sha1>`` and
+``https://softwareheritage.s3.amazonaws.com/content/<sha1>``
+where ``<sha1>`` is the hexadecimal representation of the content's
+``sha1`` hash (not to be confused with the ``sha1_git`` hash used in some places
+in the datasets and in SWHIDs).
+
+
+Possible bias
+-------------
+
+Statistical analyses on the archive may be biased by the way source code is
+collected by the archive. This section details the main sources of bias to be
+aware of when performing research on the archive.
+
+
+Code and configuration changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Software Heritage's codebase evolves over time, and the archive adds support
+for new forges regularly.
+Major changes are documented in the `archive changelog`_.
+
+Typically, this means that source code deleted from a given forge before
+Software Heritage started archiving that forge is missing
+-- which may lead to code hosted in less popular places being underrepresented
+in the archive.
+
+.. _archive changelog: archive-changelog
+
+Large objects
+^^^^^^^^^^^^^
+
+Some source code repositories, such as Chromium's and Linux's git repositories
+and their clones, are particularly large.
+This is a challenge for loaders, which may fail to load them more often
+than they fail on smaller repositories.
+
+Software Heritage also does not archive any object larger than 300MB, as such objects
+are unlikely to be source code, and would put unreasonable load on the archive.
+
+Non-code objects
+^^^^^^^^^^^^^^^^
+
+Software Heritage collects data indiscriminately from code hosting places.
+Sometimes, this includes repositories used to host non-code content and/or
+binary code.
diff --git a/user/index.rst b/user/index.rst
index 7bc3bea4..49352f68 100644
--- a/user/index.rst
+++ b/user/index.rst
@@ -9,6 +9,7 @@ Getting started
 * :ref:`listers`
 * :ref:`loaders`
 * :ref:`swh_scn_webhooks`
+* :ref:`using-swh-data`
 
 
 Indices and tables
-- 
GitLab