From c5d3734746d7f59ffdb87e25db46dbbd44fe1b92 Mon Sep 17 00:00:00 2001
From: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Tue, 22 Nov 2022 12:02:35 +0100
Subject: [PATCH] Add a guide for scientists to get started with SWH data

---
 docs/user/index.rst            |   1 +
 docs/user/using_data/index.rst | 145 +++++++++++++++++++++++++++++++++
 user/index.rst                 |   1 +
 3 files changed, 147 insertions(+)
 create mode 100644 docs/user/using_data/index.rst

diff --git a/docs/user/index.rst b/docs/user/index.rst
index 4b58bcc7..6f304d79 100644
--- a/docs/user/index.rst
+++ b/docs/user/index.rst
@@ -11,6 +11,7 @@ Software Heritage - User Documentation
    listers/index
    loaders/index
    save_code_now/webhooks/index
+   using_data/index
 
 .. only:: user_doc
 
diff --git a/docs/user/using_data/index.rst b/docs/user/using_data/index.rst
new file mode 100644
index 00000000..88f1d85c
--- /dev/null
+++ b/docs/user/using_data/index.rst
@@ -0,0 +1,145 @@
+.. _using-swh-data:
+
+Using Software Heritage data
+============================
+
+This page documents the various ways Software Heritage provides programmatic
+access to data in the archive, and gives pointers on how to use them.
+
+First, please familiarize yourself with:
+
+* the :ref:`data model <data-model>`,
+* the `content policy`_,
+* your local data protection legislation, and
+* if relevant, your employer's/university's
+  guidelines regarding research data.
+
+.. _content policy: https://www.softwareheritage.org/legal/content-policy/
+
+Data sources
+------------
+
+Software Heritage provides several ways to access the archive, with different
+tradeoffs suitable for different access patterns.
+
+REST API
+^^^^^^^^
+
+The `REST API`_ allows non-bulk read access to the whole archive,
+as well as requesting archival of specific repositories or forges,
+and downloading tarballs of individual repositories.
+
+It is available anonymously, but we recommend `authenticating
+<https://archive.softwareheritage.org/api/#authentication>`__ in order to
+benefit from higher rate limits, and request access to beta features.
+
+This API provides non-pseudonymized access to archive data, but some
+content may be taken down, or author names may be amended, according to
+the content policy.
+
+.. _REST API: https://archive.softwareheritage.org/api/
+
+Compressed graph
+^^^^^^^^^^^^^^^^
+
+:ref:`swh-graph <swh-graph>` provides three APIs to perform large traversals
+on the graph of the archive,
+even in the opposite direction of the data model's DAG.
+
+It also has limited capabilities to read or filter on node/edge labels
+(i.e. directory and file names, commit messages, ...) and does not
+include file content.
+
+For example, it allows getting a list of origins containing a specific
+file or directory.
+
+The APIs are:
+
+* an :ref:`HTTP RPC API <swh-graph-api>`, which is available at
+  https://archive.softwareheritage.org/api/1/graph/ on request.
+  `Contact us`_ and tell us about your use case; we are interested to know
+  what you plan to do with it.
+* a :ref:`gRPC API <swh-graph-grpc-api>`, for language-agnostic access
+  to more advanced features
+* a :ref:`Java API <swh-graph-java-api>` for full access to its features.
+
+The latter two are currently not hosted publicly.
+However, you can run your own instance on your own computers, using the same data
+we use, by downloading the "Compressed graph" files from the :ref:`swh-graph-dataset`.
+
+Beware that this is resource-intensive, as the full dataset takes about 150GB
+of disk and RAM for each of the two graphs (forward and backward edges);
+and swapping severely affects its performance.
+
+Producing this dataset is computationally intensive, and is not yet automated,
+so it is currently published only once a year.
+
+Author/committer name and email are pseudonymized.
+
+.. _contact us: https://www.softwareheritage.org/community/scientists/
+
+Dataset export
+^^^^^^^^^^^^^^
+
+The :ref:`swh-graph-dataset` also includes a raw export of all of
+the archive's database tables (as ORC files) and graph structure (as compressed CSV).
+It does not include file content.
+
+The ORC dataset takes about 11TB on disk.
+
+Producing this dataset is not yet automated, so it is currently published
+only once a year.
+
+Author/committer name and email are pseudonymized.
+
+Contents on S3
+^^^^^^^^^^^^^^
+
+Finally, to complement the compressed graph and dataset export, we provide
+public access to file content via an S3 bucket, accessible at
+``s3://softwareheritage/content/<sha1>`` and
+``https://softwareheritage.s3.amazonaws.com/content/<sha1>``
+where ``<sha1>`` is the hexadecimal representation of the content's
+``sha1`` hash (not to be confused with the ``sha1_git`` hash used in some places
+in the datasets and in SWHIDs).
+
+
+Possible bias
+-------------
+
+Statistical analyses on the archive may be biased by the way source code is
+collected by the archive. This section details the main biases to be aware of
+when performing research on the archive.
+
+
+Code and configuration changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Software Heritage's codebase evolves over time, and the archive adds support
+for new forges regularly.
+Major changes are documented in the `archive changelog`_.
+
+Typically, this means that source code deleted from a given forge before
+Software Heritage started archiving that forge is missing,
+which may cause code hosted in less popular places to be underrepresented
+in the archive.
+
+.. _archive changelog: archive-changelog
+
+Large objects
+^^^^^^^^^^^^^
+
+Some source code repositories, such as Chromium's and Linux's git repositories
+and their clones, are particularly large.
+This is a challenge for loaders, which may fail to load them more often
+than smaller repositories.
+
+Software Heritage also does not archive objects larger than 300MB, as they
+are unlikely to be source code, and would put unreasonable load on the archive.
+
+Non-code objects
+^^^^^^^^^^^^^^^^
+
+Software Heritage collects data indiscriminately from code hosting places.
+Sometimes, this includes repositories used to host non-code content and/or
+binary code.
diff --git a/user/index.rst b/user/index.rst
index 7bc3bea4..49352f68 100644
--- a/user/index.rst
+++ b/user/index.rst
@@ -9,6 +9,7 @@ Getting started
 * :ref:`listers`
 * :ref:`loaders`
 * :ref:`swh_scn_webhooks`
+* :ref:`using-swh-data`
 
 
 Indices and tables
-- 
GitLab