Setup on Amazon Athena
======================
The Software Heritage Graph Dataset is available as a public dataset in `Amazon
Athena <https://aws.amazon.com/athena/>`_. Athena uses `Presto
<https://prestodb.github.io/>`_, a distributed SQL query engine, to
automatically scale queries on large datasets.
The pricing of Athena depends on the amount of data scanned by each query,
generally at a cost of $5 per TiB of data scanned. Full pricing details are
available on the `Athena pricing page <https://aws.amazon.com/athena/pricing/>`_.
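At that rate, a query that scans 100 GiB of data costs about $0.49, and a
query that scans an entire 1.2 TiB table (the size of the full dataset in
Parquet format) costs about $6.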
Note that because the Software Heritage Graph Dataset is available as a public
dataset, you **do not have to pay for the storage, only for the queries**
(except for the data you store on S3 yourself, like query results).
Loading the tables
------------------
.. highlight:: bash
AWS account
~~~~~~~~~~~
In order to use Amazon Athena, you will first need to `create an AWS account
and setup billing
<https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/>`_.
Setup
~~~~~
Athena needs to be made aware of the location and the schema of the Parquet
files available as a public dataset. Unfortunately, since Athena does not
support queries that contain multiple commands, setup is not as simple as
pasting an installation script in the console. Instead, we provide a Python
script that you can run locally on your machine; it communicates with Athena
to create the tables automatically, with the appropriate schema.
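For reference, since Athena runs one command per query, the script simply
issues one ``CREATE EXTERNAL TABLE`` statement per table. A minimal sketch of
what such a statement looks like (the column types and the S3 location below
are illustrative assumptions, not the actual schema):

.. code-block:: sql

   -- Illustrative sketch only: the real schema and S3 location are set up
   -- by the provided script.
   CREATE EXTERNAL TABLE IF NOT EXISTS directory_entry_file (
     target binary,
     name   binary,
     perms  int
   )
   STORED AS PARQUET
   LOCATION 's3://<public-dataset-bucket>/graph/directory_entry_file/';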
To run this script, you will need to install a few dependencies on your
machine:
- For **Ubuntu** and **Debian**::
sudo apt install python3 python3-boto3 awscli
- For **Archlinux**::
sudo pacman -S --needed python python-boto3 aws-cli
Once the dependencies are installed, run::
aws configure
This command asks for an AWS Access Key ID and an AWS Secret Access Key, which
give the script access to your AWS account. These keys can be generated at
`this address
<https://console.aws.amazon.com/iam/home#/security_credentials>`_.
It will also ask for the region in which you want to run the queries. We
recommend using ``us-east-1``, since that is where the public dataset is
located.
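The prompts look like the following (the access key values shown here are the
placeholder examples from the AWS documentation)::

   AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
   AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
   Default region name [None]: us-east-1
   Default output format [None]: json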
Creating the tables
~~~~~~~~~~~~~~~~~~~
Download and run the Python script that will create the tables on your account:
.. tabs::
.. group-tab:: full
::
wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/tables.py
wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/gen_schema.py
./gen_schema.py
.. group-tab:: popular-4k
This dataset is not available on Athena yet.
.. group-tab:: popular-3k-python
This dataset is not available on Athena yet.
To check that the tables have been successfully created in your account, you
can open your `Amazon Athena console
<https://console.aws.amazon.com/athena/home>`_. You should be able to select
the database corresponding to your dataset, and see the tables:
.. image:: _images/athena_tables.png
Running queries
---------------
.. highlight:: sql
From the console, once you have selected the database of your dataset, you can
run SQL queries directly from the Query Editor.
For instance, try this query, which computes the ten most frequent file names
in the archive::
SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10;
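Because Athena bills by the amount of data scanned and Parquet is a columnar
format, queries that touch fewer columns are cheaper to run. For instance,
this query only reads the ``name`` column of the same table::

    SELECT COUNT(DISTINCT name) AS distinct_names
    FROM directory_entry_file;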
Other examples are available in the preprint of our article: `The Software
Heritage Graph Dataset: Public software development under one roof.
<https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_
Datasets
========
We provide the full graph dataset along with two smaller datasets that can be
used for smaller-scale experiments.
The main URLs of the datasets are relative to our dataset prefix:
`https://annex.softwareheritage.org/public/dataset/ <https://annex.softwareheritage.org/public/dataset/>`__
full
----
The ``full`` dataset contains the full Software Heritage Graph. It is available
in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/sql/>`_
- **Total size**: 1.2 TiB
- **Apache Parquet**:
- **URL**: `/graph/latest/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/parquet/>`_
- **Total size**: 1.2 TiB
popular-4k
----------
The ``popular-4k`` dataset contains a subset of 4000 popular
repositories from GitHub, GitLab, PyPI and Debian. The selection criteria
used to pick the software origins were the following:
- The 1000 most popular GitHub projects (by number of stars)
- The 1000 most popular GitLab projects (by number of stars)
- The 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages <https://hugovk.github.io/top-pypi-packages/>`_ database),
- The 1000 most popular Debian packages (by "votes" according to the `Debian
Popularity Contest <https://popcon.debian.org/>`_ database)
This dataset is available in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-4k/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/sql/>`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-4k/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/parquet/>`_
- **Total size**: TODO
popular-3k-python
-----------------
The ``popular-3k-python`` dataset contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
GitLab, PyPI and Debian. The selection criteria used to pick the software
origins were the following, similar to ``popular-4k``:
- the 1000 most popular GitHub projects written in Python (by number of stars),
- the 131 GitLab projects written in Python that have 2 stars or more,
- the 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages <https://hugovk.github.io/top-pypi-packages/>`_ database),
- the 1000 most popular Debian packages with the
`debtag <https://debtags.debian.org/>`_ ``implemented-in::python`` (by
"votes" according to the `Debian Popularity Contest
<https://popcon.debian.org/>`_ database).
This dataset is available in the following formats:

- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-3k-python/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/sql/>`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-3k-python/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/parquet/>`_
- **Total size**: TODO
Software Heritage Graph Dataset
===============================
This is the Software Heritage graph dataset: a fully-deduplicated Merkle
DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version
Control System (VCS) commits tracking evolution over time, up to the
full states of VCS repositories as observed by Software Heritage during
periodic crawls. The dataset’s contents come from major development
forges (including `GitHub <https://github.com/>`__ and
`GitLab <https://gitlab.com>`__), FOSS distributions (e.g.,
`Debian <https://www.debian.org/>`__), and language-specific package managers (e.g.,
`PyPI <https://pypi.org/>`__). Crawling information is also included,
providing timestamps about when and where all archived source code
artifacts have been observed in the wild.
The Software Heritage graph dataset is available in multiple formats,
including downloadable CSV dumps and Apache Parquet files for local use,
as well as a public instance on the Amazon Athena interactive query service
for powerful, ready-to-use analytical processing.
By accessing the dataset, you agree with the Software Heritage `Ethical
Charter for using the archive
data <https://www.softwareheritage.org/legal/users-ethical-charter/>`__,
and the `terms of use for bulk
access <https://www.softwareheritage.org/legal/bulk-access-terms-of-use/>`__.
If you use this dataset for research purposes, please cite the following paper:
| Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
| *The Software Heritage Graph Dataset: Public software development under one roof.*
| In proceedings of `MSR 2019 <http://2019.msrconf.org/>`_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 <https://2019.icse-conferences.org/>`_.
| `preprint <https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_, `bibtex <https://upsilon.cc/~zack/research/publications/msr-2019-swh.bib>`_
.. toctree::
:maxdepth: 2
:caption: Contents:
datasets.rst
postgresql.rst
athena.rst
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Setup on a PostgreSQL instance
==============================
This tutorial will guide you through the steps required to set up the Software
Heritage Graph Dataset in a PostgreSQL database.
.. highlight:: bash
PostgreSQL local setup
----------------------
You need access to a running PostgreSQL instance to load the dataset.
This section explains how to set up PostgreSQL for the first
time.
*If you already have a PostgreSQL server running on your machine, you can skip
to the next section.*
- For **Ubuntu** and **Debian**::
sudo apt install postgresql
- For **Archlinux**::
sudo pacman -S --needed postgresql
sudo -u postgres initdb -D '/var/lib/postgres/data'
sudo systemctl enable --now postgresql
Once PostgreSQL is running, you also need a user that is able to create
databases and run queries. The easiest way to achieve that is to create an
account that has the same name as your username and that can create
databases::
sudo -u postgres createuser --createdb $USER
Retrieving the dataset
----------------------
You need to download the dataset in SQL format. Use the following command on
your machine, after making sure that it has enough available space for the
dataset you chose:
.. tabs::
.. group-tab:: full
::
mkdir full && cd full
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
.. group-tab:: popular-4k
::
mkdir popular-4k && cd popular-4k
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
.. group-tab:: popular-3k-python
::
mkdir popular-3k-python && cd popular-3k-python
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/
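The ``wget`` flags mirror the remote directory into the current one: ``-r
-np`` fetches the listing recursively without ascending to the parent
directory, ``-A gz,sql`` keeps only the dump files, ``-nd -nH`` flattens the
output into the current directory, and ``-c`` lets you resume an interrupted
download.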
Loading the dataset
-------------------
Once you have retrieved the dataset of your choice, create a database that
will contain it, and load the dump:
.. tabs::
.. group-tab:: full
::
createdb softwareheritage-full
psql softwareheritage-full < swh_import.sql
.. group-tab:: popular-4k
::
createdb softwareheritage-popular-4k
psql softwareheritage-popular-4k < swh_import.sql
.. group-tab:: popular-3k-python
::
createdb softwareheritage-popular-3k-python
psql softwareheritage-popular-3k-python < swh_import.sql
You can now run SQL queries on your database. Run ``psql <database_name>`` to
start an interactive PostgreSQL console.
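As a quick sanity check after the import, you can count the archived software
origins (a minimal example; it assumes the ``origin`` table from the dump is
present):

.. code-block:: sql

   -- Count the software origins (repositories, packages, ...) in the dataset.
   SELECT COUNT(*) FROM origin;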