Setup on Amazon Athena
======================
The Software Heritage Graph Dataset is available as a public dataset in `Amazon
Athena <https://aws.amazon.com/athena/>`_. Athena uses `Presto
<https://prestodb.github.io/>`_, a distributed SQL query engine, to
automatically scale queries on large datasets.
The pricing of Athena depends on the amount of data scanned by each query,
generally at a cost of $5 per TiB of data scanned. Full pricing details are
available on the `Athena pricing page <https://aws.amazon.com/athena/pricing/>`_.
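At that rate, a query that scans 100 GiB of data costs about $0.49, and a
query that scans an entire 1.2 TiB table (the size of the full dataset in
Parquet format) costs about $6.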
Note that because the Software Heritage Graph Dataset is available as a public
dataset, you **do not have to pay for the storage, only for the queries**
(except for the data you store on S3 yourself, like query results).
Loading the tables
------------------
.. highlight:: bash
AWS account
~~~~~~~~~~~
In order to use Amazon Athena, you will first need to `create an AWS account
and setup billing
<https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/>`_.
Setup
~~~~~
Athena needs to be made aware of the location and the schema of the Parquet
files available as a public dataset. Unfortunately, since Athena does not
support queries that contain multiple commands, setup is not as simple as
pasting an installation script in the console. Instead, we provide a Python
script that you can run locally on your machine; it communicates with Athena
to create the tables automatically, with the appropriate schema.
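For reference, since Athena runs one command per query, the script simply
issues one ``CREATE EXTERNAL TABLE`` statement per table. A minimal sketch of
what such a statement looks like (the column types and the S3 location below
are illustrative assumptions, not the actual schema):

.. code-block:: sql

   -- Illustrative sketch only: the real schema and S3 location are set up
   -- by the provided script.
   CREATE EXTERNAL TABLE IF NOT EXISTS directory_entry_file (
     target binary,
     name   binary,
     perms  int
   )
   STORED AS PARQUET
   LOCATION 's3://<public-dataset-bucket>/graph/directory_entry_file/';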
To run this script, you will need to install a few dependencies on your
machine:
- For **Ubuntu** and **Debian**::
sudo apt install python3 python3-boto3 awscli
- For **Archlinux**::
sudo pacman -S --needed python python-boto3 aws-cli
Once the dependencies are installed, run::
aws configure
This command asks for an AWS Access Key ID and an AWS Secret Access Key, which
give the script access to your AWS account. These keys can be generated at
`this address
<https://console.aws.amazon.com/iam/home#/security_credentials>`_.
It will also ask for the region in which you want to run the queries. We
recommend using ``us-east-1``, since that is where the public dataset is
located.
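The prompts look like the following (the access key values shown here are the
placeholder examples from the AWS documentation)::

   AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
   AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
   Default region name [None]: us-east-1
   Default output format [None]: json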
Creating the tables
~~~~~~~~~~~~~~~~~~~
Download and run the Python script that will create the tables on your account:
.. tabs::
.. group-tab:: full
::
wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/tables.py
wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/gen_schema.py
./gen_schema.py
.. group-tab:: popular-4k
This dataset is not available on Athena yet.
.. group-tab:: popular-3k-python
This dataset is not available on Athena yet.
To check that the tables have been successfully created in your account, you
can open your `Amazon Athena console
<https://console.aws.amazon.com/athena/home>`_. You should be able to select
the database corresponding to your dataset, and see the tables:
.. image:: _images/athena_tables.png
Running queries
---------------
.. highlight:: sql
From the console, once you have selected the database of your dataset, you can
run SQL queries directly from the Query Editor.
For instance, try this query, which computes the ten most frequent file names
in the archive::
SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10;
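Because Athena bills by the amount of data scanned and Parquet is a columnar
format, queries that touch fewer columns are cheaper to run. For instance,
this query only reads the ``name`` column of the same table::

    SELECT COUNT(DISTINCT name) AS distinct_names
    FROM directory_entry_file;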
Other examples are available in the preprint of our article: `The Software
Heritage Graph Dataset: Public software development under one roof.
<https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_
Datasets
========
We provide the full graph dataset along with two smaller datasets that can be
used for smaller-scale experiments.
The main URLs of the datasets are relative to our dataset prefix:
`https://annex.softwareheritage.org/public/dataset/ <https://annex.softwareheritage.org/public/dataset/>`__
full
----
The ``full`` dataset contains the full Software Heritage Graph. It is available
in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/sql/>`_
- **Total size**: 1.2 TiB
- **Apache Parquet**:
- **URL**: `/graph/latest/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/parquet/>`_
- **Total size**: 1.2 TiB
popular-4k
----------
The ``popular-4k`` dataset contains a subset of 4000 popular
repositories from GitHub, GitLab, PyPI and Debian. The selection criteria
used to pick the software origins were the following:
- The 1000 most popular GitHub projects (by number of stars)
- The 1000 most popular GitLab projects (by number of stars)
- The 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages <https://hugovk.github.io/top-pypi-packages/>`_ database),
- The 1000 most popular Debian packages (by "votes" according to the `Debian
Popularity Contest <https://popcon.debian.org/>`_ database)
This dataset is available in the following formats:
- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-4k/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/sql/>`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-4k/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-4k/parquet/>`_
- **Total size**: TODO
popular-3k-python
-----------------
The ``popular-3k-python`` dataset contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
GitLab, PyPI and Debian. The selection criteria used to pick the software
origins were the following, similar to ``popular-4k``:
- the 1000 most popular GitHub projects written in Python (by number of stars),
- the 131 GitLab projects written in Python that have 2 stars or more,
- the 1000 most popular PyPI projects (by usage statistics, according to the
`Top PyPI Packages <https://hugovk.github.io/top-pypi-packages/>`_ database),
- the 1000 most popular Debian packages with the
`debtag <https://debtags.debian.org/>`_ ``implemented-in::python`` (by
"votes" according to the `Debian Popularity Contest
<https://popcon.debian.org/>`_ database).
This dataset is available in the following formats:

- **PostgreSQL (compressed)**:
- **URL**: `/graph/latest/popular-3k-python/sql/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/sql/>`_
- **Total size**: TODO
- **Apache Parquet**:
- **URL**: `/graph/latest/popular-3k-python/parquet/
<https://annex.softwareheritage.org/public/dataset/graph/latest/popular-3k-python/parquet/>`_
- **Total size**: TODO
Software Heritage Graph Dataset
===============================
This is the Software Heritage graph dataset: a fully-deduplicated Merkle
DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version
Control System (VCS) commits tracking evolution over time, up to the
full states of VCS repositories as observed by Software Heritage during
periodic crawls. The dataset’s contents come from major development
forges (including `GitHub <https://github.com/>`__ and
`GitLab <https://gitlab.com>`__), FOSS distributions (e.g.,
`Debian <https://www.debian.org/>`__), and language-specific package managers (e.g.,
`PyPI <https://pypi.org/>`__). Crawling information is also included,
providing timestamps about when and where all archived source code
artifacts have been observed in the wild.
The Software Heritage graph dataset is available in multiple formats,
including downloadable CSV dumps and Apache Parquet files for local use,
as well as a public instance on the Amazon Athena interactive query service
for powerful, ready-to-use analytical processing.
By accessing the dataset, you agree with the Software Heritage `Ethical
Charter for using the archive
data <https://www.softwareheritage.org/legal/users-ethical-charter/>`__,
and the `terms of use for bulk
access <https://www.softwareheritage.org/legal/bulk-access-terms-of-use/>`__.
If you use this dataset for research purposes, please cite the following paper:
| Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
| *The Software Heritage Graph Dataset: Public software development under one roof.*
| In proceedings of `MSR 2019 <http://2019.msrconf.org/>`_: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with `ICSE 2019 <https://2019.icse-conferences.org/>`_.
| `preprint <https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf>`_, `bibtex <https://upsilon.cc/~zack/research/publications/msr-2019-swh.bib>`_
.. toctree::
:maxdepth: 2
:caption: Contents:
datasets.rst
postgresql.rst
athena.rst
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Setup on a PostgreSQL instance
==============================
This tutorial will guide you through the steps required to set up the Software
Heritage Graph Dataset in a PostgreSQL database.
.. highlight:: bash
PostgreSQL local setup
----------------------
You need access to a running PostgreSQL instance to load the dataset.
This section explains how to set up PostgreSQL for the first
time.
*If you already have a PostgreSQL server running on your machine, you can skip
to the next section.*
- For **Ubuntu** and **Debian**::
sudo apt install postgresql
- For **Archlinux**::
sudo pacman -S --needed postgresql
sudo -u postgres initdb -D '/var/lib/postgres/data'
sudo systemctl enable --now postgresql
Once PostgreSQL is running, you also need a user that is able to create
databases and run queries. The easiest way to achieve that is to create an
account that has the same name as your username and that can create
databases::
sudo -u postgres createuser --createdb $USER
Retrieving the dataset
----------------------
You need to download the dataset in SQL format. Use the following command on
your machine, after making sure that it has enough available space for the
dataset you chose:
.. tabs::
.. group-tab:: full
::
mkdir full && cd full
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
.. group-tab:: popular-4k
::
mkdir popular-4k && cd popular-4k
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
.. group-tab:: popular-3k-python
::
mkdir popular-3k-python && cd popular-3k-python
wget -c -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/
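The ``wget`` flags mirror the remote directory into the current one: ``-r
-np`` fetches the listing recursively without ascending to the parent
directory, ``-A gz,sql`` keeps only the dump files, ``-nd -nH`` flattens the
output into the current directory, and ``-c`` lets you resume an interrupted
download.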
Loading the dataset
-------------------
Once you have retrieved the dataset of your choice, create a database that
will contain it, and load the dump:
.. tabs::
.. group-tab:: full
::
createdb softwareheritage-full
psql softwareheritage-full < swh_import.sql
.. group-tab:: popular-4k
::
createdb softwareheritage-popular-4k
psql softwareheritage-popular-4k < swh_import.sql
.. group-tab:: popular-3k-python
::
createdb softwareheritage-popular-3k-python
psql softwareheritage-popular-3k-python < swh_import.sql
You can now run SQL queries on your database. Run ``psql <database_name>`` to
start an interactive PostgreSQL console.
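As a quick sanity check after the import, you can count the archived software
origins (a minimal example; it assumes the ``origin`` table from the dump is
present):

.. code-block:: sql

   -- Count the software origins (repositories, packages, ...) in the dataset.
   SELECT COUNT(*) FROM origin;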