Dataset

We aim to provide regular exports of the Software Heritage graph in two different formats:

  • Columnar data storage: a set of relational tables stored in a columnar format such as Apache ORC, which is particularly well suited to scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
  • Compressed graph: a compact and highly efficient representation of the graph dataset, suited to scale-up analyses on high-end machines with large amounts of memory. The graph is compressed using the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.

Summary of dataset versions

Full graph:

Name          # Nodes           # Edges            Columnar   Compressed
2021-03-23    20,667,308,808    232,748,148,441
2020-12-15    19,330,739,526    213,848,749,638
2020-05-20    17,075,708,289    203,351,589,619
2019-01-28    11,683,687,950    159,578,271,511
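
As a rough illustration of how these figures relate to one another, the sketch below computes the average number of edges per node for each full export and the relative growth between consecutive exports. The counts are copied from the table above; the helper names (`edges_per_node`, `growth`) are ours and not part of any Software Heritage tool. Ratios involving the deprecated 2019-01-28 and 2020-05-20 exports should be read with caution, given their known issues.

```python
# Node and edge counts copied from the "Full graph" table above.
FULL_EXPORTS = {
    "2019-01-28": (11_683_687_950, 159_578_271_511),
    "2020-05-20": (17_075_708_289, 203_351_589_619),
    "2020-12-15": (19_330_739_526, 213_848_749_638),
    "2021-03-23": (20_667_308_808, 232_748_148_441),
}


def edges_per_node(nodes: int, edges: int) -> float:
    """Average number of edges per node in one export."""
    return edges / nodes


def growth(old: int, new: int) -> float:
    """Relative growth from old to new (0.07 means +7%)."""
    return (new - old) / old


# ISO-formatted dates sort chronologically as plain strings.
names = sorted(FULL_EXPORTS)

for name in names:
    nodes, edges = FULL_EXPORTS[name]
    print(f"{name}: {edges_per_node(nodes, edges):.2f} edges per node")

for prev, cur in zip(names, names[1:]):
    node_growth = growth(FULL_EXPORTS[prev][0], FULL_EXPORTS[cur][0])
    edge_growth = growth(FULL_EXPORTS[prev][1], FULL_EXPORTS[cur][1])
    print(f"{prev} -> {cur}: nodes {node_growth:+.1%}, edges {edge_growth:+.1%}")
```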

Teaser datasets:

Name                            # Nodes          # Edges           Columnar   Compressed
2020-12-15-gitlab-all           1,083,011,764    27,919,670,049
2020-12-15-gitlab-100k          304,037,235      9,516,984,175
2019-01-28-popular-4k           ?                ?
2019-01-28-popular-3k-python    27,363,226       346,413,337
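
To give a sense of scale, the following sketch expresses the two December 2020 teasers as a share of the full 2020-12-15 export. All counts come from the two tables above; the `share` helper is a name we introduce for illustration. The 2019 teasers are left out because the counts for popular-4k are not known.

```python
# (nodes, edges) of the full 2020-12-15 export, from the "Full graph" table.
FULL_2020_12_15 = (19_330_739_526, 213_848_749_638)

# (nodes, edges) of the teasers exported on the same date.
TEASERS = {
    "2020-12-15-gitlab-all": (1_083_011_764, 27_919_670_049),
    "2020-12-15-gitlab-100k": (304_037_235, 9_516_984_175),
}


def share(part: int, whole: int) -> float:
    """Fraction of `whole` that `part` represents."""
    return part / whole


for name, (nodes, edges) in TEASERS.items():
    node_share = share(nodes, FULL_2020_12_15[0])
    edge_share = share(edges, FULL_2020_12_15[1])
    print(f"{name}: {node_share:.1%} of nodes, {edge_share:.1%} of edges")
```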

Full graph datasets

2021-03-23

A full export of the graph dated from March 2021.

2020-12-15

A full export of the graph dated from December 2020. Only available in compressed representation.

2020-05-20

A full export of the graph dated from May 2020. Only available in compressed representation. (DEPRECATED: known issue with missing snapshot edges.)

2019-01-28

A full export of the graph dated from January 2019. The export was done in two phases, named "2018-09-25" and "2019-01-28". Both names refer to the same dataset, but the different formats are not fully consistent with one another. (DEPRECATED: early export pipeline, various inconsistencies.)

Teaser datasets

If the above datasets are too big, we also provide smaller "teaser" datasets to help you get started.

2020-12-15-gitlab-all

A teaser dataset containing the entirety of GitLab, exported in December 2020. Available in compressed graph format.

2020-12-15-gitlab-100k

A teaser dataset containing the 100k most popular GitLab repositories, exported in December 2020. Available in compressed graph format.

2019-01-28-popular-4k

This teaser dataset contains a subset of 4000 popular repositories from GitHub, GitLab, PyPI and Debian. The software origins were selected according to the following criteria:

  • the 1000 most popular GitHub projects (by number of stars),
  • the 1000 most popular GitLab projects (by number of stars),
  • the 1000 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
  • the 1000 most popular Debian packages (by "votes" according to the Debian Popularity Contest database).

Available in columnar format (Apache Parquet).
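
The selection criteria above amount to a per-source "top N by popularity" pick. The sketch below shows one way such a selection could be expressed; it is not the script used to build the dataset. The `Origin` type, the function names, and the example URLs and scores are all hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, NamedTuple


class Origin(NamedTuple):
    url: str         # hypothetical origin URL
    popularity: int  # stars, usage count or votes, depending on the source


def most_popular(origins: Iterable[Origin], n: int) -> List[Origin]:
    """Return the n origins with the highest popularity, best first."""
    return heapq.nlargest(n, origins, key=lambda o: o.popularity)


def build_selection(per_source: Dict[str, List[Origin]], n: int) -> List[Origin]:
    """Take the top n origins of every source and concatenate the results."""
    selected: List[Origin] = []
    for source in sorted(per_source):
        selected.extend(most_popular(per_source[source], n))
    return selected


# Toy data standing in for real popularity rankings (made-up values).
sample = {
    "github": [
        Origin("https://example.org/a", 5),
        Origin("https://example.org/b", 50),
        Origin("https://example.org/c", 10),
    ],
    "pypi": [
        Origin("https://example.org/x", 1),
        Origin("https://example.org/y", 3),
    ],
}

for origin in build_selection(sample, n=2):
    print(origin.url, origin.popularity)
```

For the teaser described here, `n` would be 1000 and there would be four sources, each ranked by its own popularity measure.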

2019-01-28-popular-3k-python

The popular-3k-python teaser contains a subset of 3052 popular repositories tagged as being written in the Python language, from GitHub, GitLab, PyPI and Debian. The software origins were selected according to the following criteria, similar to those of popular-4k:

  • the 1000 most popular GitHub projects written in Python (by number of stars),
  • the 131 GitLab projects written in Python that have 2 stars or more,
  • the 1000 most popular PyPI projects (by usage statistics, according to the Top PyPI Packages database),
  • the 1000 most popular Debian packages with the debtag implemented-in::python (by "votes" according to the Debian Popularity Contest database).

Available in columnar format (Apache Parquet).
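
As a quick sanity check on the counts quoted for the two 2019 teasers, the sketch below sums the per-source quotas. The four popular-4k quotas add up to exactly 4000, while the popular-3k-python quotas add up to 3131, which is 79 more than the 3052 repositories stated. This document does not explain the difference, so the code only reports it and assumes nothing about its cause.

```python
# Per-source quotas as listed in the selection criteria of each teaser.
POPULAR_4K = {"GitHub": 1000, "GitLab": 1000, "PyPI": 1000, "Debian": 1000}
POPULAR_3K_PYTHON = {"GitHub": 1000, "GitLab": 131, "PyPI": 1000, "Debian": 1000}

# Repository counts stated in the teaser descriptions.
STATED_4K = 4000
STATED_3K_PYTHON = 3052

total_4k = sum(POPULAR_4K.values())
total_3k_python = sum(POPULAR_3K_PYTHON.values())

# Difference between the summed quotas and the stated repository count.
gap_4k = total_4k - STATED_4K
gap_3k_python = total_3k_python - STATED_3K_PYTHON

print(f"popular-4k: quotas sum to {total_4k}, stated {STATED_4K}, gap {gap_4k}")
print(
    f"popular-3k-python: quotas sum to {total_3k_python}, "
    f"stated {STATED_3K_PYTHON}, gap {gap_3k_python}"
)
```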