Commit c5d37347, authored 2 years ago by vlorentz
Add a guide for scientists to get started with SWH data
parent
c5834f11
Showing 3 changed files with 147 additions and 0 deletions:
docs/user/index.rst (+1, -0)
docs/user/using_data/index.rst (+145, -0)
user/index.rst (+1, -0)
docs/user/index.rst
@@ -11,6 +11,7 @@ Software Heritage - User Documentation
listers/index
loaders/index
save_code_now/webhooks/index
using_data/index
.. only:: user_doc
docs/user/using_data/index.rst
0 → 100644
.. _using-swh-data:
Using Software Heritage data
============================
This page documents the various ways Software Heritage provides programmatic
access to data in the archive, and gives pointers on how to use them.
First, please familiarize yourself with:
* the :ref:`data model <data-model>`,
* the `content policy`_,
* your local data protection legislation, and
* if relevant, your employer's/university's
guidelines regarding research data.
.. _content policy: https://www.softwareheritage.org/legal/content-policy/
Data sources
------------
Software Heritage provides several ways to access the archive, with different
tradeoffs suitable for different access patterns.
REST API
^^^^^^^^
The `REST API`_ allows non-bulk read access to the whole archive,
as well as requesting archival of specific repositories or forges,
and downloading tarballs of individual repositories.
It is available anonymously, but we recommend `authenticating
<https://archive.softwareheritage.org/api/#authentication>`__ in order to
benefit from higher rate limits, and request access to beta features.
This API provides non-pseudonymized access to archive data. However, some
content may be taken down, and author names may be amended, in accordance
with the content policy.
.. _REST API: https://archive.softwareheritage.org/api/
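As a minimal sketch of how a client might prepare an anonymous or authenticated call, the snippet below builds a request object without sending it. The ``origin/<url>/get/`` endpoint path, the ``Bearer`` token scheme and the example origin URL are assumptions made for illustration; check the REST API documentation linked above before relying on them.

```python
import urllib.request

# Root of the REST API, as linked from this page.
API_ROOT = "https://archive.softwareheritage.org/api/1/"


def origin_endpoint(origin_url):
    """Return the (assumed) endpoint path describing a single origin."""
    # Assumption: origin lookups follow the pattern origin/<url>/get/
    return f"origin/{origin_url}/get/"


def build_request(endpoint, token=None):
    """Prepare a GET request for an endpoint, optionally authenticated."""
    request = urllib.request.Request(API_ROOT + endpoint)
    request.add_header("Accept", "application/json")
    if token is not None:
        # Assumption: tokens are passed using the Bearer scheme.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical origin URL, used only as an example.
request = build_request(origin_endpoint("https://github.com/python/cpython"))

# Sending the request needs network access and counts against rate limits:
# import json
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to add a token later without changing the calling code.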
Compressed graph
^^^^^^^^^^^^^^^^
:ref:`swh-graph <swh-graph>` provides three APIs to perform large traversals
of the graph of the archive, including in the direction opposite to the
edges of the data model's DAG.
It also has limited capabilities to read or filter on node/edge labels
(i.e. directory and file names, commit messages, ...) and does not
include file content.
For example, it allows getting a list of origins containing a specific
file or directory.
The APIs are:
* an :ref:`HTTP RPC API <swh-graph-api>`, which is available at
https://archive.softwareheritage.org/api/1/graph/ on request.
`Contact us`_ and tell us about your use case; we are interested to know
what you plan to do with it.
* a :ref:`gRPC API <swh-graph-grpc-api>`, for language-agnostic access
to more advanced features
* a :ref:`Java API <swh-graph-java-api>` for full access to its features.
The latter two are currently not hosted publicly.
However, you can run your own instance on your own computers, using the same data we have,
by downloading the "Compressed graph" files from the :ref:`swh-graph-dataset`.
Beware that this is resource-intensive, as the full dataset takes about 150GB
of disk and RAM for each of the two graphs (forward and backward edges);
and swapping severely affects its performance.
Producing this dataset is computationally intensive, and is not yet automated;
so it is currently published only once a year.
Author/committer name and email are pseudonymized.
.. _contact us: https://www.softwareheritage.org/community/scientists/
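To illustrate the "origins containing a specific file" use case with the HTTP RPC API listed above, here is a hedged sketch that only builds the query URL. The ``leaves`` endpoint and the ``direction`` and ``return_types`` parameters are assumptions based on the swh-graph HTTP API; the SWHID is a placeholder, and access to the hosted endpoint must be requested first.

```python
import urllib.parse

# Base URL of the hosted HTTP RPC API, as given on this page.
GRAPH_API_ROOT = "https://archive.softwareheritage.org/api/1/graph/"


def origins_containing_url(swhid):
    """Build a URL asking for the origins reachable backward from a node."""
    # Assumptions: a "leaves" endpoint exists, "direction=backward" follows
    # edges against the DAG, and "return_types=ori" keeps only origin nodes.
    query = urllib.parse.urlencode(
        {"direction": "backward", "return_types": "ori"}
    )
    return f"{GRAPH_API_ROOT}leaves/{swhid}?{query}"


# Placeholder SWHID of a file content; replace it with a real identifier.
placeholder_swhid = "swh:1:cnt:" + "0" * 40
url = origins_containing_url(placeholder_swhid)
print(url)
```

The same traversal can be run against a self-hosted instance by pointing ``GRAPH_API_ROOT`` at it.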
Dataset export
^^^^^^^^^^^^^^
The :ref:`swh-graph-dataset` also includes a raw export of all of
the archive's database tables (as ORC files) and graph structure (as compressed CSV).
It does not include file content.
The ORC dataset takes about 11TB on disk.
Producing this dataset is not yet automated; so it is currently published
only once a year.
Author/committer name and email are pseudonymized.
Contents on S3
^^^^^^^^^^^^^^
Finally, to complement the compressed graph and dataset export, we provide
public access to file content via an S3 bucket, accessible at
``s3://softwareheritage/content/<sha1>`` and
``https://softwareheritage.s3.amazonaws.com/content/<sha1>``
where ``<sha1>`` is the hexadecimal representation of the content's
``sha1`` hash (not to be confused with ``sha1_git`` hash used in some places
in the datasets and in SWHID).
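The difference between the two hashes can be sketched as follows: ``sha1`` is computed over the raw bytes, whereas ``sha1_git`` is computed the way Git hashes a blob, with a ``blob <length>\0`` header prepended. The helper names below are hypothetical, and the snippet only builds the HTTPS URL without downloading anything.

```python
import hashlib

# HTTPS prefix of the bucket, as given above.
S3_HTTP_ROOT = "https://softwareheritage.s3.amazonaws.com/content/"


def sha1_hex(data):
    """Plain SHA-1 of the raw bytes: the key used in the bucket."""
    return hashlib.sha1(data).hexdigest()


def sha1_git_hex(data):
    """SHA-1 as computed by Git for a blob: header plus raw bytes."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_url(data):
    """HTTPS URL under which these bytes would be stored, if archived."""
    return S3_HTTP_ROOT + sha1_hex(data)


empty = b""
print(sha1_hex(empty))      # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(sha1_git_hex(empty))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

When starting from a ``sha1_git`` found in the datasets or in a SWHID, the corresponding ``sha1`` has to be looked up first (for example in the content table of the dataset export), since one hash cannot be derived from the other.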
Possible bias
-------------
Statistical analyses on the archive may be biased by the way the archive
collects source code. This section details the main biases to be aware of
when performing research on the archive.
Code and configuration changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Software Heritage's codebase evolves over time, and the archive adds support
for new forges regularly.
Major changes are documented in the `archive changelog`_.
Typically, this means that source code deleted from a given forge before
Software Heritage started archiving that forge is missing,
which may cause code hosted in less popular places to be underrepresented
in the archive.
.. _archive changelog: archive-changelog
Large objects
^^^^^^^^^^^^^
Some source code repositories, such as Chromium's and Linux's git repositories
and their clones, are particularly large.
This is a challenge for loaders, which may fail to load them more often
than they fail on smaller repositories.
Software Heritage also does not archive any object larger than 300MB, as they
are unlikely to be source code, and would put unreasonable load on the archive.
Non-code objects
^^^^^^^^^^^^^^^^
Software Heritage collects data indiscriminately from code hosting places.
Sometimes, this includes repositories used to host non-code content and/or
binary code.
user/index.rst
@@ -9,6 +9,7 @@ Getting started
* :ref:`listers`
* :ref:`loaders`
* :ref:`swh_scn_webhooks`
* :ref:`using-swh-data`
Indices and tables