Commit 1831f873 authored by Antoine Pietri

swh-graph: azure: first draft of azure docs

parent 146208d4
Branches azuregraphdocs
datasets/swh_graph_*
TABLES := content skipped_content directory directory_entry_file \
          directory_entry_dir directory_entry_rev person revision revision_history \
          release snapshot snapshot_branches snapshot_branch origin origin_visit

DATASETS := $(addprefix datasets/swh_graph_,$(TABLES))
OVERVIEWS := $(addsuffix /overview.md,$(DATASETS))
CONFIGS := $(addsuffix /config.json,$(DATASETS))
TARGETS := $(OVERVIEWS) $(CONFIGS) datasets/swh_graph/config.json

all: $(TARGETS)

datasets/swh_graph_%/overview.md:
	mkdir -p $$( dirname $@ )
	cat dataset_stub.md > $@
	sed -n '/^- \+\*\*$*\*\*/,/^-/p' ../schema.rst | head -n-1 >> $@

datasets/swh_graph/config.json: config_template.json
	cat config_template.json |\
	    jq '.Id = "software-heritage-graph-dataset"' |\
	    jq '.Slug = "software-heritage-graph-dataset"' |\
	    jq '.Name = "Software Heritage Graph Dataset"' |\
	    jq '.DataAccess.AzureDatabricks.python."azureml-opendatasets" = "Notebooks/software-heritage-graph-dataset/swh-graph-example-notebook.ipynb"' \
	    > $@

datasets/swh_graph_%/config.json:
	cat config_template.json |\
	    jq '.Id = "software-heritage-graph-dataset-$*"' |\
	    jq '.Slug = "software-heritage-graph-dataset-$*"' |\
	    jq '.Name = "Software Heritage Graph Dataset: $* table"' |\
	    jq '.BlobLocation.Path = "swhgraph/2018-09-25/parquet/$*"' \
	    > $@
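The `overview.md` rule relies on a `sed` address range followed by `head -n-1` to copy one table's entry out of `../schema.rst`. As a rough illustration of what that pipeline selects, here is a minimal Python sketch; the function name `extract_table_section` and the sample text are hypothetical and not part of the commit, and the sketch only handles the first matching entry.

```python
import re

# Hypothetical two-entry excerpt in the style of schema.rst.
SAMPLE_SCHEMA = """\
- **content**: contents stored in the archive.

  - ``sha1`` (bytes): the SHA-1 of the content

- **skipped_content**: contents that were not archived.

  - ``sha1`` (bytes): the SHA-1 of the missing content
"""


def extract_table_section(schema_text, table):
    """Approximate: sed -n '/^- \\+\\*\\*TABLE\\*\\*/,/^-/p' | head -n-1"""
    start = re.compile(r"^- +\*\*" + re.escape(table) + r"\*\*")
    selected = []
    in_range = False
    for line in schema_text.splitlines():
        if not in_range:
            # Range start: the top-level bullet naming the table.
            if start.match(line):
                in_range = True
                selected.append(line)
        else:
            selected.append(line)
            # Range end: the next line starting with "-" (the next top-level bullet).
            if line.startswith("-"):
                break
    # head -n-1 drops the last selected line, normally the next table's bullet.
    return "\n".join(selected[:-1])


print(extract_table_section(SAMPLE_SCHEMA, "content"))
```

One consequence worth noting: if an entry runs to the end of the input without a following top-level bullet, the range has no closing line, so dropping the last line removes a line of real content. In the sketch this happens for `skipped_content`, and the shell pipeline would seem to behave the same way for whichever entry comes last in the file.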
{
  "Version": 2,
  "Id": "software-heritage-graph-dataset%%TABLE_SLUG%%",
  "Slug": "software-heritage-graph-dataset%%TABLE_SLUG%%",
  "Name": "Software Heritage Graph Dataset%%TABLE_TITLE%%",
  "DataFormat": {
    "Type": "Parquet"
  },
  "IconUrl": "https://swhopendataset.blob.core.windows.net/swhgraph/swh-logo.svg",
  "Tags": [
    "software heritage",
    "graph dataset",
    "development history",
    "software repositories",
    "source code",
    "open source software",
    "free software",
    "development history graph"
  ],
  "ProfileIntervalInSeconds": "TODO",
  "BootstrapTimeUtc": "TODO",
  "Triaged": "TODO"
}
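The per-table `config.json` rule overwrites four fields of this template with `jq`. A minimal Python sketch of the same substitution might look as follows; `render_table_config` is a hypothetical helper rather than something the commit provides, and the field values are copied from the Makefile rule.

```python
import copy
import json


def render_table_config(template, table):
    """Return a copy of the template with the per-table fields set."""
    config = copy.deepcopy(template)  # leave the caller's template untouched
    config["Id"] = f"software-heritage-graph-dataset-{table}"
    config["Slug"] = f"software-heritage-graph-dataset-{table}"
    config["Name"] = f"Software Heritage Graph Dataset: {table} table"
    # Like jq's path assignment, create the intermediate object if it is absent.
    config.setdefault("BlobLocation", {})["Path"] = f"swhgraph/2018-09-25/parquet/{table}"
    return config


# Trimmed-down stand-in for config_template.json.
template = {"Version": 2, "DataFormat": {"Type": "Parquet"}}
print(json.dumps(render_table_config(template, "revision"), indent=2))
```

Because the fields are assigned wholesale, the `%%TABLE_SLUG%%` and `%%TABLE_TITLE%%` placeholders in the template are replaced along with the rest of the value, not substituted textually.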
This dataset is part of the [Software Heritage Graph open
dataset](https://azure.microsoft.com/en-us/services/open-datasets/catalog/software-heritage-graph/).
Please refer to the main dataset page for documentation and examples.
This is the Software Heritage graph dataset: a fully-deduplicated Merkle DAG
representation of the Software Heritage archive. The dataset links together
file content identifiers, source code directories, Version Control System (VCS)
commits tracking evolution over time, up to the full states of VCS repositories
as observed by Software Heritage during periodic crawls. The dataset's contents
come from major development forges (including [GitHub](https://github.com/) and
[GitLab](https://gitlab.com)), FOSS distributions (e.g., [Debian](https://www.debian.org/)),
and language-specific package managers (e.g., [PyPI](https://pypi.org/)).
Crawling information is also included, providing timestamps about when and
where all archived source code artifacts have been observed in the wild.
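
To give an intuition for what "fully-deduplicated Merkle DAG" means, here is a toy Python sketch. It is not the identifier scheme used by Software Heritage: the `node_id` function, the `ToyArchive` class, and the way children are hashed are all simplifying assumptions made for illustration. The only point is that a node's identifier is derived from its own data plus its children's identifiers, so identical subtrees are stored once.

```python
import hashlib


def node_id(kind, payload=b"", children=()):
    """Toy identifier: hash of the node kind, its payload, and its children's ids."""
    h = hashlib.sha1()
    h.update(kind.encode() + b"\0" + payload)
    for child in sorted(children):
        h.update(b"\0" + child.encode())
    return h.hexdigest()


class ToyArchive:
    """Stores each node once, keyed by its content-derived identifier."""

    def __init__(self):
        self.nodes = {}

    def add(self, kind, payload=b"", children=()):
        nid = node_id(kind, payload, children)
        # Adding an identical node again is a no-op: this is the deduplication.
        self.nodes.setdefault(nid, (kind, payload, tuple(sorted(children))))
        return nid


archive = ToyArchive()
readme = archive.add("content", b"hello\n")
license_ = archive.add("content", b"MIT\n")
dir_a = archive.add("directory", children=[readme, license_])
dir_b = archive.add("directory", children=[license_, readme])  # same files, seen elsewhere
rev_1 = archive.add("revision", b"first commit", [dir_a])
rev_2 = archive.add("revision", b"second commit", [dir_b, rev_1])

print(dir_a == dir_b, len(archive.nodes))
```

In this toy model, the two directories collapse into a single node, and both revisions point at it.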
The Software Heritage graph dataset is also available for download in other
formats, including CSV dumps and Apache Parquet files for local use.
By accessing the dataset, you agree with the Software Heritage [Ethical Charter
for using the archive
data](https://www.softwareheritage.org/legal/users-ethical-charter/), and the
[terms of use for bulk
access](https://www.softwareheritage.org/legal/bulk-access-terms-of-use/).
If you use this dataset for research purposes, please cite the following paper:
- Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
*The Software Heritage Graph Dataset: Public software development under one
roof.*
In proceedings of [MSR 2019](http://2019.msrconf.org/): The 16th
International Conference on Mining Software Repositories, May 2019,
Montreal, Canada. Co-located with [ICSE
2019](https://2019.icse-conferences.org/).
[preprint](https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf),
[bibtex](https://upsilon.cc/~zack/research/publications/msr-2019-swh.bib)
@@ -9,14 +9,14 @@ A simplified view of the corresponding database schema is shown here:
This page documents the details of the schema.
- **content**: contains information on the contents stored in
  the archive.

  - ``sha1`` (bytes): the SHA-1 of the content
  - ``sha1_git`` (bytes): the Git SHA-1 of the content
  - ``length`` (integer): the length of the content

- **skipped_content**: contains information on the contents that were not archived for
  various reasons.

  - ``sha1`` (bytes): the SHA-1 of the missing content
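
The three ``content`` fields listed above can be illustrated with a short Python sketch. It assumes that "Git SHA-1" refers to the hash Git computes for a blob object (the SHA-1 of a ``blob <length>\0`` header followed by the data), that ``length`` is counted in bytes, and that hashes are stored as raw bytes; the ``content_row`` helper is hypothetical and not part of the dataset tooling.

```python
import hashlib


def content_row(data):
    """Build the three documented fields for one file content (assumed semantics)."""
    # Git hashes a blob as: b"blob " + decimal length + NUL byte + raw data.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(header + data).digest(),
        "length": len(data),
    }


row = content_row(b"")
print(row["sha1"].hex())      # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(row["sha1_git"].hex())  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(row["length"])          # 0
```

The second value is the identifier Git reports for an empty file, which makes it a convenient sanity check.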