Skip to content
GitLab
Explore
Sign in
Register
Primary navigation
Search or go to…
Project
S
swh-dataset
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Snippets
Build
Pipelines
Jobs
Pipeline schedules
Artifacts
Deploy
Releases
Package registry
Model registry
Operate
Environments
Terraform modules
Monitor
Incidents
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Antoine Lambert
swh-dataset
Commits
f930c89d
Commit
f930c89d
authored
5 years ago
by
Antoine Pietri
Browse files
Options
Downloads
Patches
Plain Diff
schema: finish writing
parent
9d1f9c1d
No related branches found
No related tags found
No related merge requests found
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
docs/graph/schema.rst
+106
-92
106 additions, 92 deletions
docs/graph/schema.rst
with
106 additions
and
92 deletions
docs/graph/schema.rst
+
106
−
92
View file @
f930c89d
...
...
@@ -12,117 +12,131 @@ This page documents the details of the schema.
- **content**: contains information on the contents stored in
the archive.
- ``sha1`` (bytes): the SHA-1 of the content
- ``sha1_git`` (bytes): the Git SHA-1 of the content
- ``length`` (integer): the length of the content
- ``sha1`` (bytes): the SHA-1 of the content
- ``sha1_git`` (bytes): the Git SHA-1 of the content
- ``length`` (integer): the length of the content
- **skipped_content**: contains information on the contents that were not archived for
various reasons.
- ``sha1`` (bytes): the SHA-1 of the missing content
- ``sha1_git`` (bytes): the Git SHA-1 of the missing content
- ``length`` (integer): the length of the missing content
- ``sha1`` (bytes): the SHA-1 of the missing content
- ``sha1_git`` (bytes): the Git SHA-1 of the missing content
- ``length`` (integer): the length of the missing content
- **directory**: contains the directories stored in the archive.
- ``id`` (bytes): the intrinsic identifier of the directory, recursively
computed with the Git SHA-1 algorithm
- ``dir_entries`` (array of integers): the list of directories contained in
this directory, as references to an entry in the ``directory_entry_dir``
table.
- ``file_entries`` (array of integers): the list of files contained in
this directory, as references to an entry in the ``directory_entry_file``
table.
- ``rev_entries`` (array of integers): the list of revisions contained in
this directory, as references to an entry in the ``directory_entry_rev``
table.
- ``id`` (bytes): the intrinsic identifier of the directory, recursively
computed with the Git SHA-1 algorithm
- ``dir_entries`` (array of integers): the list of directories contained in
this directory, as references to an entry in the ``directory_entry_dir``
table.
- ``file_entries`` (array of integers): the list of files contained in
this directory, as references to an entry in the ``directory_entry_file``
table.
- ``rev_entries`` (array of integers): the list of revisions contained in
this directory, as references to an entry in the ``directory_entry_rev``
table.
- **directory_entry_file**: contains informations about file entries in
directories.
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the content this entry points to
- ``name`` (bytes): the name of the file (basename of its path)
- ``perms`` (integer): the permissions of the file
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the content this entry points to
- ``name`` (bytes): the name of the file (basename of its path)
- ``perms`` (integer): the permissions of the file
- **directory_entry_dir**: contains informations about directory entries in
directories.
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the directory this entry points to
- ``name`` (bytes): the name of the directory
- ``perms`` (integer): the permissions of the directory
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the directory this entry points to
- ``name`` (bytes): the name of the directory
- ``perms`` (integer): the permissions of the directory
- **directory_entry_rev**: contains informations about revision entries in
directories.
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the revision this entry points to
- ``name`` (bytes): the name of the directory that contains this revision
- ``perms`` (integer): the permissions of the revision
- ``id`` (integer): unique identifier for the entry
- ``target`` (bytes): the Git SHA-1 of the revision this entry points to
- ``name`` (bytes): the name of the directory that contains this revision
- ``perms`` (integer): the permissions of the revision
- **revision**: contains the revisions stored in the archive.
- **person**: deduplicates commit authors by their names and e-mail addresses.
For pseudonymization purposes and in order to prevent abuse, these columns
were removed from the dataset, and this table only contains the ID of the
author. Individual authors may be retrieved using this ID from the Software
Heritage api.
- ``id`` (integer): the identifier of the person
- ``id`` (bytes): the intrinsic identifier of the revision, recursively
computed with the Git SHA-1 algorithm. For Git repositories, this
corresponds to the commit hash.
- The ``revision`` table contains all the revisions, identified by
their intrinsic hash in the ``id`` field. Each revision points to the
root directory of the project source tree, identified by the
``directory`` field which references the ``sha1_git`` cryptographic
hash of the directory. The table also contains metadata on the
revisions, notably the ``author`` and ``committer`` fields, the
``date`` and ``committer_date`` fields and the ``message`` field.
Each revision has an ordered set of parents (0 for the initial commit
of a repository, 1 for a normal commit and 2 or more for a merge
commit). These parents are stored in the ``revision_history`` table,
one row per parent. Each parent is identified by the ``id``
identifier, pointing to the hash of the revision, the ``parent_id``
identifier, pointing to the hash of the parent revision, and the
``parent_rank`` integer which defines the order of the parents of
each revision.
- The ``person`` table deduplicates commit authors by their name and
e-mail addresses. For pseudonymization purposes and in order to
prevent abuse, these columns were removed from the dataset, and this
table only contains the ``id`` column referenced by the ``author``
and ``committer`` fields of the ``revision`` table. Individual
authors may be retrieved using this ID from the Software Heritage
api.
- The ``release`` table contains the releases in the archive. They are
also identified by their intrinsic hash ``id`` and point to a
revision referenced by its hash in the ``target`` field. The metadata
fields are semantically similar to the ``revision`` table (i.e
``author``, ``date``, ``message``).
- The ``snapshot`` table contains the list of snapshots identified by
their intrinsic hash ``id``, and their integer primary key in the
archive ``object_id``. Each snapshot maps to a list of branches
listed in the table ``snapshot_branch`` through the many-to-many
relationship intermediate table ``snapshot_branches``, which
references the ``object_id`` fields of the ``snapshot`` and
``snapshot_branch`` tables. The ``snapshot_branch`` table also
contains the ``name`` of the branch and the ``target`` it points to
(identified by its intrinsic hash), either a ``release``,
``revision``, ``directory`` or ``content`` object depending on the
value of the ``target_type`` field.
In addition to the nodes and edges of the graph, the dataset also
contains crawling information, as a set of triples capturing where (an
origin url) and when (a timestamp) a given snapshot has been
encountered.
- The ``origin`` table contains the origins from which the software
projects in the dataset were archived, identified by their ``id``
identifier, and ``type`` and ``url`` metadata.
Since Software Heritage archives software continuously, software
origins are crawled more than once. Every “visit” of an origin is
stored in the ``origin_visit`` table, which contains the identifier
``origin`` of the origin visited, the ``date`` of the visit and a
``snapshot_id`` integer which points to the ``object_id`` identifier
of the ``snapshot`` table.
- **revision**: contains the revisions stored in the archive.
- ``id`` (bytes): the intrinsic identifier of the revision, recursively
computed with the Git SHA-1 algorithm. For Git repositories, this
corresponds to the revision hash.
- ``date`` (timestamp): the date the revision was authored
- ``committer_date`` (timestamp): the date the revision was committed
- ``author`` (integer): the author of the revision
- ``committer`` (integer): the committer of the revision
- ``message`` (bytes): the revision message
- ``directory`` (bytes): the Git SHA-1 of the directory the revision points
to. Every revision points to the root directory of the project source
tree to which it corresponds.
- **revision_history**: contains the ordered set of parents of each revision.
Each revision has an ordered set of parents (0 for the initial commit of a
repository, 1 for a regular commit, 2 for a regular merge commit and 3 or
more for octopus-style merge commits).
- ``id`` (bytes): the Git SHA-1 identifier of the revision
- ``parent_id`` (bytes): the Git SHA-1 identifier of the parent
- ``parent_rank`` (integer): the rank of the parent which defines the total
order of the parents of the revision
- **release**: contains the releases stored in the archive.
- ``id`` (bytes): the intrinsic identifier of the release, recursively
computed with the Git SHA-1 algorithm.
- ``target`` (bytes): the Git SHA-1 of the object the release points to.
- ``date`` (timestamp): the date the release was created
- ``author`` (integer): the author of the revision
- ``name`` (bytes): the release name
- ``message`` (bytes): the release message
- **snapshot**: contains the list of snapshots stored in the archive.
- ``id`` (bytes): the intrinsic identifier of the snapshot, recursively
computed with the Git SHA-1 algorithm.
- ``object_id`` (integer): the primary key of the snapshot
- **snapshot_branches**: contains the identifiers of branches associated with
each snapshot. This is an intermediary table through which is represented the
many-to-many relationship between snapshots and branches.
- ``snapshot_id`` (integer): the integer identifier of the snapshot
- ``branch_id`` (integer): the identifier of the branch
- **snapshot_branch**: contains the list of branches.
- ``object_id`` (integer): the identifier of the branch
- ``name`` (bytes): the name of the branch
- ``target`` (bytes): the Git SHA-1 of the object the branch points to
- ``target_type`` (string): the type of object the branch points to (either
``release``, ``revision``, ``directory`` or ``content``).
- **origin**: the software origins from which the projects in the dataset were
archived.
- ``id`` (integer): the identifier of the origin
- ``url`` (bytes): the URL of the origin
- ``type`` (string): the type of origin (e.g ``git``, ``pypi``, ``hg``,
``svn``, ``git``, ``ftp``, ``deb``, ...)
- **origin_visit**: the different visits of each origin. Since Software
Heritage archives software continuously, software origins are crawled more
than once. Each of these "visits" is an entry in this table.
- ``origin``: (integer) the identifier of the origin visited
- ``date``: (timestamp) the date at which the origin was visited
- ``snapshot_id`` (integer): the integer identifier of the snapshot archived
in this visit.
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment