swh-git-loader - Specification (draft)
The Software Heritage Git Loader is a tool and a library that walks a local Git repository and injects into the Software Heritage (SWH) dataset every file it contains that was not previously known.
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
See the top-level LICENSE file for the full text of the GNU General Public License.
Dependencies
Runtime
- python3
- python3-psycopg2
- python3-pygit2
Test
- python3-nose
Requirements
Functional
- input: a Git bare repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the paths on the filesystem that contain the corresponding content (AKA the file cache)
- input (optional): a set of SHA1s of Git commits that have already been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, to which the content of every blob referenced by any Git object has been added
Algorithm
Sketch of the (naive) algorithm that the Git loader should execute:
for each ref in the repo
    for each commit referenced by the commit graph starting at that ref
        if we have a git commit cache and the commit is in there: stop treating the current commit sub-graph
        for each tree referenced by the commit
            for each blob referenced by the tree
                compute the SHA256 checksum of the blob
                look up the checksum in the file cache
                if it is not there
                    add the file to the dataset on the filesystem
                    add the file to the file cache, pointing to the file path on the filesystem
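The sketch above can be expressed in Python roughly as follows. This is a minimal in-memory illustration and not the loader's actual code: the layout of the `repo` dictionary, the `load` function and the `store` callback are hypothetical names chosen for the example. A real implementation would walk the repository through pygit2 and keep both caches in Postgres tables.

```python
import hashlib


def load(repo, file_cache, commit_cache, store):
    """Walk every ref and store blobs whose SHA256 is not in file_cache.

    repo is a hypothetical in-memory stand-in for a Git repository:
      repo["refs"]:    ref name  -> commit id
      repo["commits"]: commit id -> {"tree": tree id, "parents": [commit ids]}
      repo["trees"]:   tree id   -> list of ("blob" | "tree", object id)
      repo["blobs"]:   blob id   -> bytes
    store(sha256_hex, data) must write the content and return its path.
    """
    for start in repo["refs"].values():
        pending = [start]
        while pending:
            commit_id = pending.pop()
            if commit_id in commit_cache:
                continue  # known commit: do not walk its sub-graph again
            # Assumption of this sketch: visited commits are recorded in the
            # cache, so shared history is walked only once.
            commit_cache.add(commit_id)
            commit = repo["commits"][commit_id]
            _load_tree(repo, commit["tree"], file_cache, store)
            pending.extend(commit["parents"])


def _load_tree(repo, tree_id, file_cache, store):
    """Recursively visit a tree and store every unknown blob."""
    for kind, object_id in repo["trees"][tree_id]:
        if kind == "tree":
            _load_tree(repo, object_id, file_cache, store)
            continue
        data = repo["blobs"][object_id]
        checksum = hashlib.sha256(data).hexdigest()
        if checksum not in file_cache:
            file_cache[checksum] = store(checksum, data)
```

Recording visited commits in `commit_cache` is an addition of this sketch. The specification only says that a cached commit stops the walk of its sub-graph.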
Non-functional
- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables
File-system storage
A file whose SHA256 is b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c will be stored at STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
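The mapping from checksum to path can be sketched as below. The function name `storage_path` and its `depth` parameter are assumptions made for this illustration, and the default of 4 directory levels is taken from the example above.

```python
import os


def storage_path(storage_root, sha256_hex, depth=4):
    """Return the on-disk path for a content identified by its SHA256.

    The first `depth` bytes (two hexadecimal characters each) become nested
    directories; the remaining characters become the file name.
    """
    if len(sha256_hex) != 64:
        raise ValueError("expected a 64-character hexadecimal SHA256")
    prefixes = [sha256_hex[2 * i:2 * i + 2] for i in range(depth)]
    rest = sha256_hex[2 * depth:]
    return os.path.join(storage_root, *(prefixes + [rest]))
```

Spreading files over nested directories keeps any single directory from holding too many entries.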
Configuration
swh-git-loader depends on some tools. Their configuration files are described below:
swh-db-manager
This is a tool in charge of db management (cleaning up data, bootstrapping the model, etc.).
Create a configuration file in ~/.config/db-manager.ini:
[main]
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
db_url = dbname=swhgitloader
See http://initd.org/psycopg/docs/module.html#psycopg2.connect for the format of the db url.
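As an illustration of how such a file can be consumed, the sketch below parses the same content with Python's standard `configparser`. The helper name `read_main_section` is hypothetical, and the tools themselves may read their configuration differently.

```python
import configparser

EXAMPLE_INI = """\
[main]
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
db_url = dbname=swhgitloader
"""


def read_main_section(ini_text):
    """Parse ini_text and return its [main] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["main"])


conf = read_main_section(EXAMPLE_INI)
# conf["db_url"] is a libpq connection string. With psycopg2 installed and
# the database available, it could be passed as-is:
#     psycopg2.connect(conf["db_url"])
```

Only the first `=` on a line separates the key from the value, so `db_url` keeps the full `dbname=swhgitloader` string.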
swh-git-loader
Create a configuration file in ~/.config/swh/git-loader.ini:
[main]
# Where to store the logs
log_dir = /tmp/swh-git-loader/log
# url access to api's backend
backend_url = http://localhost:5000
Note:
- DB url DSL
- the configuration file can be overridden on the command line with the flag `-c <config-filepath>` or `--config-file <config-filepath>`
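A flag of this kind could be declared with `argparse` as sketched below. This assumes nothing about how bin/swh-git-loader actually parses its arguments. Only the flag names and the default path come from this document; `build_parser` and `DEFAULT_CONFIG` are hypothetical names.

```python
import argparse
import os

# Default location documented above.
DEFAULT_CONFIG = os.path.expanduser("~/.config/swh/git-loader.ini")


def build_parser():
    """Build a parser exposing the -c / --config-file option."""
    parser = argparse.ArgumentParser(prog="swh-git-loader")
    parser.add_argument("-c", "--config-file", default=DEFAULT_CONFIG,
                        help="path to the configuration file")
    return parser


args = build_parser().parse_args(["--config-file", "/tmp/custom.ini"])
# args.config_file now holds the path given on the command line.
```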
swh-backend
Backend API.
Create a configuration file in ~/.config/swh/back.ini:
[main]
# where to store blob on disk
content_storage_dir = /tmp/swh-git-loader/content-storage
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db: dbname=<dbname> (host=<host> port=<port> user=<user> password=<pass>)
db_url = dbname=swhgitloader
# activate the compression for each vcs stored object
# storage_compression = true
# compute folder's depth on disk aa/bb/cc/dd
# folder_depth = 2
# Debugger (for dev only)
debug = true
See http://initd.org/psycopg/docs/module.html#psycopg2.connect for the format of the db url.
Tryouts
PUT on commits:
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:04:40]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:04:44 GMT
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:05]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 18
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:19 GMT
Successful update!%
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:19]
$ curl http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
{
"sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}%
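The same PUT can be prepared from Python with the standard library, as in the sketch below. The helper `build_put_commit` is a hypothetical name. The request is only built, not sent, so the snippet runs without a backend. The body is form-encoded, which is what `curl -d 'date=1'` sends.

```python
import urllib.parse
import urllib.request

BACKEND_URL = "http://localhost:5000"  # backend_url from the configuration


def build_put_commit(sha1, date, backend_url=BACKEND_URL):
    """Build (without sending) the PUT request for /commits/<sha1>."""
    body = urllib.parse.urlencode({"date": date}).encode("ascii")
    url = "{}/commits/{}".format(backend_url, sha1)
    return urllib.request.Request(url, data=body, method="PUT")


request = build_put_commit("52745df6dd5dc46ee476a8be155ab049994f714e", 1)
# With the backend running, the request could be sent with:
#     with urllib.request.urlopen(request) as response:
#         print(response.status)
```

In the transcript above, the first PUT on a commit answered 204 and the second answered 200.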
GET/PUT on blob:
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:24]
$ curl -i http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 404 NOT FOUND
Content-Type: text/html; charset=utf-8
Content-Length: 10
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:33 GMT
Not found!%
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:33]
$ curl -i -XPUT -H'application/json' -d'git-sha1=456' -d'size=10' http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:13:47 GMT
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:13:47]
$ curl http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
{
"sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}%
Run
Environment initialization
export PYTHONPATH=`pwd`:$PYTHONPATH
Help
bin/swh-git-loader --help
bin/swh-db-manager --help
Parse a repository from a clean slate
Clean and initialize the model, then parse the Git repository:
bin/swh-db-manager cleandb
bin/swh-db-manager initdb
bin/swh-git-loader load /path/to/git/repo
For convenience:
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
Parse an existing repository
bin/swh-git-loader load /path/to/git/repo
Clean data
bin/swh-db-manager cleandb
For convenience:
make cleandb
Init data
bin/swh-db-manager initdb
Log
Format
Activating the debug mode (flag `-v` or `--verbose`) will log more information in the following format: <action-verb> <nature-object> <sha1-name-or-path>
where:
<action-verb>
- walk: walk a tree or a reference
- skip: skip an already saved/visited object or an unknown object (e.g. a submodule commit)
- store: save an object in the db (file or object) and in the content (file or object) storage
- initialize: initialize the db
- clean: clean the db’s data
<nature-object>
- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule
- unknown-action: an unknown action from swhgitloader’s cli
<sha1-name-or-path>
- sha1: git or swh’s sha1
- name: object name
- path: object’s content storage path
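The three-field format can be illustrated with a small helper. This is a sketch only: `format_log_line` and the two constant sets are hypothetical names, the allowed values are copied from the lists above, and the strict validation is an assumption rather than documented behaviour.

```python
ACTION_VERBS = {"walk", "skip", "store", "initialize", "clean"}
OBJECT_NATURES = {"tree", "commit", "blob", "reference",
                  "submodule-commit", "unknown-action"}


def format_log_line(action, nature, target):
    """Return '<action-verb> <nature-object> <sha1-name-or-path>'."""
    if action not in ACTION_VERBS:
        raise ValueError("unknown action verb: {}".format(action))
    if nature not in OBJECT_NATURES:
        raise ValueError("unknown object nature: {}".format(nature))
    return "{} {} {}".format(action, nature, target)


line = format_log_line("store", "blob",
                       "52745df6dd5dc46ee476a8be155ab049994f714e")
```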
Folder
The different tools can be configured in their respective .ini files. By default, they log inside the swh-git-loader/log folder.