swh-storage
===========

Abstraction layer over the archive, allowing access to all stored source code
artifacts as well as their metadata.

See the
[documentation](https://docs.softwareheritage.org/devel/swh-storage/index.html)
for more details.

## Quick start
### Dependencies
The Python tests for this module include tests that cannot be run without a
local PostgreSQL database, so you need the PostgreSQL server executable on your
machine (a running PostgreSQL server is not required). They also expect a
Cassandra server.

#### Debian-like host

```
$ sudo apt install libpq-dev postgresql-11 cassandra
```

#### Non Debian-like host

The tests expect the path to `cassandra` to be either specified through the
`SWH_CASSANDRA_BIN` environment variable or left unspecified, in which case it
is looked up at `/usr/sbin/cassandra`.
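
The lookup described above can be sketched in Python as follows. This only illustrates the documented behavior and is not the code used by the test suite; the helper name `cassandra_bin` is hypothetical.

```python
import os

# Default location used when SWH_CASSANDRA_BIN is not set (taken from the text above).
DEFAULT_CASSANDRA_BIN = "/usr/sbin/cassandra"


def cassandra_bin(environ=None):
    """Return the path to the cassandra executable (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    # The environment variable, when present, takes precedence over the default.
    return environ.get("SWH_CASSANDRA_BIN", DEFAULT_CASSANDRA_BIN)


print(cassandra_bin({}))
# /usr/sbin/cassandra
print(cassandra_bin({"SWH_CASSANDRA_BIN": "/opt/cassandra/bin/cassandra"}))
# /opt/cassandra/bin/cassandra
```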

Optionally, you can skip the Cassandra tests:
```
(swh) :~/swh-storage$ tox -- -m 'not cassandra'
```

### Installation

It is strongly recommended to use a virtualenv. In the following, we
assume you work in a virtualenv named `swh`. See the
[developer setup guide](https://docs.softwareheritage.org/devel/developer-setup.html#developer-setup)
for more details on how to set up a working environment.


You can install the package directly from
[pypi](https://pypi.org/p/swh.storage):
```
(swh) :~$ pip install swh.storage
[...]
```

Or from sources:

```
(swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git
[...]
(swh) :~$ cd swh-storage
(swh) :~/swh-storage$ pip install .
[...]
```

Then you can check that it is properly installed:
```
(swh) :~$ swh storage --help
Usage: swh storage [OPTIONS] COMMAND [ARGS]...

  Software Heritage Storage tools.

Options:
  -h, --help  Show this message and exit.

Commands:
  rpc-serve  Software Heritage Storage RPC server.
```
## Tests
The best way of running Python tests for this module is to use
[tox](https://tox.readthedocs.io/).

```
(swh) :~$ pip install tox
```

### tox
From the sources directory, simply use tox:
```
(swh) :~/swh-storage$ tox
[...]
========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ==========
_______________________________ summary ________________________________
  flake8: commands succeeded
  py3: commands succeeded
  congratulations :)
```

Note: it is possible to set the `JAVA_HOME` environment variable to specify the
version of the JVM to be used by Cassandra. For example, at the time of
writing, Cassandra is meant to be run with Java 11. On Debian bookworm, one
needs to manually install `openjdk-11-jre-headless` from bullseye or unstable
and point `JAVA_HOME` at the corresponding installation directory:

```
(swh) :~/swh-storage$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

## Development

The storage server can be started locally. It requires a configuration file and
a running PostgreSQL database.

### Sample configuration

A typical `storage.yml` configuration file is:

```
storage:
  cls: postgresql
  db: "dbname=softwareheritage-dev user=<user> password=<pwd>"
  objstorage:
    cls: pathslicing
    root: /tmp/swh-storage/
    slicing: 0:2/2:4/4:6
```


This configuration uses:

- a local storage instance whose database connection points to the local
  `softwareheritage-dev` database,
- a local objstorage instance whose:
  - `root` path is `/tmp/swh-storage/`,
  - slicing scheme is `0:2/2:4/4:6`. This means that the identifier of
    the content (sha1) determines where it is stored on disk: the
    first-level directory is named after the first 2 hex characters,
    the second-level directory after the next 2 hex characters and the
    third-level directory after the next 2 hex characters. The file
    holding the raw content is named after the complete hash. For
    example, `00062f8bd330715c4f819373653d97b3cd34394c` is stored at
    `00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c`.
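
To make the slicing scheme concrete, here is a minimal Python sketch that maps a content identifier to its on-disk path. The function name `sliced_path` is hypothetical and this is not the objstorage implementation; it only reproduces the rule described above for a scheme written as `start:end` pairs separated by `/`.

```python
import posixpath


def sliced_path(root, hex_id, slicing="0:2/2:4/4:6"):
    """Build the on-disk path of an object (hypothetical helper, not the swh API)."""
    levels = []
    for bounds in slicing.split("/"):
        start, end = bounds.split(":")
        # Each "start:end" pair selects a slice of the hex identifier,
        # which names one directory level.
        levels.append(hex_id[int(start):int(end)])
    # The file itself is named after the complete hash.
    return posixpath.join(root, *levels, hex_id)


print(sliced_path("/tmp/swh-storage/", "00062f8bd330715c4f819373653d97b3cd34394c"))
# /tmp/swh-storage/00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c
```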

Note that the `root` path should exist on disk before starting the server.
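
One possible way to create that directory from Python before starting the server is sketched below. The helper name `ensure_root` is hypothetical, and the path is the one from the sample configuration.

```python
from pathlib import Path


def ensure_root(root):
    """Create the objstorage root directory if it does not exist yet."""
    path = Path(root)
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call a no-op when the directory is already there.
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    ensure_root("/tmp/swh-storage/")
```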

### Starting the storage server

If the Python package has been properly installed (e.g. in a virtualenv), you
should be able to use the command:

```
(swh) :~/swh-storage$ swh storage -C storage.yml rpc-serve
```

This runs a local swh-storage API on port 5002.

```
(swh) :~/swh-storage$ curl http://127.0.0.1:5002
<html>
<head><title>Software Heritage storage server</title></head>
<body>
<p>You have reached the
<a href="https://www.softwareheritage.org/">Software Heritage</a>
storage server.<br />
See its
<a href="https://docs.softwareheritage.org/devel/swh-storage/">documentation
and API</a> for more information</p>
```
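
The same check can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `storage_is_up` is hypothetical, and the URL and the expected "Software Heritage" text are taken from the `curl` output above.

```python
import urllib.error
import urllib.request


def storage_is_up(url="http://127.0.0.1:5002", timeout=5.0):
    """Return True if the server answers with its landing page (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False
    return response.status == 200 and "Software Heritage" in body


if __name__ == "__main__":
    print(storage_is_up())
```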

In your upper layer
([loader-git](https://forge.softwareheritage.org/source/swh-loader-git/),
[loader-svn](https://forge.softwareheritage.org/source/swh-loader-svn/),
etc.), you can define a remote storage in the YAML configuration by setting
`cls: remote` and pointing `url` at the server started above (here
`http://localhost:5002/`).

You could also directly define a PostgreSQL storage with the following snippet:

```
storage:
  cls: postgresql
  db: service=swh-dev
  objstorage:
    cls: pathslicing
    root: /home/storage/swh-storage/
    slicing: 0:2/2:4/4:6
```

## Cassandra

As an alternative to PostgreSQL, swh-storage can use Cassandra as a database backend.
It can be used like this:

```
storage:
  cls: cassandra
  hosts:
    - localhost
  objstorage:
    cls: pathslicing
    root: /home/storage/swh-storage/
    slicing: 0:2/2:4/4:6
```

The Cassandra swh-storage implementation supports both Cassandra >= 4.0-alpha2
and ScyllaDB >= 4.4 (and possibly earlier versions, but this is untested).

While the main code supports both transparently, running tests
or configuring the schema requires specific code when using ScyllaDB,
enabled by setting the `SWH_USE_SCYLLADB=1` environment variable.