Skip to content
Snippets Groups Projects

Update mirror docker docs following a walkthrough

Compare and
1 file
+ 156
129
Compare changes
  • Side-by-side
  • Inline
@@ -11,8 +11,8 @@ Deploy a Software Heritage stack with docker deploy
Prerequisities
--------------
According you have a properly set up docker swarm cluster with support for the
`docker stack deploy
We assume that you have a properly set up docker swarm cluster with support for
the `docker stack deploy
<https://docs.docker.com/engine/reference/commandline/stack_deploy/>`_ command,
e.g.:
@@ -24,8 +24,9 @@ e.g.:
n9mfw08gys0dmvg5j2bb4j2m7 * host1 Ready Active Leader 18.09.7
Note: on some systems (centos for example), making docker swarm works require
some permission tuning regarding the firewall and selinux.
Note: on some systems (centos for example), making docker swarm work requires some
permission tuning regarding the firewall and selinux. Please refer to `the upstream
docker-swarm documentation <https://docs.docker.com/engine/swarm/swarm-tutorial/>`_.
In the following how-to, we will assume that the service `STACK` name is `swh`
(this name is the last argument of the `docker stack deploy` command below).
@@ -33,7 +34,7 @@ In the following how-to, we will assume that the service `STACK` name is `swh`
Several preparation steps will depend on this name.
We also use `docker-compose <https://github.com/docker/compose>`_ to merge compose
files, so make sure it iavailable on your system.
files, so make sure it is available on your system.
You also need to clone the git repository:
@@ -43,15 +44,15 @@ You also need to clone the git repository:
Set up volumes
--------------
Before starting the `swh` service, you may want to specify where the data
should be stored on your docker hosts.
Before starting the `swh` service, you will certainly want to specify where the
data should be stored on your docker hosts.
By default docker will use docker volumes for storing databases and the content of
the objstorage (thus put them in `/var/lib/docker/volumes`).
**Optional:** if you want to specify a different location to put a storage in,
create the storage before starting the docker service. For example for the
`objstorage` service you will need a storage named `<STACK>_objstorage`:
**Optional:** if you want to specify a different location to put the data in,
you should create the docker volumes before starting the docker service. For
example, the `objstorage` service uses a volume named `<STACK>_objstorage`:
.. code-block:: bash
@@ -62,20 +63,32 @@ create the storage before starting the docker service. For example for the
swh_objstorage
If you want to deploy services like the `swh-objstorage` on several hosts, you
will need a shared storage area in which blob objects will be stored. Typically
a NFS storage can be used for this, or any existing docker volume driver like
`REX-Ray <https://rexray.readthedocs.io/>`_. This is not covered in this doc.
If you want to deploy services like the `objstorage` on several hosts, you will need a
shared storage area in which blob objects will be stored. Typically a NFS storage can be
used for this, or any existing docker volume driver like `REX-Ray
<https://rexray.readthedocs.io/>`_. This is not covered in this documentation.
Please read the documentation of docker volumes to learn how to use such a
device/driver as volume provider for docker.
Note that the provided `base-services.yaml` file have a few placement
constraints: containers that depends on a volume (db-storage and objstorage)
are stick to the manager node of the cluster, under the assumption persistent
volumes have been created on this node. Make sure this fits your needs, or
amend these placement constraints.
Note that the provided `base-services.yaml` file has placement constraints for the
`db-storage`, `db-web` and `objstorage` containers, that depend on the availability of
specific volumes (respectively `<STACK>_storage-db`, `<STACK>_web-db` and
`<STACK>_objstorage`). These services are pinned to specific nodes using labels named
`org.softwareheritage.mirror.volumes.<base volume name>` (e.g.
`org.softwareheritage.mirror.volumes.objstorage`).
When you create a local volume for a given container, you should add the relevant label
to the docker swarm node metadata with:
.. code-block:: bash
docker node update \
--label-add org.softwareheritage.mirror.volumes.objstorage=true \
<node_name>
You have to set the node labels, or to adapt the placement constraints to your local
requirements, for the services to start.
Managing secrets
----------------
@@ -85,17 +98,18 @@ being able to start services, you need to define these secrets.
Namely, you need to create a `secret` for:
- `postgres-password`
- `swh-mirror-db-postgres-password`
- `swh-mirror-web-postgres-password`
For example:
.. code-block:: bash
~/swh-docker$ echo 'strong password' | docker secret create postgres-password -
~/swh-docker$ xkcdpass -d- | docker secret create swh-mirror-db-postgres-password -
[...]
Creating the swh base services
Spawning the swh base services
------------------------------
If you haven't done it yet, clone this git repository:
@@ -105,52 +119,71 @@ If you haven't done it yet, clone this git repository:
~$ git clone https://forge.softwareheritage.org/source/swh-docker.git
~$ cd swh-docker
This repository provides the docker compose/stack manifests to deploy all the relevant
services.
.. note::
These manifests use a set of docker images `published in the docker hub
<https://hub.docker.com/r/softwareheritage/base/tags>`_. By default, the manifests
will use the `latest` version of these images, but for production uses, you should
set the `SWH_IMAGE_TAG` environment variable to pin them to a specific version.
To specify the tag to be used, simply set the SWH_IMAGE_TAG environment variable, like
so:
.. code-block:: bash
~/swh-docker$ export SWH_IMAGE_TAG=20211022-121751
then from within this repository, just type:
You can then spawn the base services using the following command:
.. code-block:: bash
~/swh-docker$ docker stack deploy -c base-services.yml swh
Creating network swh-mirror_default
Creating config swh-mirror_storage
Creating config swh-mirror_objstorage
Creating config swh-mirror_nginx
Creating config swh-mirror_web
Creating service swh-mirror_grafana
Creating service swh-mirror_prometheus-statsd-exporter
Creating service swh-mirror_web
Creating service swh-mirror_objstorage
Creating service swh-mirror_db-storage
Creating service swh-mirror_memcache
Creating service swh-mirror_storage
Creating service swh-mirror_nginx
Creating service swh-mirror_prometheus
Creating network swh_default
Creating config swh_storage
Creating config swh_objstorage
Creating config swh_nginx
Creating config swh_web
Creating service swh_grafana
Creating service swh_prometheus-statsd-exporter
Creating service swh_web
Creating service swh_objstorage
Creating service swh_db-storage
Creating service swh_memcache
Creating service swh_storage
Creating service swh_nginx
Creating service swh_prometheus
~/swh-docker$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
sz98tofpeb3j swh-mirror_db-storage global 1/1 postgres:11
sp36lbgfd4qi swh-mirror_grafana replicated 1/1 grafana/grafana:latest
7oja81jngiwo swh-mirror_memcache replicated 1/1 memcached:latest
y5te0gqs93li swh-mirror_nginx replicated 1/1 nginx:latest *:5081->5081/tcp
79t3r3mv3qn6 swh-mirror_objstorage replicated 1/1 softwareheritage/base:20200918-133743
l7q2zocoyvq6 swh-mirror_prometheus global 1/1 prom/prometheus:latest
p6hnd90qnr79 swh-mirror_prometheus-statsd-exporter replicated 1/1 prom/statsd-exporter:latest
jjry62tz3k76 swh-mirror_storage replicated 1/1 softwareheritage/base:20200918-133743
jkkm7qm3awfh swh-mirror_web replicated 1/1 softwareheritage/web:20200918-133743
ID NAME MODE REPLICAS IMAGE PORTS
tc93talbe2tg swh_db-storage global 1/1 postgres:13
42q5jtxsh029 swh_db-web global 1/1 postgres:13
rtlz62ok6s96 swh_grafana replicated 1/1 grafana/grafana:latest
jao3rt0et17n swh_memcache replicated 1/1 memcached:latest
rulxakqgu2ko swh_nginx replicated 1/1 nginx:latest *:5081->5081/tcp
q560pvw3q3ls swh_objstorage replicated 2/2 softwareheritage/base:20211022-121751
a2h3ltaqdt56 swh_prometheus global 1/1 prom/prometheus:latest
lm24et9gjn2k swh_prometheus-statsd-exporter replicated 1/1 prom/statsd-exporter:latest
gwqinrao5win swh_storage replicated 2/2 softwareheritage/base:20211022-121751
7g46blmphfb4 swh_web replicated 1/1 softwareheritage/web:20211022-121751
This will start a series of containers with:
- an objstorage service,
- a storage service using a postgresql database as backend,
- a web app front end,
- a web app front end using a postgresql database as backend,
- a memcache for the web app,
- a prometheus monitoring app,
- a prometeus-statsd exporter,
- a grafana server,
- an nginx server serving as reverse proxy for grafana and swh-web.
using the latest published version of the docker images by default.
using the pinned version of the docker images.
The nginx frontend will listen on the 5081 port, so you can use:
@@ -160,43 +193,20 @@ The nginx frontend will listen on the 5081 port, so you can use:
.. warning::
the 'latest' docker images work, it is highly recommended to
explicitly specify the version of the image you want to use.
Docker images for the Software Heritage stack are tagged with their build date:
.. code-block:: bash
~$ docker images -f reference='softwareheritage/*:20*'
REPOSITORY TAG IMAGE ID CREATED SIZE
softwareheritage web-20200819-112604 32ab8340e368 About an hour ago 339MB
softwareheritage base-20200819-112604 19fe3d7326c5 About an hour ago 242MB
softwareheritage web-20200630-115021 65b1869175ab 7 weeks ago 342MB
softwareheritage base-20200630-115021 3694e3fcf530 7 weeks ago 245MB
To specify the tag to be used, simply set the SWH_IMAGE_TAG environment variable, like:
.. code-block:: bash
export SWH_IMAGE_TAG=20200819-112604
docker deploy -c base-services.yml swh
.. warning::
make sure to have this variable properly set for any later `docker deploy`
command you type, otherwise you running containers will be recreated using the
':latest' image (which might **not** be the latest available version, nor
consistent amond the docker nodes on you swarm cluster).
Please make sure that the `SWH_IMAGE_TAG` variable is properly set for any later
`docker stack deploy` command you type, otherwise all the running containers will be
recreated using the ':latest' image (which might **not** be the latest available
version, nor consistent among the docker nodes on your swarm cluster).
Updating a configuration
------------------------
When you modify a configuration file exposed to docker services via the `docker
Configuration files are exposed to docker services via the `docker
config` system. Unfortunately, docker does not support updating these config
objects, so you need to either:
objects, so you will need to either:
- destroy the old config before being able to recreate them. That also means
you need to recreate every docker container using this config, or
you need to recreate every docker service using this config, or
- adapt the `name:` field in the compose file.
@@ -220,7 +230,7 @@ For example, if you edit the file `conf/storage.yml`:
Note: since persistent data (databases and objects) are stored in volumes, you
can safely destoy and recreate any container you want, you will not loose any
can safely destoy and recreate any container you want, you will not lose any
data.
Or you can change the compose file like:
@@ -276,19 +286,12 @@ replayer service):
.. code-block:: bash
~/swh-docker$ docker service update --image \
softwareheritage/replayer:${SWH_IMAGE_TAG} ) \
softwareheritage/replayer:${SWH_IMAGE_TAG} \
swh_graph-replayer
Set up a mirror
===============
.. warning::
you cannot "upgrade" an existing docker stack built from the base-services.yml file
to a mirror one; you need to recreate it; more precisely, you need to drop the
storage database before. This is due to the fact the storage database for a mirror is
not initialized the same way as the default storage database.
Set up the mirroring components
===============================
A Software Heritage mirror consists in base Software Heritage services, as
described above, without any worker related to web scraping nor source code
@@ -306,16 +309,16 @@ using configuration from yaml files in `conf/graph-replayer.yml`.
Copy these example files as plain yaml ones then modify them to replace
the XXX markers with proper values (also make sure the kafka server list
is up to date.) Parameters to check/update are:
is up to date). The parameters to check/update are:
- `journal_client/brokers`: list of kafka brokers.
- `journal_client/group_id`: unique identifier for this mirroring session;
- `journal_client.brokers`: list of kafka brokers.
- `journal_client.group_id`: unique identifier for this mirroring session;
you can choose whatever you want, but changing this value will make kafka
start consuming messages from the beginning; kafka messages are dispatched
among consumers with the same `group_id`, so in order to distribute the
load among workers, they must share the same `group_id`.
- `journal_client/sasl.username`: kafka authentication username.
- `journal_client/sasl.password`: kafka authentication password.
- `journal_client."sasl.username"`: kafka authentication username.
- `journal_client."sasl.password"`: kafka authentication password.
Then you need to merge the compose files "by hand" (due to this still
`unresolved <https://github.com/docker/cli/issues/1651>`_
@@ -338,19 +341,19 @@ Then use this generated file as argument of the `docker stack deploy` command, e
.. code-block:: bash
~/swh-docker$ docker stack deploy -c mirror.yml swh-mirror
~/swh-docker$ docker stack deploy -c mirror.yml swh
Graph replayer
--------------
To run the graph replayer compoenent of a mirror:
To run the graph replayer component of a mirror:
.. code-block:: bash
~/swh-docker$ cd conf
~/swh-docker/conf$ cp graph-replayer.yml.example graph-replayer.yml
~/swh-docker/conf$ # edit graph-replayer.yml files
~/swh-docker/conf$ $EDITOR graph-replayer.yml
~/swh-docker/conf$ cd ..
@@ -362,10 +365,10 @@ start these services with:
~/swh-docker$ docker-compose \
-f base-services.yml \
-f graph-replayer-override.yml \
config > graph-replayer.yml
config > stack-with-graph-replayer.yml
~/swh-docker$ docker stack deploy \
-c graph-replayer.yml \
swh-mirror
-c stack-with-graph-replayer.yml \
swh
[...]
You can check everything is running with:
@@ -373,35 +376,38 @@ You can check everything is running with:
.. code-block:: bash
~/swh-docker$ docker stack ls
NAME SERVICES ORCHESTRATOR
swh-mirror 11 Swarm
NAME SERVICES ORCHESTRATOR
swh 11 Swarm
~/swh-docker$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
88djaq3jezjm swh-mirror_db-storage replicated 1/1 postgres:11
m66q36jb00xm swh-mirror_grafana replicated 1/1 grafana/grafana:latest
qfsxngh4s2sv swh-mirror_content-replayer replicated 1/1 softwareheritage/replayer:latest
qcl0n3ngr2uv swh-mirror_graph-replayer replicated 1/1 softwareheritage/replayer:latest
zn8dzsron3y7 swh-mirror_memcache replicated 1/1 memcached:latest
wfbvf3yk6t41 swh-mirror_nginx replicated 1/1 nginx:latest *:5081->5081/tcp
thtev7o0n6th swh-mirror_objstorage replicated 1/1 softwareheritage/base:latest
ysgdoqshgd2k swh-mirror_prometheus replicated 1/1 prom/prometheus:latest
u2mjjl91aebz swh-mirror_prometheus-statsd-exporter replicated 1/1 prom/statsd-exporter:latest
xyf2xgt465ob swh-mirror_storage replicated 1/1 softwareheritage/base:latest
su8eka2b5cbf swh-mirror_web replicated 1/1 softwareheritage/web:latest
ID NAME MODE REPLICAS IMAGE PORTS
tc93talbe2tg swh_db-storage global 1/1 postgres:13
42q5jtxsh029 swh_db-web global 1/1 postgres:13
rtlz62ok6s96 swh_grafana replicated 1/1 grafana/grafana:latest
7hvn66um77wr swh_graph-replayer replicated 4/4 softwareheritage/replayer:20211022-121751
jao3rt0et17n swh_memcache replicated 1/1 memcached:latest
rulxakqgu2ko swh_nginx replicated 1/1 nginx:latest *:5081->5081/tcp
q560pvw3q3ls swh_objstorage replicated 2/2 softwareheritage/base:20211022-121751
a2h3ltaqdt56 swh_prometheus global 1/1 prom/prometheus:latest
lm24et9gjn2k swh_prometheus-statsd-exporter replicated 1/1 prom/statsd-exporter:latest
gwqinrao5win swh_storage replicated 2/2 softwareheritage/base:20211022-121751
7g46blmphfb4 swh_web replicated 1/1 softwareheritage/web:20211022-121751
If everything is OK, you should have your mirror filling. Check docker logs:
.. code-block:: bash
~/swh-docker$ docker service logs swh-mirror_graph-replayer
~/swh-docker$ docker service logs swh_graph-replayer
[...]
or:
.. code-block:: bash
~/swh-docker$ docker service logs --tail 100 --follow swh-mirror_graph-replayer
~/swh-docker$ docker service logs --tail 100 --follow swh_graph-replayer
[...]
@@ -429,7 +435,7 @@ start these services with:
config > content-replayer.yml
~/swh-docker$ docker stack deploy \
-c content-replayer.yml \
swh-mirror
swh
[...]
@@ -447,12 +453,15 @@ Putting all together is just a matter of merging the 3 compose files:
config > mirror.yml
~/swh-docker$ docker stack deploy \
-c mirror.yml \
swh-mirror
swh
[...]
Scaling up services
-------------------
Getting your deployment production-ready
========================================
docker-stack scaling
--------------------
In order to scale up a replayer service, you can use the `docker scale` command. For example:
@@ -464,16 +473,34 @@ In order to scale up a replayer service, you can use the `docker scale` command.
will start 4 copies of the graph replayer service.
Notes:
Notes on the throughput of the mirroring process
------------------------------------------------
- One graph replayer service requires a steady 500MB to 1GB of RAM to run, so
make sure you have properly sized machines for running these replayer
containers, and to monitor these.
- The overall bandwidth of the replayer will depend heavily on the
`swh_storage` service, thus on the `swh_db-storage`. It will require some
network bandwidth for the ingress kafka payload (this can easily peak to
several hundreds of Mb/s). So make sure you have a correctly tuned database
and enough network bw.
- The graph replayer containers will require sufficient network bandwidth for the kafka
traffic (this can easily peak to several hundreds of megabits per second, and the
total volume of data fetched will be multiple tens of terabytes).
- The biggest kafka topics are directory, revision and content, and will take the
longest to initially replay.
Operational concerns for the Storage database
---------------------------------------------
The overall throughput of the mirroring process will depend heavily on the `swh_storage`
service, and on the performance of the underlying `swh_db-storage` database. You will
need to make sure that your database is `properly tuned
<https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server>`_.
You may also want to deploy your database directly to a bare-metal server rather than
have it managed within the docker stack. To do so, you will have to:
- Biggest topics are the directory, revision and content.
- modify the (merged) configuration of the docker stack to drop references to the
`db-storage` service (itself, and as dependency for the `storage` service)
- ensure that docker containers deployed in your swarm are able to connect to your
external database server
- override the environment variables of the `storage` service to reference the external
database server and dbname
Loading