Verified commit 09ef5e2c, authored by Vincent Sellier

Move mirror documentation from devel to sysadm

and merge duplicate parts

Related to T3829
Software Architecture
=====================

.. toctree::
   :titlesonly:

   overview
   mirror
   metadata

.. _mirror:

Mirroring
=========

Description
-----------
A mirror is a full copy of the |swh| archive, operated independently from the
Software Heritage initiative. A minimal mirror consists of two parts:

- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`),
  which contains the Merkle DAG structure of the archive, *except* the actual
  content of source code files (AKA blobs),
- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`),
  which contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible by others. As such, a
proper mirror should also allow users to:

- navigate the archive copy using a Web browser and/or the Web API (typically
  using :ref:`the web application <swh-web>`),
- retrieve data from the copy of the archive (typically using :ref:`the vault
  service <swh-vault>`).

A mirror is initially populated, and then kept up to date, by consuming data
from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the
blob objects (file content) from the |swh| :ref:`object storage <swh-objstorage>`.

.. note:: It is not required that a mirror be deployed using the |swh|
   software stack. Other technologies, including different storage methods,
   can be used; this documentation, however, focuses on mirror deployments
   that use the |swh| software stack.
.. thumbnail:: ../images/mirror-architecture.svg

   General view of the |swh| mirroring architecture.

Mirroring the Graph Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replication of the graph is based on a journal that uses Kafka_ as its
event streaming platform.

On the Software Heritage side, every addition made to the archive consists of
the addition of a :ref:`data-model` object. The new object is also serialized
as a msgpack_ bytestring, which is used as the value of a message added to the
Kafka topic dedicated to that object type.
The main Kafka topics for the |swh| :ref:`data-model` are:
- `swh.journal.objects.content`
- `swh.journal.objects.directory`
- `swh.journal.objects.extid`
- `swh.journal.objects.metadata_authority`
- `swh.journal.objects.metadata_fetcher`
- `swh.journal.objects.origin_visit_status`
- `swh.journal.objects.origin_visit`
- `swh.journal.objects.origin`
- `swh.journal.objects.raw_extrinsic_metadata`
- `swh.journal.objects.release`
- `swh.journal.objects.revision`
- `swh.journal.objects.skipped_content`
- `swh.journal.objects.snapshot`
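
For illustration only, here is a minimal sketch of consuming one of these
topics with the plain ``confluent-kafka`` and ``msgpack`` Python libraries.
The broker address, group id and (omitted) authentication settings are
placeholders; a production mirror would normally rely on the `swh-journal`
client instead.

.. code-block:: python

   # Minimal sketch (not the swh-journal client): read one journal topic with
   # plain confluent-kafka and decode the msgpack-encoded message values.
   # Broker address and group id below are placeholders; authentication is
   # omitted.
   import msgpack
   from confluent_kafka import Consumer

   consumer = Consumer({
       "bootstrap.servers": "broker.journal.example.org:9093",  # placeholder
       "group.id": "my-mirror-test",                            # placeholder
       "auto.offset.reset": "earliest",
   })
   consumer.subscribe(["swh.journal.objects.origin"])

   try:
       while True:
           msg = consumer.poll(timeout=1.0)
           if msg is None:
               continue
           if msg.error():
               print("consumer error:", msg.error())
               continue
           # Each message value is a msgpack-encoded mapping describing one
           # data model object (for origins, essentially the origin URL).
           origin = msgpack.unpackb(msg.value(), raw=False)
           print(msg.topic(), origin)
   finally:
       consumer.close()
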
In order to set up a mirror of the graph, one needs to deploy a stack capable
of retrieving all these topics and storing their content reliably. For
example, a Kafka cluster configured as a replica of the main Kafka broker
hosted by |swh| would do the job (albeit not in a very useful manner by
itself).

A more useful mirror can be set up using the :ref:`storage <swh-storage>`
component with the help of the dedicated `replayer` service provided by the
:mod:`swh.storage.replay` module.

.. TODO: replace this previous link by a link to the 'swh storage replay'
   command once available, and ideally once
   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
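
Conceptually, the replayer is a loop that takes batches of deserialized
journal objects and pushes them to the mirror storage through per-object-type
``*_add`` methods (``origin_add``, ``release_add``, ...). The toy sketch below
only shows that shape, using an in-memory stand-in for the storage; it is
*not* the :mod:`swh.storage.replay` implementation.

.. code-block:: python

   # Conceptual sketch of a graph replayer loop: push msgpack-decoded journal
   # objects, grouped by type, to per-type "*_add" methods. MirrorStorage is a
   # toy in-memory stand-in, not the real swh-storage backend.
   from typing import Dict, List


   class MirrorStorage:
       """Toy storage exposing the per-object-type '*_add' shape."""

       def __init__(self) -> None:
           self.objects: Dict[str, List[dict]] = {}

       def origin_add(self, origins: List[dict]) -> None:
           self.objects.setdefault("origin", []).extend(origins)

       def release_add(self, releases: List[dict]) -> None:
           self.objects.setdefault("release", []).extend(releases)


   def replay_batch(storage: MirrorStorage, batch: Dict[str, List[dict]]) -> None:
       # 'batch' maps an object type (the topic name suffix) to the list of
       # decoded objects read from the corresponding topic.
       for object_type, objects in batch.items():
           getattr(storage, f"{object_type}_add")(objects)


   storage = MirrorStorage()
   replay_batch(storage, {"origin": [{"url": "https://example.org/repo.git"}]})
   print(storage.objects)
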
Mirroring the Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File contents (blobs) are *not* stored directly in the messages of the
`swh.journal.objects.content` Kafka topic, which only carries metadata about
them, such as various kinds of cryptographic hashes. A separate component is
in charge of replicating blob objects from the archive and storing them in the
local object storage instance.

A separate `swh-journal` client should subscribe to the
`swh.journal.objects.content` topic to get the stream of blob object
identifiers, retrieve the corresponding blobs from the main Software Heritage
object storage, and store them in the local object storage.

A reference implementation of this component is available in the
:ref:`content replayer <swh-objstorage-replayer>`.
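
To make the copy step concrete, here is a sketch of what handling one decoded
`content` message could look like: fetch the blob by one of its hashes and
store it under a local, hash-sliced directory layout. The download URL and
paths are placeholders, not documented |swh| endpoints; the reference
implementation talks to an swh-objstorage instance rather than raw HTTP.

.. code-block:: python

   # Sketch of the blob copy step of a content replayer: given the hashes from
   # one decoded 'content' journal message, fetch the blob and store it in a
   # local, hash-sliced directory layout. The URL and paths are placeholders;
   # the reference implementation uses swh-objstorage, not raw HTTP.
   import hashlib
   import pathlib

   import requests

   OBJSTORAGE_URL = "https://objstorage.example.org/content"    # placeholder
   MIRROR_ROOT = pathlib.Path("/srv/softwareheritage/objects")  # placeholder


   def copy_blob(content: dict) -> pathlib.Path:
       sha1 = content["sha1"]
       hex_sha1 = sha1.hex() if isinstance(sha1, bytes) else sha1

       resp = requests.get(f"{OBJSTORAGE_URL}/{hex_sha1}", timeout=30)
       resp.raise_for_status()
       blob = resp.content

       # Integrity check: the blob must match the hash announced in the journal.
       if hashlib.sha1(blob).hexdigest() != hex_sha1:
           raise ValueError(f"checksum mismatch for {hex_sha1}")

       # Hash-sliced layout: fan the files out over the first hex characters.
       target = MIRROR_ROOT / hex_sha1[0:2] / hex_sha1[2:4] / hex_sha1
       target.parent.mkdir(parents=True, exist_ok=True)
       target.write_bytes(blob)
       return target

A real deployment also needs to handle errors, retries, missing objects and
whichever objstorage backend was chosen.
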
Installation
------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed (cf. the architecture diagram above):

- a database to store the graph of the |swh| archive,
- the :ref:`swh-storage` component,
- an object storage solution (cloud-based or on a local filesystem, e.g. ZFS
  pools),
- the :ref:`swh-objstorage` component,
- the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
  package),
- the :mod:`swh.objstorage.replayer.replay` service (from the
  :ref:`swh-objstorage-replayer` package).

A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
solution is provided as a working example of the mirror stack:
https://forge.softwareheritage.org/source/swh-docker

It is strongly recommended to start from there before planning a
production-like deployment.

See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_
file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_
repository for details.
.. _Kafka: https://kafka.apache.org/
.. _msgpack: https://msgpack.org
Architecture
------------

* :ref:`architecture-overview` → get a glimpse of the Software Heritage software
  architecture
* :ref:`mirror` → learn what a Software Heritage mirror is and how to set up
  one
* :ref:`Metadata workflow <architecture-metadata>` → learn how Software Heritage
  stores and handles metadata
* :ref:`Keycloak <keycloak>` → learn how to use Keycloak, the authentication
  system used by |swh|'s web interface and public APIs

Data Model and Specifications
-----------------------------
.. _content_replayer:

Content Replayer Service
========================

.. todo::
   This page is a work in progress.
A mirror deployment consists of running several components of the |swh|
stack:

- An instance of the storage (:ref:`swh-storage`);
- A backend database (PostgreSQL or Cassandra) for the storage;
- An instance of the object storage (:ref:`swh-objstorage`);
- A large storage system (ZFS or cloud storage) as the objstorage backend;
- An instance of the frontend (:ref:`swh-web`);
- [Optional] An instance of the search engine backend (:ref:`swh-search`);
- [Optional] An Elasticsearch instance as the swh-search backend;
- [Optional] The vault service and its support tooling (RabbitMQ,
  :ref:`swh-scheduler`, :ref:`swh-vault`, ...);
- The replayer services:

  - the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
    package),
  - the :mod:`swh.objstorage.replayer.replay` service (from the
    :ref:`swh-objstorage-replayer` package).

Each service consists of an HTTP-based RPC API served by a `gunicorn
<https://gunicorn.org/>`_ `WSGI
<https://fr.wikipedia.org/wiki/Web_Server_Gateway_Interface>`_ server.
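
As an illustration of that pattern only (this is not the actual |swh| RPC
framework), a bare WSGI application like the one below could be served with
``gunicorn hello_rpc:app``; the module name and response body are made up for
the example.

.. code-block:: python

   # hello_rpc.py -- minimal WSGI application illustrating the "HTTP RPC
   # behind gunicorn" deployment pattern; not the actual swh RPC framework.
   import json


   def app(environ, start_response):
       # Answer every request with a small JSON document echoing the path.
       body = json.dumps(
           {"status": "ok", "path": environ.get("PATH_INFO", "/")}
       ).encode("utf-8")
       start_response(
           "200 OK",
           [("Content-Type", "application/json"),
            ("Content-Length", str(len(body)))],
       )
       return [body]
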
Docker-based deployment
-----------------------

A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
solution is provided as a working example of the mirror stack.

It is strongly recommended to :ref:`start from there <mirror_docker>` in a test
environment before planning a production-like deployment.
Step by step deployment of a mirror
-----------------------------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed and configured to interact with each
other (a quick sanity check of the resulting services is sketched after the
list below):

#. :ref:`How to deploy the objstorage <mirror_objstorage>`: the objstorage
   consists of an object storage solution (cloud-based or on a local
   filesystem, e.g. ZFS pools) and the :ref:`swh-objstorage` service,

#. :ref:`How to deploy the content replayer service <content_replayer>`: the
   :mod:`swh-devel:swh.objstorage.replayer.replay` service is responsible for
   consuming the ``content`` topic from the |swh| Kafka broker and filling the
   mirror objstorage, retrieving blob objects from the |swh| objstorage,

#. :ref:`How to deploy the storage <mirror_storage>`: the storage consists of
   a database (PostgreSQL or Cassandra) storing the graph of the |swh| archive
   and the :ref:`swh-devel:swh-storage` service,

#. :ref:`How to deploy the graph replayer service <mirror_graph_replayer>`:
   the :mod:`swh-devel:swh.storage.replay` service is responsible for
   consuming objects from the |swh| Kafka broker and filling the mirror
   storage,

#. :ref:`How to deploy the frontend <mirror_frontend>`: the :ref:`frontend
   <swh-devel:swh-web>` consists of a `django <https://www.djangoproject.com/>`_
   based application serving both the web API and the main UI for browsing the
   archive,

#. :ref:`How to deploy the search engine <mirror_search>`: the :ref:`search
   engine <swh-devel:swh-search>` consists of an `Elasticsearch
   <https://www.elastic.co/>`_ based application used by the frontend,

#. :ref:`How to deploy the vault service <mirror_vault>`: the :ref:`vault
   service <swh-devel:swh-vault>` consists of an asynchronous backend service
   allowing users to request a zip archive of a given repository or its git
   history.
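
Once these services are up, a quick sanity check is to probe each HTTP
endpoint, as sketched below; all hostnames and ports are illustrative
placeholders to be adapted to your own deployment.

.. code-block:: python

   # Illustrative post-deployment sanity check: probe each service's HTTP
   # endpoint. Hostnames and ports are placeholders for your own deployment.
   import requests

   SERVICES = {
       "objstorage": "http://objstorage.mirror.example.org:5003/",
       "storage": "http://storage.mirror.example.org:5002/",
       "web": "http://web.mirror.example.org:5004/",
       "search": "http://search.mirror.example.org:5010/",
   }

   for name, url in SERVICES.items():
       try:
           resp = requests.get(url, timeout=5)
           print(f"{name:12s} {url} -> HTTP {resp.status_code}")
       except requests.RequestException as exc:
           print(f"{name:12s} {url} -> ERROR: {exc}")
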
.. toctree::
   :titlesonly:
   :hidden:

   docker
   objstorage
   storage
   content-replayer
   graph-replayer
   frontend
   search
   vault
.. _mirror_frontend:

Frontend Services
=================

.. todo::
   This page is a work in progress.

.. _mirror_graph_replayer:

Graph Replayer Service
======================

.. todo::
   This page is a work in progress.
.. _mirror_operations:

Mirror Operations
=================

Description
-----------

A mirror is a full copy of the |swh| archive, operated independently from the
Software Heritage initiative.
A mirror should be able to:

- store a full copy of the archive,
- serve the data using the web UI,
- search the archive using the web UI,
- serve the data using the public API,
- allow users to retrieve content from the archive using the :ref:`Vault
  <swh-devel:swh-vault>` service.

See :ref:`swh-devel:mirror` for a complete description of the mirror
architecture.

You may also want to read:
- :ref:`mirror_deploy` if you want to deploy a mirror of the |swh| archive on
your infrastructure.
- :ref:`mirror_monitor` to learn how to monitor your mirror and how to report
  its health back to |swh|.
- :ref:`mirror_onboard` for the |swh| side view of adding a new mirror.
.. _mirror_objstorage:

Objstorage Service
==================

.. todo::
   This page is a work in progress.

.. _mirror_search:

Search Services
===============

.. todo::
   This page is a work in progress.

.. _mirror_storage:

Storage Services
================

.. todo::
   This page is a work in progress.

.. _mirror_vault:

Vault Services
==============

.. todo::
   This page is a work in progress.