Verified commit 09ef5e2c, authored by Vincent Sellier

Move mirror documentation from devel to sysadm

and merge duplicate parts

Related to T3829
Software Architecture
=====================

.. toctree::
   :titlesonly:

   overview
   mirror
   metadata

.. _mirror:

Mirroring
=========

Description
-----------
A mirror is a full copy of the |swh| archive, operated independently from the
Software Heritage initiative. A minimal mirror consists of two parts:

- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`),
  which contains the Merkle DAG structure of the archive, *except* the actual
  content of source code files (AKA blobs),
- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`),
  which contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible by others. As such, a
proper mirror should also allow users to:

- navigate the archive copy using a Web browser and/or the Web API (typically
  using :ref:`the web application <swh-web>`),
- retrieve data from the copy of the archive (typically using :ref:`the vault
  service <swh-vault>`).

A mirror is initially populated, and then kept up to date, by consuming data
from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the
blob objects (file content) from the |swh| :ref:`object storage <swh-objstorage>`.

.. note:: It is not required that a mirror be deployed using the |swh|
   software stack. Other technologies, including different storage methods,
   can be used; this documentation, however, focuses on mirror deployments
   that use the |swh| software stack.
.. thumbnail:: ../images/mirror-architecture.svg

   General view of the |swh| mirroring architecture.

Mirroring the Graph Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replication of the graph is based on a journal that uses Kafka_ as its
event streaming platform.

On the Software Heritage side, every addition made to the archive consists of
the addition of a :ref:`data-model` object. The new object is also serialized
as a msgpack_ bytestring, which is used as the value of a message added to the
Kafka topic dedicated to that object type.
The main Kafka topics for the |swh| :ref:`data-model` are:
- `swh.journal.objects.content`
- `swh.journal.objects.directory`
- `swh.journal.objects.extid`
- `swh.journal.objects.metadata_authority`
- `swh.journal.objects.metadata_fetcher`
- `swh.journal.objects.origin_visit_status`
- `swh.journal.objects.origin_visit`
- `swh.journal.objects.origin`
- `swh.journal.objects.raw_extrinsic_metadata`
- `swh.journal.objects.release`
- `swh.journal.objects.revision`
- `swh.journal.objects.skipped_content`
- `swh.journal.objects.snapshot`
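
For illustration only, here is a minimal sketch of consuming one of these
topics with the plain ``confluent-kafka`` and ``msgpack`` Python libraries.
The broker address, group id and (omitted) authentication settings are
placeholders; a production mirror would normally rely on the `swh-journal`
client instead.

.. code-block:: python

   # Minimal sketch (not the swh-journal client): read one journal topic with
   # plain confluent-kafka and decode the msgpack-encoded message values.
   # Broker address and group id below are placeholders; authentication is
   # omitted.
   import msgpack
   from confluent_kafka import Consumer

   consumer = Consumer({
       "bootstrap.servers": "broker.journal.example.org:9093",  # placeholder
       "group.id": "my-mirror-test",                            # placeholder
       "auto.offset.reset": "earliest",
   })
   consumer.subscribe(["swh.journal.objects.origin"])

   try:
       while True:
           msg = consumer.poll(timeout=1.0)
           if msg is None:
               continue
           if msg.error():
               print("consumer error:", msg.error())
               continue
           # Each message value is a msgpack-encoded mapping describing one
           # data model object (for origins, essentially the origin URL).
           origin = msgpack.unpackb(msg.value(), raw=False)
           print(msg.topic(), origin)
   finally:
       consumer.close()
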
In order to set up a mirror of the graph, one needs to deploy a stack capable
of retrieving all these topics and storing their content reliably. For
example, a Kafka cluster configured as a replica of the main Kafka broker
hosted by |swh| would do the job (albeit not in a very useful manner by
itself).

A more useful mirror can be set up using the :ref:`storage <swh-storage>`
component with the help of the dedicated `replayer` service provided by the
:mod:`swh.storage.replay` module.

.. TODO: replace this previous link by a link to the 'swh storage replay'
   command once available, and ideally once
   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
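
Conceptually, the replayer is a loop that takes batches of deserialized
journal objects and pushes them to the mirror storage through per-object-type
``*_add`` methods (``origin_add``, ``release_add``, ...). The toy sketch below
only shows that shape, using an in-memory stand-in for the storage; it is
*not* the :mod:`swh.storage.replay` implementation.

.. code-block:: python

   # Conceptual sketch of a graph replayer loop: push msgpack-decoded journal
   # objects, grouped by type, to per-type "*_add" methods. MirrorStorage is a
   # toy in-memory stand-in, not the real swh-storage backend.
   from typing import Dict, List


   class MirrorStorage:
       """Toy storage exposing the per-object-type '*_add' shape."""

       def __init__(self) -> None:
           self.objects: Dict[str, List[dict]] = {}

       def origin_add(self, origins: List[dict]) -> None:
           self.objects.setdefault("origin", []).extend(origins)

       def release_add(self, releases: List[dict]) -> None:
           self.objects.setdefault("release", []).extend(releases)


   def replay_batch(storage: MirrorStorage, batch: Dict[str, List[dict]]) -> None:
       # 'batch' maps an object type (the topic name suffix) to the list of
       # decoded objects read from the corresponding topic.
       for object_type, objects in batch.items():
           getattr(storage, f"{object_type}_add")(objects)


   storage = MirrorStorage()
   replay_batch(storage, {"origin": [{"url": "https://example.org/repo.git"}]})
   print(storage.objects)
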
Mirroring the Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File contents (blobs) are *not* stored directly in the messages of the
`swh.journal.objects.content` Kafka topic, which only carries metadata about
them, such as various kinds of cryptographic hashes. A separate component is
in charge of replicating blob objects from the archive and storing them in the
local object storage instance.

A separate `swh-journal` client should subscribe to the
`swh.journal.objects.content` topic to get the stream of blob object
identifiers, retrieve the corresponding blobs from the main Software Heritage
object storage, and store them in the local object storage.

A reference implementation of this component is available in the
:ref:`content replayer <swh-objstorage-replayer>`.
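
To make the copy step concrete, here is a sketch of what handling one decoded
`content` message could look like: fetch the blob by one of its hashes and
store it under a local, hash-sliced directory layout. The download URL and
paths are placeholders, not documented |swh| endpoints; the reference
implementation talks to an swh-objstorage instance rather than raw HTTP.

.. code-block:: python

   # Sketch of the blob copy step of a content replayer: given the hashes from
   # one decoded 'content' journal message, fetch the blob and store it in a
   # local, hash-sliced directory layout. The URL and paths are placeholders;
   # the reference implementation uses swh-objstorage, not raw HTTP.
   import hashlib
   import pathlib

   import requests

   OBJSTORAGE_URL = "https://objstorage.example.org/content"    # placeholder
   MIRROR_ROOT = pathlib.Path("/srv/softwareheritage/objects")  # placeholder


   def copy_blob(content: dict) -> pathlib.Path:
       sha1 = content["sha1"]
       hex_sha1 = sha1.hex() if isinstance(sha1, bytes) else sha1

       resp = requests.get(f"{OBJSTORAGE_URL}/{hex_sha1}", timeout=30)
       resp.raise_for_status()
       blob = resp.content

       # Integrity check: the blob must match the hash announced in the journal.
       if hashlib.sha1(blob).hexdigest() != hex_sha1:
           raise ValueError(f"checksum mismatch for {hex_sha1}")

       # Hash-sliced layout: fan the files out over the first hex characters.
       target = MIRROR_ROOT / hex_sha1[0:2] / hex_sha1[2:4] / hex_sha1
       target.parent.mkdir(parents=True, exist_ok=True)
       target.write_bytes(blob)
       return target

A real deployment also needs to handle errors, retries, missing objects and
whichever objstorage backend was chosen.
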
Installation
------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed (cf. the architecture diagram above):

- a database to store the graph of the |swh| archive,
- the :ref:`swh-storage` component,
- an object storage solution (cloud-based or on a local filesystem, e.g. ZFS
  pools),
- the :ref:`swh-objstorage` component,
- the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
  package),
- the :mod:`swh.objstorage.replayer.replay` service (from the
  :ref:`swh-objstorage-replayer` package).

A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
solution is provided as a working example of the mirror stack:
https://forge.softwareheritage.org/source/swh-docker

It is strongly recommended to start from there before planning a
production-like deployment.

See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_
file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_
repository for details.
.. _Kafka: https://kafka.apache.org/
.. _msgpack: https://msgpack.org
Architecture
------------

* :ref:`architecture-overview` → get a glimpse of the Software Heritage software
  architecture
* :ref:`mirror` → learn what a Software Heritage mirror is and how to set up
  one
* :ref:`Metadata workflow <architecture-metadata>` → learn how Software Heritage
  stores and handles metadata
* :ref:`Keycloak <keycloak>` → learn how to use Keycloak, the authentication
  system used by |swh|'s web interface and public APIs

Data Model and Specifications
-----------------------------
.. _content_replayer:

Content Replayer Service
========================

.. todo::
   This page is a work in progress.
A mirror deployment consists of running several components of the |swh|
stack:

- An instance of the storage (:ref:`swh-storage`);
- A backend database (PostgreSQL or Cassandra) for the storage;
- An instance of the object storage (:ref:`swh-objstorage`);
- A large storage system (ZFS or cloud storage) as the objstorage backend;
- An instance of the frontend (:ref:`swh-web`);
- [Optional] An instance of the search engine backend (:ref:`swh-search`);
- [Optional] An Elasticsearch instance as the swh-search backend;
- [Optional] The vault service and its support tooling (RabbitMQ,
  :ref:`swh-scheduler`, :ref:`swh-vault`, ...);
- The replayer services:

  - the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
    package),
  - the :mod:`swh.objstorage.replayer.replay` service (from the
    :ref:`swh-objstorage-replayer` package).

Each service consists of an HTTP-based RPC API served by a `gunicorn
<https://gunicorn.org/>`_ `WSGI
<https://fr.wikipedia.org/wiki/Web_Server_Gateway_Interface>`_ server.
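
As an illustration of that pattern only (this is not the actual |swh| RPC
framework), a bare WSGI application like the one below could be served with
``gunicorn hello_rpc:app``; the module name and response body are made up for
the example.

.. code-block:: python

   # hello_rpc.py -- minimal WSGI application illustrating the "HTTP RPC
   # behind gunicorn" deployment pattern; not the actual swh RPC framework.
   import json


   def app(environ, start_response):
       # Answer every request with a small JSON document echoing the path.
       body = json.dumps(
           {"status": "ok", "path": environ.get("PATH_INFO", "/")}
       ).encode("utf-8")
       start_response(
           "200 OK",
           [("Content-Type", "application/json"),
            ("Content-Length", str(len(body)))],
       )
       return [body]
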
Docker-based deployment
-----------------------

A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
solution is provided as a working example of the mirror stack.

It is strongly recommended to :ref:`start from there <mirror_docker>` in a test
environment before planning a production-like deployment.
Step by step deployment of a mirror
-----------------------------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed and configured to interact with each
other (a quick sanity check of the resulting services is sketched after the
list below):

#. :ref:`How to deploy the objstorage <mirror_objstorage>`: the objstorage
   consists of an object storage solution (cloud-based or on a local
   filesystem, e.g. ZFS pools) and the :ref:`swh-objstorage` service,

#. :ref:`How to deploy the content replayer service <content_replayer>`: the
   :mod:`swh-devel:swh.objstorage.replayer.replay` service is responsible for
   consuming the ``content`` topic from the |swh| Kafka broker and filling the
   mirror objstorage, retrieving blob objects from the |swh| objstorage,

#. :ref:`How to deploy the storage <mirror_storage>`: the storage consists of
   a database (PostgreSQL or Cassandra) storing the graph of the |swh| archive
   and the :ref:`swh-devel:swh-storage` service,

#. :ref:`How to deploy the graph replayer service <mirror_graph_replayer>`:
   the :mod:`swh-devel:swh.storage.replay` service is responsible for
   consuming objects from the |swh| Kafka broker and filling the mirror
   storage,

#. :ref:`How to deploy the frontend <mirror_frontend>`: the :ref:`frontend
   <swh-devel:swh-web>` consists of a `django <https://www.djangoproject.com/>`_
   based application serving both the web API and the main UI for browsing the
   archive,

#. :ref:`How to deploy the search engine <mirror_search>`: the :ref:`search
   engine <swh-devel:swh-search>` consists of an `Elasticsearch
   <https://www.elastic.co/>`_ based application used by the frontend,

#. :ref:`How to deploy the vault service <mirror_vault>`: the :ref:`vault
   service <swh-devel:swh-vault>` consists of an asynchronous backend service
   allowing users to request a zip archive of a given repository or its git
   history.
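
Once these services are up, a quick sanity check is to probe each HTTP
endpoint, as sketched below; all hostnames and ports are illustrative
placeholders to be adapted to your own deployment.

.. code-block:: python

   # Illustrative post-deployment sanity check: probe each service's HTTP
   # endpoint. Hostnames and ports are placeholders for your own deployment.
   import requests

   SERVICES = {
       "objstorage": "http://objstorage.mirror.example.org:5003/",
       "storage": "http://storage.mirror.example.org:5002/",
       "web": "http://web.mirror.example.org:5004/",
       "search": "http://search.mirror.example.org:5010/",
   }

   for name, url in SERVICES.items():
       try:
           resp = requests.get(url, timeout=5)
           print(f"{name:12s} {url} -> HTTP {resp.status_code}")
       except requests.RequestException as exc:
           print(f"{name:12s} {url} -> ERROR: {exc}")
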
.. toctree::
   :titlesonly:
   :hidden:

   docker
   objstorage
   storage
   content-replayer
   graph-replayer
   frontend
   search
   vault
.. _mirror_frontend:

Frontend Services
=================

.. todo::
   This page is a work in progress.

.. _mirror_graph_replayer:

Graph Replayer Service
======================

.. todo::
   This page is a work in progress.
.. _mirror_operations:

Mirror Operations
=================

Description
-----------

A mirror is a full copy of the |swh| archive, operated independently from the
Software Heritage initiative.
A mirror should be able to:

- store a full copy of the archive,
- serve the data using the web UI,
- search the archive using the web UI,
- serve the data using the public API,
- allow users to retrieve content from the archive using the :ref:`Vault
  <swh-devel:swh-vault>` service.

See :ref:`swh-devel:mirror` for a complete description of the mirror
architecture.

You may also want to read:
- :ref:`mirror_deploy` if you want to deploy a mirror of the |swh| archive on
your infrastructure.
- :ref:`mirror_monitor` to learn how to monitor your mirror and how to report
  its health back to |swh|.
- :ref:`mirror_onboard` for the |swh| side view of adding a new mirror.
.. _mirror_objstorage:

Objstorage Service
==================

.. todo::
   This page is a work in progress.

.. _mirror_search:

Search Services
===============

.. todo::
   This page is a work in progress.

.. _mirror_storage:

Storage Services
================

.. todo::
   This page is a work in progress.

.. _mirror_vault:

Vault Services
==============

.. todo::
   This page is a work in progress.