From e981b54a86e29ce6b494ff4f8660860d04e390af Mon Sep 17 00:00:00 2001 From: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Fri, 9 Apr 2021 11:03:48 +0200 Subject: [PATCH] Move the architecture overview and mirror documentation to the architecture/ subfolder To match the toctree. I'm leaving keycloak out of it, because it does not actually describe the architecture. It's only in the architecture toctree for lack of a better place for now. --- docs/architecture.rst | 93 +---------------------- docs/architecture/index.rst | 11 ++- docs/architecture/mirror.rst | 133 +++++++++++++++++++++++++++++++++ docs/architecture/overview.rst | 94 +++++++++++++++++++++++ docs/index.rst | 2 +- docs/mirror.rst | 133 +-------------------------------- swh/docs/sphinx/conf.py | 2 + 7 files changed, 241 insertions(+), 227 deletions(-) create mode 100644 docs/architecture/mirror.rst create mode 100644 docs/architecture/overview.rst diff --git a/docs/architecture.rst b/docs/architecture.rst index 1fa7e727..ae695f6e 100644 --- a/docs/architecture.rst +++ b/docs/architecture.rst @@ -1,92 +1,3 @@ -.. _architecture: +:orphan: -Software Architecture -===================== - -From an end-user point of view, the |swh| platform consists in the -:term:`archive`, which can be accessed using the web interface or its REST API. -Behind the scene (and the web app) are several components that expose -different aspects of the |swh| :term:`archive` as internal RPC APIs. - -Each of these internal APIs have a dedicated (Postgresql) database. - -A global (and incomplete) view of this architecture looks like: - -.. thumbnail:: images/general-architecture.svg - - General view of the |swh| architecture. 
- -The front API components are: - -- :ref:`Storage API <swh-storage>` (including the Metadata Storage) -- :ref:`Deposit API <swh-deposit>` -- :ref:`Vault API <swh-vault>` -- :ref:`Indexer API <swh-indexer>` -- :ref:`Scheduler API <swh-scheduler>` - -On the back stage of this show, a celery_ based game of tasks and workers -occurs to perform all the required work to fill, maintain and update the |swh| -:term:`archive`. - -The main components involved in this choreography are: - -- :term:`Listers <lister>`: a lister is a type of task aiming at scraping a - web site, a forge, etc. to gather all the source code repositories it can - find. For each found source code repository, a :term:`loader` task is - created. - -- :term:`Loaders <loader>`: a loader is a type of task aiming at importing or - updating a source code repository. It is the one that inserts :term:`blob` - objects in the :term:`object storage`, and inserts nodes and edges in the - :ref:`graph <swh-merkle-dag>`. - -- :term:`Indexers <indexer>`: an indexer is a type of task aiming at crawling - the content of the :term:`archive` to extract derived information (mimetype, - etc.) - -- :term:`Vault <vault>`: this type of celery task is responsible for cooking a - compressed archive (zip or tgz) of an archived object (typically a directory - or a repository). Since this can be a rather long process, it is delegated to - an asynchronous (celery) task. - - -Tasks ------ - -Listers -+++++++ - -The following sequence diagram shows the interactions between these components -when a new forge needs to be archived. This example depicts the case of a -gitlab_ forge, but any other supported source type would be very similar. - -.. 
thumbnail:: images/tasks-lister.svg - -As one might observe in this diagram, it does two things: - -- it asks the forge (a gitlab_ instance in this case) the list of known - repositories, and - -- it insert one :term:`loader` task for each source code repository that will - be in charge of importing the content of that repository. - -Note that most listers usually work in incremental mode, meaning they store in a -dedicated database the current state of the listing of the forge. Then, on a subsequent -execution of the lister, it will ask only for new repositories. - -Also note that if the lister inserts a new loading task for a repository for which a -loading task already exists, the existing task will be updated (if needed) instead of -creating a new task. - -Loaders -+++++++ - -The sequence diagram below describe this second step of importing the content -of a repository. Once again, we take the example of a git repository, but any -other type of repository would be very similar. - -.. thumbnail:: images/tasks-git-loader.svg - - -.. _celery: https://www.celeryproject.org -.. _gitlab: https://gitlab.com +This page was moved to: :ref:`architecture-overview`. diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst index c6064273..5311ca9a 100644 --- a/docs/architecture/index.rst +++ b/docs/architecture/index.rst @@ -1,10 +1,13 @@ -Architecture -============ +.. _architecture: + +Software Architecture +===================== + .. toctree:: :maxdepth: 2 :titlesonly: - ../architecture - ../mirror + overview + mirror ../keycloak/index diff --git a/docs/architecture/mirror.rst b/docs/architecture/mirror.rst new file mode 100644 index 00000000..7885df35 --- /dev/null +++ b/docs/architecture/mirror.rst @@ -0,0 +1,133 @@ +.. _mirror: + + +Mirroring +========= + + +Description +----------- + +A mirror is a full copy of the |swh| archive, operated independently from the +Software Heritage initiative. 
A minimal mirror consists of two parts:
+
+- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`),
+  which contains the Merkle DAG structure of the archive, *except* the
+  actual content of source code files (AKA blobs),
+
+- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`),
+  which contains all the blobs corresponding to archived source code files.
+
+However, a usable mirror also needs to be accessible by others. As such, a
+proper mirror should also allow users to:
+
+- navigate the archive copy using a Web browser and/or the Web API (typically
+  using the :ref:`web application <swh-web>`),
+- retrieve data from the copy of the archive (typically using the
+  :ref:`vault service <swh-vault>`).
+
+A mirror is initially populated and then kept up to date by consuming data
+from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the
+blob objects (file contents) from the |swh| :ref:`object storage <swh-objstorage>`.
+
+.. note:: A mirror does not have to be deployed using the |swh| software
+   stack. Other technologies, including different storage methods, can be
+   used. This documentation, however, focuses on the case of mirror
+   deployment using the |swh| software stack.
+
+
+.. thumbnail:: images/mirror-architecture.svg
+
+   General view of the |swh| mirroring architecture.
+
+
+Mirroring the Graph Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The replication of the graph is based on a journal using Kafka_ as an event
+streaming platform.
+
+On the Software Heritage side, every addition made to the archive consists of
+the addition of a :ref:`data-model` object. The new object is also serialized
+as a msgpack_ bytestring, which is used as the value of a message added to a
+Kafka topic dedicated to the object type.
+
+The main Kafka topics for the |swh| :ref:`data-model` are:
+
+- `swh.journal.objects.content`
+- `swh.journal.objects.directory`
+- `swh.journal.objects.metadata_authority`
+- `swh.journal.objects.metadata_fetcher`
+- `swh.journal.objects.origin_visit_status`
+- `swh.journal.objects.origin_visit`
+- `swh.journal.objects.origin`
+- `swh.journal.objects.raw_extrinsic_metadata`
+- `swh.journal.objects.release`
+- `swh.journal.objects.revision`
+- `swh.journal.objects.skipped_content`
+- `swh.journal.objects.snapshot`
+
+In order to set up a mirror of the graph, one needs to deploy a stack capable
+of retrieving all these topics and storing their content reliably. For
+example, a Kafka cluster configured as a replica of the main Kafka broker
+hosted by |swh| would do the job (albeit not in a very useful manner by itself).
+
+A more useful mirror can be set up using the :ref:`storage <swh-storage>`
+component with the help of the special service named `replayer` provided by the
+:doc:`apidoc/swh.storage.replay` module.
+
+.. TODO: replace this previous link by a link to the 'swh storage replay'
+   command once available, and ideally once
+   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
+
+
+Mirroring the Object Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+File contents (blobs) are *not* directly stored in messages of the
+`swh.journal.objects.content` Kafka topic, which only contains metadata about
+them, such as various kinds of cryptographic hashes. A separate component is in
+charge of replicating blob objects from the archive and storing them in the
+local object storage instance.
+
+A separate `swh-journal` client should subscribe to the
+`swh.journal.objects.content` topic to get the stream of blob object
+identifiers, then retrieve the corresponding blobs from the main Software
+Heritage object storage, and store them in the local object storage.
+
+A reference implementation for this component is available in
+:ref:`content replayer <swh-objstorage-replayer>`.
+
+
+Installation
+------------
+
+When using the |swh| software stack to deploy a mirror, a number of |swh|
+software components must be installed (cf. architecture diagram above):
+
+- a database to store the graph of the |swh| archive,
+- the :ref:`swh-storage` component,
+- an object storage solution (which can be cloud-based or on a local
+  filesystem such as ZFS pools),
+- the :ref:`swh-objstorage` component,
+- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
+  package),
+- the :ref:`swh.objstorage.replayer.replay` service (from the
+  :ref:`swh-objstorage-replayer` package).
+
+A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
+solution is provided as a working example of the mirror stack:
+
+  https://forge.softwareheritage.org/source/swh-docker
+
+It is strongly recommended to start from there before planning a
+production-like deployment.
+
+See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_
+file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_
+repository for details.
+
+
+.. _Kafka: https://kafka.apache.org/
+.. _msgpack: https://msgpack.org
+
diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
new file mode 100644
index 00000000..abe51ac0
--- /dev/null
+++ b/docs/architecture/overview.rst
@@ -0,0 +1,94 @@
+.. _architecture-overview:
+
+Software Architecture Overview
+==============================
+
+
+From an end-user point of view, the |swh| platform consists of the
+:term:`archive`, which can be accessed using the web interface or its REST API.
+Behind the scenes (and the web app) are several components that expose
+different aspects of the |swh| :term:`archive` as internal RPC APIs.
+
+Each of these internal APIs has a dedicated (PostgreSQL) database.
+
+A global (and incomplete) view of this architecture looks like this:
+
+.. thumbnail:: images/general-architecture.svg
+
+   General view of the |swh| architecture.
+
+The front API components are:
+
+- :ref:`Storage API <swh-storage>` (including the Metadata Storage)
+- :ref:`Deposit API <swh-deposit>`
+- :ref:`Vault API <swh-vault>`
+- :ref:`Indexer API <swh-indexer>`
+- :ref:`Scheduler API <swh-scheduler>`
+
+Behind these APIs, a celery_-based system of tasks and workers performs all
+the work required to fill, maintain and update the |swh|
+:term:`archive`.
+
+The main components involved in this process are:
+
+- :term:`Listers <lister>`: a lister is a type of task that scrapes a web
+  site, a forge, etc. to gather all the source code repositories it can
+  find. For each source code repository found, a :term:`loader` task is
+  created.
+
+- :term:`Loaders <loader>`: a loader is a type of task that imports or
+  updates a source code repository. It is the one that inserts :term:`blob`
+  objects in the :term:`object storage`, and inserts nodes and edges in the
+  :ref:`graph <swh-merkle-dag>`.
+
+- :term:`Indexers <indexer>`: an indexer is a type of task that crawls the
+  content of the :term:`archive` to extract derived information (mimetype,
+  etc.)
+
+- :term:`Vault <vault>`: this type of celery task is responsible for cooking a
+  compressed archive (zip or tgz) of an archived object (typically a directory
+  or a repository). Since this can be a rather long process, it is delegated to
+  an asynchronous (celery) task.
+
+
+Tasks
+-----
+
+Listers
++++++++
+
+The following sequence diagram shows the interactions between these components
+when a new forge needs to be archived. This example depicts the case of a
+gitlab_ forge, but any other supported source type would be very similar.
+
+.. thumbnail:: images/tasks-lister.svg
+
+As this diagram shows, the lister does two things:
+
+- it asks the forge (a gitlab_ instance in this case) for the list of known
+  repositories, and
+
+- it inserts one :term:`loader` task for each source code repository, which
+  will be in charge of importing the content of that repository.
+
+Note that most listers work incrementally: they store the current state of the
+forge listing in a dedicated database, and on subsequent executions the lister
+only asks the forge for new repositories.
+
+Also note that if the lister inserts a new loading task for a repository for
+which a loading task already exists, the existing task will be updated (if
+needed) instead of creating a new task.
+
+Loaders
++++++++
+
+The sequence diagram below describes this second step of importing the content
+of a repository. Once again, we take the example of a git repository, but any
+other type of repository would be very similar.
+
+.. thumbnail:: images/tasks-git-loader.svg
+
+
+.. _celery: https://www.celeryproject.org
+.. _gitlab: https://gitlab.com
+
diff --git a/docs/index.rst b/docs/index.rst
index 6e46c059..50767286 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -25,7 +25,7 @@ Contributing
 Architecture
 ------------
 
-* :ref:`architecture` → get a glimpse of the Software Heritage software
+* :ref:`architecture-overview` → get a glimpse of the Software Heritage software
   architecture
 * :ref:`mirror` → learn what a Software Heritage mirror is and how to set
   up one
diff --git a/docs/mirror.rst b/docs/mirror.rst
index eea99429..116aae51 100644
--- a/docs/mirror.rst
+++ b/docs/mirror.rst
@@ -1,132 +1,3 @@
-.. _mirror:
+:orphan:
-
-
-Mirroring
-=========
-
-
-Description
------------
-
-A mirror is a full copy of the |swh| archive, operated independently from the
-Software Heritage initiative. 
A minimal mirror consists of two parts: - -- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`), - which contains the Merkle DAG structure of the archive, *except* the - actual content of source code files (AKA blobs), - -- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`), - which contains all the blobs corresponding to archived source code files. - -However, a usable mirror needs also to be accessible by others. As such, a -proper mirror should also allow to: - -- navigate the archive copy using a Web browser and/or the Web API (typically - using the :ref:`the web application <swh-web>`), -- retrieve data from the copy of the archive (typically using the :ref:`the - vault service <swh-vault>`) - -A mirror is initially populated and maintained up-to-date by consuming data -from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the -blob objects (file content) from the |swh| :ref:`object storage <swh-objstorage>`. - -.. note:: It is not required that a mirror is deployed using the |swh| software - stack. Other technologies, including different storage methods, can be - used. But we will focus in this documentation to the case of mirror - deployment using the |swh| software stack. - - -.. thumbnail:: images/mirror-architecture.svg - - General view of the |swh| mirroring architecture. - - -Mirroring the Graph Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The replication of the graph is based on a journal using Kafka_ as event -streaming platform. - -On the Software Heritage side, every addition made to the archive consist of -the addition of a :ref:`data-model` object. The new object is also serialized -as a msgpack_ bytestring which is used as the value of a message added to a -Kafka topic dedicated to the object type. 
- -The main Kafka topics for the |swh| :ref:`data-model` are: - -- `swh.journal.objects.content` -- `swh.journal.objects.directory` -- `swh.journal.objects.metadata_authority` -- `swh.journal.objects.metadata_fetcher` -- `swh.journal.objects.origin_visit_status` -- `swh.journal.objects.origin_visit` -- `swh.journal.objects.origin` -- `swh.journal.objects.raw_extrinsic_metadata` -- `swh.journal.objects.release` -- `swh.journal.objects.revision` -- `swh.journal.objects.skipped_content` -- `swh.journal.objects.snapshot` - -In order to set up a mirror of the graph, one needs to deploy a stack capable -of retrieving all these topics and store their content reliably. For example a -Kafka cluster configured as a replica of the main Kafka broker hosted by |swh| -would do the job (albeit not in a very useful manner by itself). - -A more useful mirror can be set up using the :ref:`storage <swh-storage>` -component with the help of the special service named `replayer` provided by the -:doc:`apidoc/swh.storage.replay` module. - -.. TODO: replace this previous link by a link to the 'swh storage replay' - command once available, and ideally once - https://github.com/sphinx-doc/sphinx/issues/880 is fixed - - -Mirroring the Object Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -File content (blobs) are *not* directly stored in messages of the -`swh.journal.objects.content` Kafka topic, which only contains metadata about -them, such as various kinds of cryptographic hashes. A separate component is in -charge of replicating blob objects from the archive and stored them in the -local object storage instance. - -A separate `swh-journal` client should subscribe to the -`swh.journal.objects.content` topic to get the stream of blob objects -identifiers, then retrieve corresponding blobs from the main Software Heritage -object storage, and store them in the local object storage. - -A reference implementation for this component is available in -:ref:`content replayer <swh-objstorage-replayer>`. 
- - -Installation ------------- - -When using the |swh| software stack to deploy a mirror, a number of |swh| -software components must be installed (cf. architecture diagram above): - -- a database to store the graph of the |swh| archive, -- the :ref:`swh-storage` component, -- an object storage solution (can be cloud-based or on local filesystem like - ZFS pools), -- the :ref:`swh-objstorage` component, -- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage` - package) -- the :ref:`swh.objstorage.replayer.replay` service (from the - :ref:`swh-objstorage-replayer` package). - -A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment -solution is provided as a working example of the mirror stack: - - https://forge.softwareheritage.org/source/swh-docker - -It is strongly recommended to start from there before planning a -production-like deployment. - -See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_ -file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_ -repository for details. - - -.. _Kafka: https://kafka.apache.org/ -.. _msgpack: https://msgpack.org +This page was moved to: :ref:`mirror`. diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py index 38c71ba3..892ac560 100755 --- a/swh/docs/sphinx/conf.py +++ b/swh/docs/sphinx/conf.py @@ -133,6 +133,8 @@ redirects = { "swh-deposit/metadata": "api/metadata.html", "swh-deposit/specs/blueprint": "../api/use-cases.html", "swh-deposit/user-manual": "api/user-manual.html", + "architecture": "architecture/overview.html", + "mirror": "architecture/mirror.html", } -- GitLab
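The mirror page added by this patch explains that each addition to the archive is serialized as a msgpack bytestring and published on a Kafka topic dedicated to its object type, and that `swh.journal.objects.content` messages carry only blob metadata (hashes, length), never the blob bytes themselves. The sketch below illustrates that message layout under stated assumptions: the field names are illustrative, not the exact swh.model schema, and the msgpack encoding step is elided to keep the sketch dependency-free.

```python
import hashlib

TOPIC_PREFIX = "swh.journal.objects"


def topic_for(object_type: str) -> str:
    # Illustrative helper (not part of the swh.journal API): map an
    # object type to its dedicated Kafka topic name.
    return f"{TOPIC_PREFIX}.{object_type}"


blob = b"print('hello world')\n"

# A "content" message carries metadata about the blob, but *not* the
# blob bytes themselves: a mirror must fetch those separately from the
# object storage. In the real journal this dict would be serialized as
# a msgpack bytestring before being used as the Kafka message value.
content_message = {
    "sha1": hashlib.sha1(blob).digest(),
    "sha256": hashlib.sha256(blob).digest(),
    "length": len(blob),
    "status": "visible",
}

assert topic_for("content") == "swh.journal.objects.content"
assert "data" not in content_message  # only metadata travels on the topic
```

This separation is what makes the two replayer services in the patch independent: the graph replayer only needs the Kafka topics, while the content replayer uses the `content` topic as a stream of identifiers to drive blob fetches.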