From e981b54a86e29ce6b494ff4f8660860d04e390af Mon Sep 17 00:00:00 2001 From: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Fri, 9 Apr 2021 11:03:48 +0200 Subject: [PATCH] Move the architecture overview and mirror documentation to the architecture/ subfolder To match the toctree. I'm leaving keycloak out of it, because it does not actually describe the architecture. It's only in the architecture toctree for lack of a better place for now. --- docs/architecture.rst | 93 +---------------------- docs/architecture/index.rst | 11 ++- docs/architecture/mirror.rst | 133 +++++++++++++++++++++++++++++++++ docs/architecture/overview.rst | 94 +++++++++++++++++++++++ docs/index.rst | 2 +- docs/mirror.rst | 133 +-------------------------------- swh/docs/sphinx/conf.py | 2 + 7 files changed, 241 insertions(+), 227 deletions(-) create mode 100644 docs/architecture/mirror.rst create mode 100644 docs/architecture/overview.rst diff --git a/docs/architecture.rst b/docs/architecture.rst index 1fa7e727..ae695f6e 100644 --- a/docs/architecture.rst +++ b/docs/architecture.rst @@ -1,92 +1,3 @@ -.. _architecture: +:orphan: -Software Architecture -===================== - -From an end-user point of view, the |swh| platform consists in the -:term:`archive`, which can be accessed using the web interface or its REST API. -Behind the scene (and the web app) are several components that expose -different aspects of the |swh| :term:`archive` as internal RPC APIs. - -Each of these internal APIs have a dedicated (Postgresql) database. - -A global (and incomplete) view of this architecture looks like: - -.. thumbnail:: images/general-architecture.svg - - General view of the |swh| architecture. 
- -The front API components are: - -- :ref:`Storage API <swh-storage>` (including the Metadata Storage) -- :ref:`Deposit API <swh-deposit>` -- :ref:`Vault API <swh-vault>` -- :ref:`Indexer API <swh-indexer>` -- :ref:`Scheduler API <swh-scheduler>` - -On the back stage of this show, a celery_ based game of tasks and workers -occurs to perform all the required work to fill, maintain and update the |swh| -:term:`archive`. - -The main components involved in this choreography are: - -- :term:`Listers <lister>`: a lister is a type of task aiming at scraping a - web site, a forge, etc. to gather all the source code repositories it can - find. For each found source code repository, a :term:`loader` task is - created. - -- :term:`Loaders <loader>`: a loader is a type of task aiming at importing or - updating a source code repository. It is the one that inserts :term:`blob` - objects in the :term:`object storage`, and inserts nodes and edges in the - :ref:`graph <swh-merkle-dag>`. - -- :term:`Indexers <indexer>`: an indexer is a type of task aiming at crawling - the content of the :term:`archive` to extract derived information (mimetype, - etc.) - -- :term:`Vault <vault>`: this type of celery task is responsible for cooking a - compressed archive (zip or tgz) of an archived object (typically a directory - or a repository). Since this can be a rather long process, it is delegated to - an asynchronous (celery) task. - - -Tasks ------ - -Listers -+++++++ - -The following sequence diagram shows the interactions between these components -when a new forge needs to be archived. This example depicts the case of a -gitlab_ forge, but any other supported source type would be very similar. - -.. 
thumbnail:: images/tasks-lister.svg - -As one might observe in this diagram, it does two things: - -- it asks the forge (a gitlab_ instance in this case) the list of known - repositories, and - -- it insert one :term:`loader` task for each source code repository that will - be in charge of importing the content of that repository. - -Note that most listers usually work in incremental mode, meaning they store in a -dedicated database the current state of the listing of the forge. Then, on a subsequent -execution of the lister, it will ask only for new repositories. - -Also note that if the lister inserts a new loading task for a repository for which a -loading task already exists, the existing task will be updated (if needed) instead of -creating a new task. - -Loaders -+++++++ - -The sequence diagram below describe this second step of importing the content -of a repository. Once again, we take the example of a git repository, but any -other type of repository would be very similar. - -.. thumbnail:: images/tasks-git-loader.svg - - -.. _celery: https://www.celeryproject.org -.. _gitlab: https://gitlab.com +This page was moved to: :ref:`architecture-overview`. diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst index c6064273..5311ca9a 100644 --- a/docs/architecture/index.rst +++ b/docs/architecture/index.rst @@ -1,10 +1,13 @@ -Architecture -============ +.. _architecture: + +Software Architecture +===================== + .. toctree:: :maxdepth: 2 :titlesonly: - ../architecture - ../mirror + overview + mirror ../keycloak/index diff --git a/docs/architecture/mirror.rst b/docs/architecture/mirror.rst new file mode 100644 index 00000000..7885df35 --- /dev/null +++ b/docs/architecture/mirror.rst @@ -0,0 +1,133 @@ +.. _mirror: + + +Mirroring +========= + + +Description +----------- + +A mirror is a full copy of the |swh| archive, operated independently from the +Software Heritage initiative. 
A minimal mirror consists of two parts:
+
+- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`),
+  which contains the Merkle DAG structure of the archive, *except* the
+  actual content of source code files (AKA blobs),
+
+- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`),
+  which contains all the blobs corresponding to archived source code files.
+
+However, a usable mirror also needs to be accessible by others. As such, a
+proper mirror should also allow users to:
+
+- navigate the archive copy using a Web browser and/or the Web API (typically
+  using the :ref:`web application <swh-web>`),
+- retrieve data from the copy of the archive (typically using the
+  :ref:`vault service <swh-vault>`).
+
+A mirror is initially populated and then kept up to date by consuming data
+from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the
+blob objects (file contents) from the |swh| :ref:`object storage <swh-objstorage>`.
+
+.. note:: A mirror does not have to be deployed using the |swh| software
+   stack. Other technologies, including different storage methods, can be
+   used. This documentation, however, focuses on the case of mirror
+   deployment using the |swh| software stack.
+
+
+.. thumbnail:: images/mirror-architecture.svg
+
+   General view of the |swh| mirroring architecture.
+
+
+Mirroring the Graph Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The replication of the graph is based on a journal using Kafka_ as an event
+streaming platform.
+
+On the Software Heritage side, every addition made to the archive consists of
+the addition of a :ref:`data-model` object. The new object is also serialized
+as a msgpack_ bytestring, which is used as the value of a message added to a
+Kafka topic dedicated to the object type.
+
+The main Kafka topics for the |swh| :ref:`data-model` are:
+
+- `swh.journal.objects.content`
+- `swh.journal.objects.directory`
+- `swh.journal.objects.metadata_authority`
+- `swh.journal.objects.metadata_fetcher`
+- `swh.journal.objects.origin_visit_status`
+- `swh.journal.objects.origin_visit`
+- `swh.journal.objects.origin`
+- `swh.journal.objects.raw_extrinsic_metadata`
+- `swh.journal.objects.release`
+- `swh.journal.objects.revision`
+- `swh.journal.objects.skipped_content`
+- `swh.journal.objects.snapshot`
+
+In order to set up a mirror of the graph, one needs to deploy a stack capable
+of retrieving all these topics and storing their content reliably. For
+example, a Kafka cluster configured as a replica of the main Kafka broker
+hosted by |swh| would do the job (albeit not in a very useful manner by itself).
+
+A more useful mirror can be set up using the :ref:`storage <swh-storage>`
+component with the help of the special service named `replayer` provided by the
+:doc:`apidoc/swh.storage.replay` module.
+
+.. TODO: replace this previous link by a link to the 'swh storage replay'
+   command once available, and ideally once
+   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
+
+
+Mirroring the Object Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+File contents (blobs) are *not* directly stored in messages of the
+`swh.journal.objects.content` Kafka topic, which only contains metadata about
+them, such as various kinds of cryptographic hashes. A separate component is in
+charge of replicating blob objects from the archive and storing them in the
+local object storage instance.
+
+A separate `swh-journal` client should subscribe to the
+`swh.journal.objects.content` topic to get the stream of blob object
+identifiers, then retrieve the corresponding blobs from the main Software
+Heritage object storage, and store them in the local object storage.
+
+A reference implementation for this component is available in
+:ref:`content replayer <swh-objstorage-replayer>`.
+
+
+Installation
+------------
+
+When using the |swh| software stack to deploy a mirror, a number of |swh|
+software components must be installed (cf. architecture diagram above):
+
+- a database to store the graph of the |swh| archive,
+- the :ref:`swh-storage` component,
+- an object storage solution (which can be cloud-based or on a local
+  filesystem such as ZFS pools),
+- the :ref:`swh-objstorage` component,
+- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
+  package),
+- the :ref:`swh.objstorage.replayer.replay` service (from the
+  :ref:`swh-objstorage-replayer` package).
+
+A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment
+solution is provided as a working example of the mirror stack:
+
+  https://forge.softwareheritage.org/source/swh-docker
+
+It is strongly recommended to start from there before planning a
+production-like deployment.
+
+See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_
+file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_
+repository for details.
+
+
+.. _Kafka: https://kafka.apache.org/
+.. _msgpack: https://msgpack.org
+
diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
new file mode 100644
index 00000000..abe51ac0
--- /dev/null
+++ b/docs/architecture/overview.rst
@@ -0,0 +1,94 @@
+.. _architecture-overview:
+
+Software Architecture Overview
+==============================
+
+
+From an end-user point of view, the |swh| platform consists of the
+:term:`archive`, which can be accessed using the web interface or its REST API.
+Behind the scenes (and the web app) are several components that expose
+different aspects of the |swh| :term:`archive` as internal RPC APIs.
+
+Each of these internal APIs has a dedicated (PostgreSQL) database.
+
+A global (and incomplete) view of this architecture looks like this:
+
+.. thumbnail:: images/general-architecture.svg
+
+   General view of the |swh| architecture.
+
+The front API components are:
+
+- :ref:`Storage API <swh-storage>` (including the Metadata Storage)
+- :ref:`Deposit API <swh-deposit>`
+- :ref:`Vault API <swh-vault>`
+- :ref:`Indexer API <swh-indexer>`
+- :ref:`Scheduler API <swh-scheduler>`
+
+Behind these APIs, a celery_-based system of tasks and workers performs all
+the work required to fill, maintain and update the |swh|
+:term:`archive`.
+
+The main components involved in this process are:
+
+- :term:`Listers <lister>`: a lister is a type of task that scrapes a web
+  site, a forge, etc. to gather all the source code repositories it can
+  find. For each source code repository found, a :term:`loader` task is
+  created.
+
+- :term:`Loaders <loader>`: a loader is a type of task that imports or
+  updates a source code repository. It is the one that inserts :term:`blob`
+  objects in the :term:`object storage`, and inserts nodes and edges in the
+  :ref:`graph <swh-merkle-dag>`.
+
+- :term:`Indexers <indexer>`: an indexer is a type of task that crawls the
+  content of the :term:`archive` to extract derived information (mimetype,
+  etc.)
+
+- :term:`Vault <vault>`: this type of celery task is responsible for cooking a
+  compressed archive (zip or tgz) of an archived object (typically a directory
+  or a repository). Since this can be a rather long process, it is delegated to
+  an asynchronous (celery) task.
+
+
+Tasks
+-----
+
+Listers
++++++++
+
+The following sequence diagram shows the interactions between these components
+when a new forge needs to be archived. This example depicts the case of a
+gitlab_ forge, but any other supported source type would be very similar.
+
+.. thumbnail:: images/tasks-lister.svg
+
+As this diagram shows, the lister does two things:
+
+- it asks the forge (a gitlab_ instance in this case) for the list of known
+  repositories, and
+
+- it inserts one :term:`loader` task for each source code repository, which
+  will be in charge of importing the content of that repository.
+
+Note that most listers work incrementally: they store the current state of the
+forge listing in a dedicated database, and on subsequent executions the lister
+only asks the forge for new repositories.
+
+Also note that if the lister inserts a new loading task for a repository for
+which a loading task already exists, the existing task will be updated (if
+needed) instead of creating a new task.
+
+Loaders
++++++++
+
+The sequence diagram below describes this second step of importing the content
+of a repository. Once again, we take the example of a git repository, but any
+other type of repository would be very similar.
+
+.. thumbnail:: images/tasks-git-loader.svg
+
+
+.. _celery: https://www.celeryproject.org
+.. _gitlab: https://gitlab.com
+
diff --git a/docs/index.rst b/docs/index.rst
index 6e46c059..50767286 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -25,7 +25,7 @@ Contributing
 Architecture
 ------------
 
-* :ref:`architecture` → get a glimpse of the Software Heritage software
+* :ref:`architecture-overview` → get a glimpse of the Software Heritage software
   architecture
 * :ref:`mirror` → learn what a Software Heritage mirror is and how to set
   up one
diff --git a/docs/mirror.rst b/docs/mirror.rst
index eea99429..116aae51 100644
--- a/docs/mirror.rst
+++ b/docs/mirror.rst
@@ -1,132 +1,3 @@
-.. _mirror:
+:orphan:
-
-
-Mirroring
-=========
-
-
-Description
------------
-
-A mirror is a full copy of the |swh| archive, operated independently from the
-Software Heritage initiative. 
A minimal mirror consists of two parts: - -- the graph storage (typically an instance of :ref:`swh.storage <swh-storage>`), - which contains the Merkle DAG structure of the archive, *except* the - actual content of source code files (AKA blobs), - -- the object storage (typically an instance of :ref:`swh.objstorage <swh-objstorage>`), - which contains all the blobs corresponding to archived source code files. - -However, a usable mirror needs also to be accessible by others. As such, a -proper mirror should also allow to: - -- navigate the archive copy using a Web browser and/or the Web API (typically - using the :ref:`the web application <swh-web>`), -- retrieve data from the copy of the archive (typically using the :ref:`the - vault service <swh-vault>`) - -A mirror is initially populated and maintained up-to-date by consuming data -from the |swh| Kafka-based :ref:`journal <journal-specs>` and retrieving the -blob objects (file content) from the |swh| :ref:`object storage <swh-objstorage>`. - -.. note:: It is not required that a mirror is deployed using the |swh| software - stack. Other technologies, including different storage methods, can be - used. But we will focus in this documentation to the case of mirror - deployment using the |swh| software stack. - - -.. thumbnail:: images/mirror-architecture.svg - - General view of the |swh| mirroring architecture. - - -Mirroring the Graph Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The replication of the graph is based on a journal using Kafka_ as event -streaming platform. - -On the Software Heritage side, every addition made to the archive consist of -the addition of a :ref:`data-model` object. The new object is also serialized -as a msgpack_ bytestring which is used as the value of a message added to a -Kafka topic dedicated to the object type. 
- -The main Kafka topics for the |swh| :ref:`data-model` are: - -- `swh.journal.objects.content` -- `swh.journal.objects.directory` -- `swh.journal.objects.metadata_authority` -- `swh.journal.objects.metadata_fetcher` -- `swh.journal.objects.origin_visit_status` -- `swh.journal.objects.origin_visit` -- `swh.journal.objects.origin` -- `swh.journal.objects.raw_extrinsic_metadata` -- `swh.journal.objects.release` -- `swh.journal.objects.revision` -- `swh.journal.objects.skipped_content` -- `swh.journal.objects.snapshot` - -In order to set up a mirror of the graph, one needs to deploy a stack capable -of retrieving all these topics and store their content reliably. For example a -Kafka cluster configured as a replica of the main Kafka broker hosted by |swh| -would do the job (albeit not in a very useful manner by itself). - -A more useful mirror can be set up using the :ref:`storage <swh-storage>` -component with the help of the special service named `replayer` provided by the -:doc:`apidoc/swh.storage.replay` module. - -.. TODO: replace this previous link by a link to the 'swh storage replay' - command once available, and ideally once - https://github.com/sphinx-doc/sphinx/issues/880 is fixed - - -Mirroring the Object Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -File content (blobs) are *not* directly stored in messages of the -`swh.journal.objects.content` Kafka topic, which only contains metadata about -them, such as various kinds of cryptographic hashes. A separate component is in -charge of replicating blob objects from the archive and stored them in the -local object storage instance. - -A separate `swh-journal` client should subscribe to the -`swh.journal.objects.content` topic to get the stream of blob objects -identifiers, then retrieve corresponding blobs from the main Software Heritage -object storage, and store them in the local object storage. - -A reference implementation for this component is available in -:ref:`content replayer <swh-objstorage-replayer>`. 
- - -Installation ------------- - -When using the |swh| software stack to deploy a mirror, a number of |swh| -software components must be installed (cf. architecture diagram above): - -- a database to store the graph of the |swh| archive, -- the :ref:`swh-storage` component, -- an object storage solution (can be cloud-based or on local filesystem like - ZFS pools), -- the :ref:`swh-objstorage` component, -- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage` - package) -- the :ref:`swh.objstorage.replayer.replay` service (from the - :ref:`swh-objstorage-replayer` package). - -A `docker-swarm <https://docs.docker.com/engine/swarm/>`_ based deployment -solution is provided as a working example of the mirror stack: - - https://forge.softwareheritage.org/source/swh-docker - -It is strongly recommended to start from there before planning a -production-like deployment. - -See the `README <https://forge.softwareheritage.org/source/swh-docker/browse/master/README.md>`_ -file of the `swh-docker <https://forge.softwareheritage.org/source/swh-docker>`_ -repository for details. - - -.. _Kafka: https://kafka.apache.org/ -.. _msgpack: https://msgpack.org +This page was moved to: :ref:`mirror`. diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py index 38c71ba3..892ac560 100755 --- a/swh/docs/sphinx/conf.py +++ b/swh/docs/sphinx/conf.py @@ -133,6 +133,8 @@ redirects = { "swh-deposit/metadata": "api/metadata.html", "swh-deposit/specs/blueprint": "../api/use-cases.html", "swh-deposit/user-manual": "api/user-manual.html", + "architecture": "architecture/overview.html", + "mirror": "architecture/mirror.html", } -- GitLab
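The mirror page added by this patch explains that each addition to the archive is serialized as a msgpack bytestring and published on a Kafka topic dedicated to its object type, and that `swh.journal.objects.content` messages carry only blob metadata (hashes, length), never the blob bytes themselves. The sketch below illustrates that message layout under stated assumptions: the field names are illustrative, not the exact swh.model schema, and the msgpack encoding step is elided to keep the sketch dependency-free.

```python
import hashlib

TOPIC_PREFIX = "swh.journal.objects"


def topic_for(object_type: str) -> str:
    # Illustrative helper (not part of the swh.journal API): map an
    # object type to its dedicated Kafka topic name.
    return f"{TOPIC_PREFIX}.{object_type}"


blob = b"print('hello world')\n"

# A "content" message carries metadata about the blob, but *not* the
# blob bytes themselves: a mirror must fetch those separately from the
# object storage. In the real journal this dict would be serialized as
# a msgpack bytestring before being used as the Kafka message value.
content_message = {
    "sha1": hashlib.sha1(blob).digest(),
    "sha256": hashlib.sha256(blob).digest(),
    "length": len(blob),
    "status": "visible",
}

assert topic_for("content") == "swh.journal.objects.content"
assert "data" not in content_message  # only metadata travels on the topic
```

This separation is what makes the two replayer services in the patch independent: the graph replayer only needs the Kafka topics, while the content replayer uses the `content` topic as a stream of identifiers to drive blob fetches.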