Commit e8d48567 authored by David Douard

Add the beginning of a top-level architecture document

parent c0f71e6c
1 merge request: !11 Add the beginning of a top-level architecture document
......@@ -22,6 +22,17 @@ complete [Software Heritage development environment][2].
How to build the doc
--------------------
Ensure you have the required tools to generate images ([graphviz][3]'s `dot`
and [plantuml][4]). On a Debian system:
$ sudo apt install plantuml graphviz
[3]: https://graphviz.org
[4]: http://plantuml.com
Then
$ cd docs
$ make html
......
.. _architecture:
Software Architecture
=====================
From an end-user point of view, the |swh| platform consists of the
:term:`archive`, which can be accessed using the web interface or its REST API.
Behind the scenes (and the web app) are several components that expose
different aspects of the |swh| :term:`archive` as internal REST APIs.
Each of these internal APIs has a dedicated (PostgreSQL) database.
A global view of this architecture looks like:
.. thumbnail:: images/general-architecture.svg
General view of the |swh| architecture.
The front API components are:
- :ref:`Storage API <swh-storage>`
- :ref:`Deposit API <swh-deposit>`
- :ref:`Vault API <swh-vault>`
- :ref:`Indexer API <swh-indexer>`
- :ref:`Scheduler API <swh-scheduler>`
Backstage, a set of celery_ tasks and workers performs all the work
required to fill, maintain and update the |swh|
:term:`archive`.
The main components involved in this choreography are:
- :term:`Listers <lister>`: a lister is a type of task that scrapes a
web site, a forge, etc. to gather all the source code repositories it can
find. For each source code repository found, a :term:`loader` task is
created.
- :term:`Loaders <loader>`: a loader is a type of task that imports or
updates a source code repository. It is the one that inserts :term:`blob`
objects in the :term:`object storage`, and inserts nodes and edges in the
:ref:`graph <swh-merkle-dag>`.
- :term:`Indexers <indexer>`: an indexer is a type of task that crawls
the content of the :term:`archive` to extract derived information (MIME type,
etc.).
Tasks
-----
The following sequence diagram shows the interactions between these components
when a new forge needs to be archived. This example depicts the case of a
gitlab_ forge, but any other supported source type would be very similar.
.. thumbnail:: images/tasks-lister.svg
As this diagram shows, the lister task does two things:
- it adds one :term:`origin` object to the :term:`storage` database for each
source code repository, and
- it inserts one :term:`loader` task for each source code repository, which will
be in charge of importing the content of that repository.
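
The two effects of a lister run can be sketched in Python as follows. This is only an illustration under stated assumptions: `InMemoryStorage`, `InMemoryScheduler`, `list_forge` and the example URLs are hypothetical stand-ins, not the actual storage or scheduler APIs. The task type name `loader-git` is taken from the sequence diagram.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStorage:
    """Hypothetical stand-in for the storage API."""
    origins: list = field(default_factory=list)

    def create_origin(self, url: str) -> None:
        self.origins.append({"type": "git", "url": url})


@dataclass
class InMemoryScheduler:
    """Hypothetical stand-in for the scheduler API."""
    tasks: list = field(default_factory=list)

    def create_task(self, task_type: str, url: str) -> None:
        self.tasks.append({"type": task_type, "url": url})


def list_forge(repo_urls, storage, scheduler):
    """For each repository: record one origin, then schedule one loader task."""
    for url in repo_urls:
        storage.create_origin(url)
        scheduler.create_task("loader-git", url)


storage = InMemoryStorage()
scheduler = InMemoryScheduler()
list_forge(
    ["https://gitlab.example/a.git", "https://gitlab.example/b.git"],
    storage,
    scheduler,
)
print(len(storage.origins), len(scheduler.tasks))
```

Each repository yields exactly one origin and one pending loader task, which is why the two collections always have the same length after a run.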
The sequence diagram below describes this second step of importing the content
of a repository. Once again, we take the example of a git repository, but any
other type of repository would be very similar.
.. thumbnail:: images/tasks-git-loader.svg
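
The order in which a loader sends objects to the storage can be sketched as below. This is a simplified illustration, not the real loader: `FakeStorage`, `load_repository` and the `"uneventful"` status name are assumptions made for the example. Only the object kinds, their order, and the `eventful` status come from the diagram.

```python
from collections import defaultdict

# Order in which the diagram sends objects to the storage API.
LOAD_ORDER = ["content", "directory", "revision", "release", "snapshot"]


class FakeStorage:
    """Hypothetical in-memory stand-in for the storage API."""

    def __init__(self):
        self.objstore = []           # blobs go to the object storage
        self.db = defaultdict(list)  # every other kind goes to the storage DB

    def add(self, kind, objects):
        target = self.objstore if kind == "content" else self.db[kind]
        target.extend(objects)


def load_repository(fetched, storage):
    """Send fetched objects to the storage one kind at a time, snapshot last."""
    for kind in LOAD_ORDER:
        storage.add(kind, fetched.get(kind, []))
    # "uneventful" is an assumed name for a run that found nothing new.
    return "eventful" if any(fetched.values()) else "uneventful"


fetched = {
    "content": [b"hello"],
    "directory": ["d1"],
    "revision": ["r1"],
    "release": [],
    "snapshot": ["s1"],
}
storage = FakeStorage()
status = load_repository(fetched, storage)
print(status)
```

Sending the snapshot last means that, in this sketch, a snapshot is only recorded once everything it points to has already been stored.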
.. _celery: https://www.celeryproject.org
.. _gitlab: https://gitlab.com
......@@ -6,7 +6,10 @@ DEP_GRAPHS += $(patsubst %,%.pdf,$(DEP_GRAPHS_base))
DEP_GRAPHS += $(patsubst %,%.svg,$(DEP_GRAPHS_base))
PY_DEPGRAPH = ../bin/py-depgraph
all: $(DEP_GRAPHS)
UML_DIAGS_SRC = $(wildcard *.uml)
UML_DIAGS = $(patsubst %.uml,%.svg,$(UML_DIAGS_SRC))
all: $(DEP_GRAPHS) $(UML_DIAGS)
py-deps-all.dot: $(PY_DEPGRAPH) $(PY_REQUIREMENTS)
cd ../../.. ; $(CURDIR)/$(PY_DEPGRAPH) > $(CURDIR)/$@
......@@ -23,5 +26,8 @@ py-deps-ext.dot: $(PY_DEPGRAPH) $(PY_REQUIREMENTS)
%.svg: %.dot
dot -T svg $< > $@
%.svg: %.uml
plantuml -tsvg $<
clean:
-rm -f $(DEP_GRAPHS)
-rm -f $(DEP_GRAPHS) $(UML_DIAGS)
(Source diff not displayed: the file is too large.)
@startuml
participant SCH_DB as "scheduler DB" #B0C4DE
participant SCH_RUN as "scheduler runner"
participant SCH_LS as "scheduler listener"
participant RMQ as "Rabbit-MQ"
participant OBJSTORE as "object storage"
participant STORAGE_DB as "storage DB" #B0C4DE
participant STORAGE_API as "storage API"
participant WORK_GIT as "worker@loader-git"
participant GIT as "git server"
Note over SCH_DB,SCH_RUN: Task T2 created beforehand \n by the lister-gitlab task
loop Polling
SCH_RUN->>SCH_DB: GET TASK set state=scheduled
SCH_DB-->>SCH_RUN: TASK id=T2
activate SCH_RUN
SCH_RUN->>RMQ: CREATE Celery Task CT2 loader-git
deactivate SCH_RUN
activate RMQ
end
RMQ->>WORK_GIT: Start task CT2
deactivate RMQ
activate WORK_GIT
WORK_GIT->>STORAGE_API: GET origin state
activate STORAGE_API
STORAGE_API-->>WORK_GIT: 200
deactivate STORAGE_API
WORK_GIT->>GIT: GET refs
activate GIT
GIT-->>WORK_GIT: 200 / refs
deactivate GIT
WORK_GIT->>GIT: GET new_objects
activate GIT
GIT-->>WORK_GIT: 200 / objects
deactivate GIT
WORK_GIT->>GIT: PACKFILE
activate GIT
GIT-->>WORK_GIT: 200 / blobs
deactivate GIT
WORK_GIT->>STORAGE_API: LOAD NEW CONTENT
activate STORAGE_API
loop For each blob
STORAGE_API->>OBJSTORE: ADD BLOB
end
STORAGE_API-->>WORK_GIT: 200 / blobs
deactivate STORAGE_API
WORK_GIT->>STORAGE_API: NEW DIR
activate STORAGE_API
loop For each DIR
STORAGE_API->>STORAGE_DB: INSERT DIR
end
STORAGE_API-->>WORK_GIT: 201
deactivate STORAGE_API
WORK_GIT->>STORAGE_API: NEW REV
activate STORAGE_API
loop For each REV
STORAGE_API->>STORAGE_DB: INSERT REV
end
STORAGE_API-->>WORK_GIT: 201
deactivate STORAGE_API
WORK_GIT->>STORAGE_API: NEW REL
activate STORAGE_API
loop For each REL
STORAGE_API->>STORAGE_DB: INSERT REL
end
STORAGE_API-->>WORK_GIT: 201
deactivate STORAGE_API
WORK_GIT->>STORAGE_API: NEW SNAPSHOT
activate STORAGE_API
loop For each SNAPSHOT
STORAGE_API->>STORAGE_DB: INSERT SNAPSHOT
end
STORAGE_API-->>WORK_GIT: 201
deactivate STORAGE_API
WORK_GIT-->>RMQ: SET CT2 status=eventful
deactivate WORK_GIT
activate RMQ
RMQ->>SCH_LS: NOTIFY end of task CT2
deactivate RMQ
activate SCH_LS
SCH_LS->>SCH_DB: UPDATE T2 set state=end
deactivate SCH_LS
@enduml
@startuml
participant WEB as "swh-web"
participant SCH_API as "scheduler API" #ECECFF
participant SCH_DB as "scheduler DB" #B0C4DE
participant SCH_RUN as "scheduler runner"
participant RMQ as "Rabbit-MQ"
participant SCH_LS as "scheduler listener"
participant WORK_GITLAB as "worker@gitlab-lister"
participant GITLAB as "gitlab API"
participant STORAGE_API as "storage API" #ECECFF
participant STORAGE_DB as "storage DB" #B0C4DE
Note over WEB,SCH_API: Save gitlab forge 0xdeadbeef
WEB->>SCH_API: CREATE TASK lister-gitlab
activate WEB
activate SCH_API
SCH_API->>SCH_DB: INSERT TASK
activate SCH_DB
SCH_API-->>WEB: 201
deactivate SCH_API
deactivate WEB
loop Polling
SCH_RUN->>SCH_DB: GET TASK set state=scheduled
SCH_DB-->>SCH_RUN: TASK id=T1
deactivate SCH_DB
activate SCH_RUN
SCH_RUN->>RMQ: CREATE Celery Task CT1
deactivate SCH_RUN
activate RMQ
end
RMQ->>WORK_GITLAB: Start task CT1
deactivate RMQ
activate WORK_GITLAB
WORK_GITLAB->>GITLAB: Get git repos
activate GITLAB
GITLAB-->>WORK_GITLAB: Known git repos
deactivate GITLAB
loop For Each Repo
WORK_GITLAB->>STORAGE_API: CREATE ORIGIN
activate STORAGE_API
WORK_GITLAB->>SCH_API: CREATE TASK loader-git
activate SCH_API
STORAGE_API->>STORAGE_DB: INSERT ORIGIN
STORAGE_API-->>WORK_GITLAB: 201
deactivate STORAGE_API
SCH_API->>SCH_DB: INSERT TASK
SCH_API-->>WORK_GITLAB: 201
deactivate SCH_API
end
WORK_GITLAB-->>RMQ: SET CT1 status=eventful
deactivate WORK_GITLAB
activate RMQ
RMQ->>SCH_LS: NOTIFY end of task CT1
activate SCH_LS
deactivate RMQ
SCH_LS->>SCH_DB: UPDATE T1 set state=end
deactivate SCH_LS
@enduml
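
Both diagrams share the same scheduling loop: the runner polls the scheduler DB, a worker picks the task up from the queue, and the listener records its end. A minimal Python sketch of that lifecycle follows. The in-memory `scheduler_db` and `queue`, the three function names and the initial `"pending"` state are assumptions made for illustration. The `scheduled` and `end` states and the `eventful` status are the ones named in the diagrams.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the scheduler DB and the message queue.
scheduler_db = {"T1": {"type": "lister-gitlab", "state": "pending"}}
queue = deque()


def runner_poll(db, queue):
    """Scheduler runner: mark pending tasks as scheduled and enqueue them."""
    for task_id, task in db.items():
        if task["state"] == "pending":
            task["state"] = "scheduled"
            queue.append(task_id)


def worker_run(queue, done):
    """Worker: consume queued tasks and report their completion status."""
    while queue:
        task_id = queue.popleft()
        done.append((task_id, "eventful"))


def listener_notify(db, done):
    """Scheduler listener: record the end of each completed task."""
    for task_id, _status in done:
        db[task_id]["state"] = "end"


done = []
runner_poll(scheduler_db, queue)
state_after_poll = scheduler_db["T1"]["state"]
worker_run(queue, done)
listener_notify(scheduler_db, done)
print(state_after_poll, scheduler_db["T1"]["state"])
```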
......@@ -15,6 +15,13 @@ Getting started
stack
Architecture
------------
* :ref:`architecture` ← go there for a glimpse of the Software Heritage software
architecture
Components
----------
......@@ -125,5 +132,6 @@ Indices and tables
:hidden:
:glob:
architecture
getting-started
swh-*/index
......@@ -17,9 +17,10 @@ author = 'the Software Heritage developers'
# ones.
extensions = ['sphinx.ext.autodoc',
'sphinx.ext.napoleon',
# 'sphinx.ext.intersphinx',
'sphinxcontrib.httpdomain',
'sphinx.ext.extlinks']
'sphinx.ext.extlinks',
'sphinxcontrib.images',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
......