Commit f1d69cd6 authored by David Douard
Add some documentation on how to operate a seaweedfs cluster for a mirror

You may want to read:
- :ref:`mirror_monitor` to learn how to monitor your mirror and how to report
  its health back to |swh|.
- :ref:`mirror_seaweedfs` for tips and explanations on how to use SeaweedFS_ as
  the objstorage backend for a mirror.
- :ref:`mirror_onboard` for the |swh| side view of adding a new mirror.
.. _SeaweedFS: https://github.com/seaweedfs/seaweedfs/
.. toctree::
:hidden:
planning
deploy
seaweedfs
takedown-notices
onboard
monitor
.. _mirror_seaweedfs:
Using seaweedfs as object storage backend
=========================================
When deploying a mirror, one of the key technological choices you have to make
is the backend used for the Software Heritage object storage.
The :ref:`swh-objstorage` module supports a variety of backends, one of which
is seaweedfs_.
seaweedfs_ is a fast distributed storage system for blobs, objects, files, and
data lakes, able to handle billions of files. It comes with many valuable
features that make it well suited as a :ref:`swh-objstorage` backend:
- Distributed architecture, which makes it easy to grow the objstorage cluster
  when needed.
- Small files are packed into big (30GB by default) volumes, implementing the
  concepts behind the Haystack storage system developed by Facebook and
  preventing space amplification problems.
- Support for `Erasure Coding`_, providing data redundancy at a fraction of the
  storage cost of plain replication.
- Support for asynchronous cross-DC replication.
For seaweedfs to be used as an objstorage backend, it needs:
- one seaweedfs master (ideally at least 3 for HA),
- one seaweedfs filer using a fast key-value store as backend (e.g. leveldb or
  redis),
- a number of seaweedfs volume servers with enough capacity to store the whole
  Software Heritage Archive object storage (including the erasure coding or
  replication overhead, if used).
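For illustration, here is a minimal single-host sketch of these three
components, started with plain ``weed`` subcommands; the ports and data
directories are placeholders, and a real mirror deployment would spread the
components over several machines:

.. code-block:: bash

   # illustrative layout only; adjust ports, paths and hosts to your deployment
   weed master -mdir=/srv/seaweedfs/master -port=9333 &
   weed volume -dir=/srv/seaweedfs/vol1 -mserver=localhost:9333 -port=8080 &
   weed filer -master=localhost:9333 -port=8888 &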
The key to achieving good replication performance is to have a good
communication backbone between the nodes of the seaweedfs cluster. A properly
configured 10G network is the bare minimum.
For example, on a seaweedfs cluster consisting of 4 bare metal machines hosting
all the seaweedfs components (3x master, 1x filer, 4x volume servers), the
content replication process can achieve a steady throughput of 2000 objects/s
at about 200MB/s (giving a total replication ETA of about 3 months as of March
2023).
.. _`Erasure Coding`: https://en.wikipedia.org/wiki/Erasure_code
Master
------
The master configuration is pretty straightforward. Typically, you will want to
ensure the following options of the ``weed master`` command are set:
- ``volumePreallocate``: Preallocate disk space for volumes
- ``metrics.address``: Prometheus gateway address
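For illustration, these options translate directly into ``weed master``
command-line flags; in the sketch below, the data directory and the Prometheus
gateway address are placeholders:

.. code-block:: bash

   # placeholder paths and addresses
   weed master -mdir=/srv/seaweedfs/master \
        -volumePreallocate \
        -metrics.address=pushgateway.example.org:9091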
You will also want to have a fair number of writable volumes set up (for good
concurrent write performance); this can be done in the :file:`master.toml`
configuration file:
.. code-block:: ini
[master.volume_growth]
copy_1 = 32
This ensures there are always at least 32 volumes available for writing.
See `the SeaweedFS optimization page
<https://github.com/seaweedfs/seaweedfs/wiki/Optimization>`_ for more details.
Having multiple masters is a good idea; note that since the masters use the
Raft consensus protocol, their number should be odd (typically 3 or 5).
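For example (a hedged sketch, host names being placeholders), each of three
masters can be started with the ``-peers`` option listing all the master nodes:

.. code-block:: bash

   # run on master1.example.org; repeat on the other hosts with their own -ip
   weed master -ip=master1.example.org -port=9333 \
        -mdir=/srv/seaweedfs/master \
        -peers=master1.example.org:9333,master2.example.org:9333,master3.example.org:9333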
Replication and erasure coding
++++++++++++++++++++++++++++++
Seaweedfs supports both replication (keeping multiple copies of each content
object) and erasure coding. The erasure coding implemented in seaweedfs is 10.4
Reed-Solomon Erasure Coding (EC): large volumes are split into chunks of 1GB,
and every 10 data chunks are encoded into 4 additional parity chunks. So a 30 GB
data volume is encoded into 14 EC shards, each shard being 3 GB in size and made
of 3 EC blocks. This tolerates the loss of up to 4 shards while using only 1.4x
the data size. Compared to replicating the data 5 times to achieve the same
robustness, it saves 3.6x disk space.
Erasure Coding is only activated for warm storage, i.e. storage volumes that
are sealed (almost full, not part of the active volumes in which new writes are
done, and have not been modified for a while).
As such, the EC encoding is performed by an asynchronous background process
initiated by the master. It is typically set up using this piece of
:file:`master.toml` configuration:
.. code-block:: ini
[master.maintenance]
# these scripts are run periodically, just as if executed from 'weed shell'
scripts = """
ec.encode -fullPercent=95 -quietFor=1h
ec.rebuild -force
ec.balance -force
"""
sleep_minutes = 17 # sleep minutes between each script execution
For replication, you can choose between different replication strategies:
number of copies, placement constraints for copies (server, rack, datacenter);
see `the documentation
<https://github.com/seaweedfs/seaweedfs/wiki/Replication>`_ for more details.
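As a hedged illustration, a cluster-wide default strategy can be set on the
master with the ``-defaultReplication`` flag; its value is a three-digit code
counting extra copies across data centers, racks and servers respectively, so
``001`` keeps one extra copy on another server in the same rack:

.. code-block:: bash

   # placeholder path; "001" = one extra copy on a different server, same rack
   weed master -mdir=/srv/seaweedfs/master -defaultReplication=001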
Filer
-----
The filer is the component used to make the link between the way objects are
identified in software heritage (using the hash of the object as identifier)
and the location of the objects (as well as a few other metadata entries) in a
seaweedfs volume.
This is a rather simple key/value store; the seaweedfs filer actually uses an
existing k/v store as backend. By default, it will use a local leveldb store,
putting all the objects in a flat namespace. This works OK up to a few billion
objects, but it might be a good idea to organize the filer namespace a bit.
When using leveldb as backend k/v store, the disk space needed to index the
whole |swh| archive objstorage is about 2TB (as of March 2023).
Multi-filer deployment
++++++++++++++++++++++
The seaweedfs filer supports running multiple filer instances. When using a
shared/distributed k/v store as backend (redis, postgresql, cassandra, HBase,
etc.), the filer itself is stateless, so it is easy to deploy several filer
instances. Seaweedfs also supports parallel filers with an embedded filer store
(e.g. leveldb), but in this case the asynchronous and eventually consistent
nature of the replication process between the different filer instances makes
it unsuitable for the |swh| objstorage mirroring workload.
It is however possible to use several filer instances (for HA or load
balancing) when using a shared or distributed k/v store as backend (typically
redis).
Note that using multiple filers with an embedded filer store (leveldb) remains
possible as a HA or backup solution; as long as all the content replayers
target the same filer instance for writing objects, it should be fine. Doing
so, the filer is (eventually) replicated and the behavior of the replication
process remains consistent.
Note that a single leveldb filer can support up to 2000 object insertions / sec
(with fast SSD storage).
Using Redis as filer store
++++++++++++++++++++++++++
For a |swh| mirror deployment, it is probably a good idea to go with a solution
based on a shared filer store (redis should be the best choice to ensure good
performance during the replication process).
However there is a small catch: a redis filer store (``redis2``) won't be able
to handle having all the |swh| content entries in a single (flat) directory.
Seaweedfs provides an alternate redis backend implementation, ``redis3``, that
can circumvent this issue. This comes at the cost of slightly slower
insertions.
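For reference, the ``redis3`` store is enabled in :file:`filer.toml`; a minimal
sketch could look like the following (the exact keys may vary with the
seaweedfs version, ``weed scaffold -config=filer`` prints an up-to-date
template):

.. code-block:: ini

   [redis3]
   enabled = true
   address = "localhost:6379"   # placeholder redis address
   password = ""
   database = 0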
Another approach is to configure the |swh| ``seaweedfs`` objstorage to use a
slicer: this will organize the objects in a directory structure built from the
object hash. For example, using this |swh| objstorage configuration:
.. code-block:: yaml
objstorage_dst:
cls: seaweedfs
compression: none
url: http://localhost:8088/swh/
slicing: "0:2/2:4"
will organize the placement of objects in a tree structure with 2 levels of
directories. For instance, an object with SHA1
``34973274ccef6ab4dfaaf86599792fa9c3fe4689`` will be located at
``http://localhost:8088/swh/34/97/34973274ccef6ab4dfaaf86599792fa9c3fe4689``.
Note that using the slicer also comes with a slight performance hit.
See `this page
<https://github.com/seaweedfs/seaweedfs/wiki/Filer-Redis-Setup#super-large-directories>`_
for more details on the ``redis3`` filer backend, and `this one
<https://github.com/seaweedfs/seaweedfs/wiki/Super-Large-Directories>`_ for a
more general discussion about handling super large directories in seaweedfs
filer.
Filer backup
++++++++++++
The filer db is a key component of the seaweedfs backend; its content can be
rebuilt from the volumes, but the time it takes to do so will probably be in
the range of 1 to 2 months, so better not to lose it!
One can use the integrated filer backup tool to perform continuous backups of
the filer metadata using the ``weed filer.meta.backup`` command. This command
needs a :file:`backup_filer.toml` file specifying the filer store that will be
used as backup. For example, to back up an existing filer into a local leveldb,
the :file:`backup_filer.toml` file would look like:
.. code-block:: ini
[leveldb2]
enabled = true
dir = "/tmp/filerrdb" # directory to store leveldb files
then:
.. code-block:: bash
weed filer.meta.backup -config ./backup_filer.toml -filer localhost:8581
I0315 11:18:44 95975 leveldb2_store.go:42] filer store leveldb2 dir: /tmp/filerrdb
I0315 11:18:44 95975 file_util.go:23] Folder /tmp/filerrdb Permission: -rwxr-xr-x
I0315 11:18:45 95975 filer_meta_backup.go:112] configured filer store to leveldb2
I0315 11:18:45 95975 filer_meta_backup.go:80] traversing metadata tree...
I0315 11:18:45 95975 filer_meta_backup.go:86] metadata copied up to 2023-03-15 11:18:45.559363829 +0000 UTC m=+0.915180010
I0315 11:18:45 95975 filer_meta_backup.go:154] streaming from 2023-03-15 11:18:45.559363829 +0000 UTC
It can be interrupted and restarted. Using the ``-restart`` argument forces a
full backup (rather than resuming the incremental backup).
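For example, to discard the incremental state and take a fresh full backup of
the same filer:

.. code-block:: bash

   weed filer.meta.backup -config ./backup_filer.toml -filer localhost:8581 -restart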
This is however very similar to having multiple filers configured in the
seaweedfs cluster. Since this is a backup component, it can be a good idea not
to make this backup filer store part of the cluster at all.
Volume
------
seaweedfs volume servers are the backbone of a seaweedfs cluster: this is
where objects are stored. Each volume server will manage a number of volumes,
each of which is by default about 30GB.
Content objects are packed as small (compressed) chunks into one or more
volumes (so a single content object can be spread across several volumes).
When using several disks on one volume server, you may use hardware or software
RAID or similar redundancy techniques, but you may also prefer to run one
volume server per disk. When using safety features like erasure coding or
volume replication, seaweedfs can ensure shards and replicas are safely
distributed among servers; depending on the number of servers and the number of
disks, make sure that losing one (or more) server does not lead to data loss.
By default, each volume server will use memory to store the indexes of its
volumes (keeping track of the position of each chunk in the local volumes).
This can lead to a few problems when the number of volumes increases:
- extended startup time of the volume server (it needs to load all volume file
  indexes at startup), and
- memory consumption can grow pretty large.
In the context of a |swh| mirror, it's a good idea to have `volume servers
using a leveldb index
<https://github.com/seaweedfs/seaweedfs/wiki/Optimization#use-leveldb>`_. This
can be done using the ``weed volume -index=leveldb`` command line argument
(note there are 3 leveldb index variants: ``leveldb``, ``leveldbMedium`` and
``leveldbLarge``; see the seaweedfs documentation for more details).
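As an illustrative sketch (master address, port and data directory being
placeholders), a volume server for a mirror could thus be started with:

.. code-block:: bash

   weed volume -dir=/srv/seaweedfs/vol1 \
        -mserver=master1.example.org:9333 \
        -port=8080 \
        -index=leveldb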
.. _seaweedfs: https://github.com/seaweedfs/seaweedfs/