Verified Commit 8ab8240d authored by Antoine R. Dumont

Migrate upgrade swh services and storage db migration

This splits the upgrade swh services documentation from the intranet page into 2
distinct documentation pages.

Related to T3154
parent 4e9e5ce0
.. _data-migration:
How to handle data migrations
=============================
Empty page
----------
.. todo::
This page is a work in progress.
@@ -7,4 +7,4 @@ SWH Software Deployment
deployment-environments
upgrade-swh-service
deploy-lister
data-migration
storage-database-migration
.. _storage-database-migration:
How to handle a storage database migration
==========================================
.. admonition:: Intended audience
:class: important
sysadm staff members
If a storage database upgrade is needed, a migration script should already exist in the
*swh-storage* git repository.
.. _upgrade_version:
Upgrade version
---------------
Check the current database version (the first one in descending order):
.. code:: sql
select version from dbversion order by version desc limit 1;
Say, for example, that the result is 159.
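That is, the query output would look like the following (illustrative):
.. code::
 version
---------
     159
(1 row)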
Check the migration script folder in swh-storage:/sql/upgrades/ (and find the next one,
for example `160.sql
<https://forge.softwareheritage.org/source/swh-storage/browse/master/sql/upgrades/160.sql>`_).
The next script number is the db version just retrieved plus 1 (so 160 in the current
example).
Note that you may need to run more than one migration. That depends on the currently
packaged version and the next version we want to deploy. Check the git history to
determine this.
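For instance, assuming a local checkout of *swh-storage*, the most recent upgrade
scripts can be listed from a shell to spot the next one(s) to run (paths and version
numbers are illustrative):
.. code::
$ ls swh-storage/sql/upgrades/ | sort -n | tail -3
158.sql
159.sql
160.sql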
Prerequisite
------------
Ensure the migration script runs first in the staging database
(db0.internal.staging.swh.network is the node holding the swh staging database). Then
you can go ahead and run it in the production database
(belvedere.internal.softwareheritage.org).
Connect to the db with the user with write permission, then run the
script:
.. code::
$ psql -e ...
> \i sql/upgrades/160.sql
Note:
- *-e* so you can see each query as it runs, before its result
- For long-running scripts, connect to the remote machine first [5] [6]
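For the long-running case, one way to do this is to run the client from a *tmux*
session on the database node itself, as a sketch (the session name is arbitrary; pick
the staging or production host accordingly):
.. code::
$ ssh db0.internal.staging.swh.network  # or belvedere.internal.softwareheritage.org
$ tmux new -s db-migration              # keeps the migration running if the ssh connection drops
$ psql -e ...
> \i sql/upgrades/160.sql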
Adaptations
-----------
Hopefully, in production, the script runs as is without adaptation…
Otherwise, if the data volume for a given table is large, you may need to adapt it. See
`160.sql
<https://forge.softwareheritage.org/source/swh-storage/browse/master/sql/upgrades/160.sql>`_
and `its adaptation <https://forge.softwareheritage.org/P747>`_.
In such a case, consider working on ranges of the table id instead, so that the query
uses the index and each transaction stays short. A long-standing migration query
translates to a long-running transaction, which can lead to WAL accumulation (for the
replication), hence disk space starvation issues, etc.
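As a hypothetical sketch of such a range-based rewrite (the table and column names are
made up; see the linked adaptation of 160.sql for the real case):
.. code:: sql
-- Hypothetical: backfill a new column in batches of 1,000,000 ids so each
-- transaction stays short and uses the primary key index.
update some_table
set new_column = old_column
where id >= 0 and id < 1000000;
-- commit, then repeat with the next range [1000000, 2000000), and so on.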
Note
----
We use grafana to check that everything is fine (for example, for the replication, we use
the `postgresql database dashboard, bottom of the page to the right
<https://grafana.softwareheritage.org/d/PEKz-Ygiz/postgresql-server-overview?orgId=1&refresh=5m&from=1598405876817&to=1598427476817&var-instance=belvedere.internal.softwareheritage.org&var-cluster=:5433&var-datname=All&var-ntop_relations=5&var-interface=All&var-disk=All&var-filesystem=All&var-application_name=All&var-rate_interval=5m>`_).
We also use it to keep a record of what happened for a given deployment. For this,
open a grafana dashboard (for example the `worker task processing dashboard
<https://grafana.softwareheritage.org/d/b_xh3f9ik/worker-task-processing?orgId=1&from=now-6h&to=now>`_)
and add an annotation with the tag *deployment* (so it's shared across dashboards) and a
description of what the current deployment is about, usually the list of module names
and versions deployed.
.. _upgrade-swh-service:
How to upgrade swh service
==========================
.. admonition:: Intended audience
:class: important
sysadm staff members
Workers
-------
Dedicated workers [1] run our *swh-worker@loader_{git, hg, svn, npm, ...}* services.
When a new version is released, we need to upgrade their package(s).
[1] Here are the corresponding group names (in `clush
<https://clustershell.readthedocs.io/en/latest/index.html>`_ terms):
- *@swh-workers* for the production workers
- *@azure-workers* for the production ones running on azure
- *@staging-loader-workers* for the staging ones
See :ref:`deploy-new-lister` for a practical example.
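For example, to check which version of a loader package is currently installed on the
production workers (the debian package name here is illustrative):
.. code::
$ sudo clush -b -w @swh-workers 'dpkg -l python3-swh.loader.git | tail -1'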
Code and publish
----------------
.. _fix-or-evolve-code:
Code an evolution or fix an issue in the python code within the git repository's master
branch. Open a diff for review, land it when accepted, and start again at :ref:`tag and push
<tag-and-push>`.
.. _tag-and-push:
Tag and push
~~~~~~~~~~~~
When ready, `git tag` and `git push` the new tag of the module.
.. code::
$ git tag vA.B.C
$ git push origin --follow-tags
.. _publish-and-deploy:
Publish and deploy
~~~~~~~~~~~~~~~~~~
Let jenkins publish and deploy the debian package.
.. _troubleshoot:
Troubleshoot
~~~~~~~~~~~~
If jenkins fails for some reason, fix the module, be it the :ref:`python code
<fix-or-evolve-code>` or the :ref:`debian packaging <troubleshoot-debian-package>`.
.. _troubleshoot-debian-package:
Debian package troubleshoot
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In that case, update and check out the *debian/unstable-swh* branch, then fix whatever
is not updated or broken due to the change (usually a missing new package dependency to
add in *debian/control*). Add a new entry in *debian/changelog*. Make sure gbp builds
fine. Then tag it. Jenkins will build the package anew.
.. code::
$ gbp buildpackage --git-tag-only --git-sign-tag # tag it
$ git push origin --follow-tags # trigger the build
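As a minimal sketch, the fix-and-check cycle preceding that tag could look like the
following (the merge step and the changelog version string are assumptions about the
packaging workflow; adapt them as needed):
.. code::
$ git checkout debian/unstable-swh
$ git merge vA.B.C                            # bring in the new upstream tag
$ $EDITOR debian/control                      # e.g. add the missing package dependency
$ dch -v A.B.C-1~swh1 'New upstream release'  # add the debian/changelog entry
$ gbp buildpackage --git-ignore-new           # check that the package builds fine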
Deploy
------
.. _nominal_case:
Nominal case
~~~~~~~~~~~~
Update the machine's dependencies and restart the service. As a sudo user, that usually
means:
.. code::
$ apt-get update
$ apt-get dist-upgrade -y
$ systemctl restart swh-worker@loader_${type}
Note that this is for one machine you ssh into.
We usually wrap those commands from the sysadmin machine pergamon [3] with the *clush*
command, something like:
.. code::
$ sudo clush -b -w @swh-workers 'apt-get update; env DEBIAN_FRONTEND=noninteractive \
apt-get -o Dpkg::Options::="--force-confdef" \
-o Dpkg::Options::="--force-confold" -y dist-upgrade'
[3] pergamon is already configured for *clush*, allowing multiple parallel ssh
connections on our managed infrastructure nodes.
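The service restart can be wrapped the same way, for example for the git loaders
(adjust the loader type to what was deployed):
.. code::
$ sudo clush -b -w @swh-workers 'systemctl restart swh-worker@loader_git'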
.. _configuration-change-required:
Configuration change required
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Either wait for puppet to actually deploy the configuration changes first and then go
back to the nominal case, or force a puppet run:
.. code::
sudo clush -b -w @swh-workers puppet agent -t
Note: *-t* is not optional
.. _long-standing-migration:
Long-standing migration
~~~~~~~~~~~~~~~~~~~~~~~
In that case, the migration may require stopping all services for some time (because
lots of data is migrated, for example).
You need to momentarily stop puppet (which runs every 30 minutes to apply manifest
changes) and the cron service (which restarts stopped services) on the worker nodes.
Refer to the :ref:`storage database migration <storage-database-migration>` page for a
concrete case of database migration.
.. code::
$ sudo clush -b -w @swh-workers 'systemctl stop cron.service; puppet agent --disable'
Then:
- Execute the database migration.
- Go back to the nominal case.
- Restart puppet and the cron on the workers:
.. code::
$ sudo clush -b -w @swh-workers 'systemctl start cron.service; puppet agent --enable'