Migrate production database servers to bullseye
This task migrates the OS release of the database (db) nodes from buster to bullseye. It does not migrate the postgresql version itself; that is covered by another dedicated task [1].
Servers to migrate:
- db1.internal.staging.swh.network
- belvedere.internal.softwareheritage.org
- somerset.internal.softwareheritage.org
Plan:
Staging: moved to subtask #3813 (closed)
Production:
- upgrade somerset
- switch the webapp to belvedere
- on somerset
- disable puppet
- stop and disable postgresql
- perform the last buster upgrade
- reboot (restarted recently)
- perform the bullseye upgrade
- reboot
- restart and enable postgresql
- check the replication with belvedere is ok
- switch back the webapp to somerset
- upgrade of belvedere
- add a notification on the status.io page
A database upgrade is scheduled on XXXX-XX-XX between XX:XX and XX:XX.
Some service disruptions may occur during this period.
Impacted services:
- archive.softwareheritage.org
- Save code now
- Source code crawler
- deposit
- [x] connect to the idrac: https://swh9-adm.inria.fr/
- [x] stop the loaders and listers workers
- [x] stop the indexers
- [x] stop the scheduler runners + those in the tmux in saatchi
- [x] **ensure the provenance experiment is stopped**
- [x] on belvedere:
- [X] stop puppet
- [X] ~~stop and disable postgresql**s** (to avoid the restarts after the server reboots)~~ can be ignored
- [x] ~~perform the last upgrade of buster~~
- [x] ~~reboot~~
- [X] upgrade to bullseye
- [X] reboot
- [X] check everything is going well after the reboot
- [X] ~~start and enable the postgresql servers~~
- [X] check the replication to somerset is ok
- [X] reactivate puppet
- [x] restart stopped services
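For the two "check the replication" items, one possible way to put a number on the lag is to compare WAL positions (LSNs). This is only a sketch, not the procedure that was actually followed: the helper names `lsn_to_bytes` and `lsn_lag_bytes` are hypothetical, and it assumes the two LSNs were collected separately, for example with `pg_current_wal_lsn()` on the publisher and the `latest_end_lsn` column of `pg_stat_subscription` on the subscriber.

```shell
# Hypothetical helpers (not part of any SWH tooling).

# Convert a PostgreSQL LSN such as 258A0/2A30E0E8 to an absolute byte position.
# An LSN is two hexadecimal numbers: the high and the low 32 bits of the position.
lsn_to_bytes() {
    lsn_hi=$(printf '%d' "0x${1%%/*}")
    lsn_lo=$(printf '%d' "0x${1##*/}")
    echo $(( $lsn_hi * 4294967296 + $lsn_lo ))
}

# Number of bytes by which the second LSN trails the first one.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

Called as `lsn_lag_bytes <publisher_lsn> <subscriber_lsn>`, a result that stays close to zero suggests the subscriber is keeping up. PostgreSQL can do the same subtraction server side with `pg_wal_lsn_diff()`.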
Migrated from T3801 (view on Phabricator)
Activity
- Vincent Sellier mentioned in issue #3579 (closed)
- Phabricator Migration user marked this issue as related to #3579 (closed)
- Vincent Sellier added Component upgrades priority:Normal labels
- Author Maintainer
The following minor postgresql upgrades will be performed during the upgrade:
- somerset: postgresql 13.4 -> 13.5 [1]
A dump/restore is not required for those running 13.X.
- belvedere:
- 11.14-0 -> 11.14-1 (indexer db)
- 12.8-1 -> 12.9-1 [2] (other dbs)
A dump/restore is not required for those running 12.X.
- db1:
- 12.8-1 -> 12.9-1 [2]
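The "no dump/restore required" statements above rely on each upgrade staying within the same major release. A minimal sketch of that check, assuming PostgreSQL 10 or later (where the major version is the first number of the version string); `pg_major` and `is_minor_upgrade` are hypothetical helper names:

```shell
# Hypothetical helpers, for illustration only.

# Print the major version of a PostgreSQL (>= 10) version string,
# e.g. 12.9-1 -> 12, 11.14-0+deb10u1 -> 11.
pg_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeed when both versions belong to the same major release,
# i.e. the upgrade is a minor one.
is_minor_upgrade() {
    [ "$(pg_major "$1")" = "$(pg_major "$2")" ]
}
```

For example, `is_minor_upgrade 12.8-1 12.9-1 && echo "minor upgrade"` prints the message, while `is_minor_upgrade 13.5 14.1` fails because the major version changes.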
- Vincent Sellier changed the description
- Antoine R. Dumont changed title from migrate database servers to bullseye to Migrate database servers to bullseye
- Antoine R. Dumont changed the description
- Phabricator Migration user marked this issue as related to #3813 (closed)
- Vincent Sellier changed the description
- Vincent Sellier assigned to @vsellier
- Vincent Sellier added state:wip label
- Vincent Sellier changed title from Migrate database servers to bullseye to Migrate production database servers to bullseye
- Vincent Sellier changed the description
- Vincent Sellier marked the checklist item disable puppet as completed
- Vincent Sellier marked the checklist item stop and disable postgresql as completed
- Author Maintainer
somerset
on moma:
- puppet disabled
root@moma:/etc/softwareheritage/storage# puppet agent --disable '#3801 upgrade database servers'
- storage configuration updated to use the belvedere database, and service restarted
on somerset:
- last upgrade of buster applied:
root@somerset:~# apt list --upgradable
Listing... Done
libpq5/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
pgbouncer/buster-pgdg 1.16.1-1.pgdg100+1 amd64 [upgradable from: 1.16.0-1.pgdg100+1]
postgresql-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-13/buster-pgdg 13.5-1.pgdg100+1 amd64 [upgradable from: 13.4-4.pgdg100+1]
postgresql-14/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
postgresql-client-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-client-13/buster-pgdg 13.5-1.pgdg100+1 amd64 [upgradable from: 13.4-4.pgdg100+1]
postgresql-client-14/buster-pgdg 14.1-1.pgdg100+1 amd64 [upgradable from: 14.0-1.pgdg100+1]
postgresql-client-common/buster-pgdg 232.pgdg100+1 all [upgradable from: 231.pgdg100+1]
postgresql-common/buster-pgdg 232.pgdg100+1 all [upgradable from: 231.pgdg100+1]
postgresql-plperl-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql-plpython3-11/buster-pgdg 11.14-1.pgdg100+1 amd64 [upgradable from: 11.14-0+deb10u1]
postgresql/buster-pgdg 14+232.pgdg100+1 all [upgradable from: 14+231.pgdg100+1]
root@somerset:~# apt upgrade
- postgresql has restarted correctly
2021-12-21 08:21:01 UTC [932629]: [3-1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-12-21 08:21:01 UTC [932629]: [4-1] LOG: listening on IPv6 address "::1", port 5433
2021-12-21 08:21:01 UTC [932629]: [5-1] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-12-21 08:21:01 UTC [932629]: [6-1] LOG: listening on IPv4 address "192.168.100.103", port 5433
2021-12-21 08:21:01 UTC [932629]: [7-1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2021-12-21 08:21:01 UTC [932631]: [1-1] LOG: database system was shut down at 2021-12-21 08:20:53 UTC
2021-12-21 08:21:01 UTC [932631]: [2-1] LOG: recovered replication state of node 1 to 258A0/2A30E0E8
2021-12-21 08:21:01 UTC [932629]: [8-1] LOG: database system is ready to accept connections
2021-12-21 08:21:01 UTC [932638]: [1-1] LOG: logical replication apply worker for subscription "softwareheritage_replica" has started
- reboot: not needed because only postgresql was updated
- upgrade to bullseye
root@somerset:/etc# uptime
08:33:09 up 10 days, 17:36, 2 users, load average: 2.10, 2.27, 2.20
root@somerset:/etc# puppet agent --disable '#3801'
root@somerset:/etc# sed -i -e 's/buster/bullseye/' /etc/apt/sources.list.d/*
root@somerset:/etc# sed -i -e 's,bullseye/updates,bullseye-security,' /etc/apt/sources.list.d/debian-security.list
root@somerset:/etc# git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: apt/sources.list.d/backports.list
modified: apt/sources.list.d/debian-security.list
modified: apt/sources.list.d/debian-updates.list
modified: apt/sources.list.d/debian.list
modified: apt/sources.list.d/hwraid_levert.list
modified: apt/sources.list.d/icinga-stable-release.list
modified: apt/sources.list.d/pgdg.list
modified: apt/sources.list.d/softwareheritage.list
no changes added to commit (use "git add" and/or "git commit -a")
root@somerset:/etc# grep bullseye-security /etc/apt/sources.list.d/debian-security.list
deb http://deb.debian.org/debian-security/ bullseye-security main
root@somerset:/etc# git add .
root@somerset:/etc# git commit -m "#3801: Migrate sources.list to bullseye"
root@somerset:/etc# CMD="apt -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold"
root@somerset:/etc# export DEBIAN_FRONTEND=noninteractive
root@somerset:/etc# $CMD upgrade -y
root@somerset:/etc# $CMD dist-upgrade -y
- reboot
root@somerset:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
- execute apt autoremove
- reactivate and execute puppet
root@somerset:~# puppet agent --enable; puppet agent --test
...
Notice: Applied catalog in 26.27 seconds
- postgresql is running correctly:
root@somerset:~# systemctl status postgresql@13-replica
* postgresql@13-replica.service - PostgreSQL Cluster 13-replica
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Tue 2021-12-21 08:50:46 UTC; 35s ago
Process: 146 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-replica start (code=exited, status=0/SUCCESS)
Main PID: 209 (postgres)
Tasks: 14 (limit: 618987)
Memory: 770.2M
CPU: 16.087s
CGroup: /system.slice/system-postgresql.slice/postgresql@13-replica.service
|-209 /usr/lib/postgresql/13/bin/postgres -D /srv/softwareheritage/postgres/13/replica -c config_file=/etc/postgresql/13/replica/postgresql.conf
|-268 postgres: 13/replica: logger
|-283 postgres: 13/replica: checkpointer
|-284 postgres: 13/replica: background writer
|-285 postgres: 13/replica: walwriter
|-286 postgres: 13/replica: autovacuum launcher
|-287 postgres: 13/replica: stats collector
|-288 postgres: 13/replica: logical replication launcher
|-293 postgres: 13/replica: logical replication worker for subscription 99086720
|-806 postgres: 13/replica: postgres softwareheritage 192.168.100.103(36940) idle
|-807 postgres: 13/replica: guest softwareheritage 192.168.100.103(36942) idle
|-819 postgres: 13/replica: guest softwareheritage 192.168.100.103(36944) idle
|-820 postgres: 13/replica: guest softwareheritage 192.168.100.103(36946) idle
`-823 postgres: 13/replica: autovacuum worker softwareheritage
Dec 21 08:50:43 somerset systemd[1]: Starting PostgreSQL Cluster 13-replica...
Dec 21 08:50:46 somerset systemd[1]: Started PostgreSQL Cluster 13-replica.
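The sources.list rewrite above could be rehearsed on a copy of the directory before touching /etc. The following is only a sketch under assumptions: `migrate_sources` is a hypothetical helper, it relies on GNU `sed -i`, and unlike the commands actually run it applies both substitutions to every `*.list` file rather than the second one to debian-security.list only.

```shell
# Hypothetical helper: rewrite every *.list file in the given directory
# from buster to bullseye, including the renamed security suite
# (buster/updates becomes bullseye-security).
migrate_sources() {
    for list_file in "$1"/*.list; do
        # Skip the unexpanded pattern when the directory has no *.list file.
        [ -e "$list_file" ] || continue
        sed -i -e 's/buster/bullseye/' \
               -e 's,bullseye/updates,bullseye-security,' "$list_file"
    done
}
```

A possible dry run: `cp -a /etc/apt/sources.list.d /tmp/sources-preview; migrate_sources /tmp/sources-preview; diff -ru /etc/apt/sources.list.d /tmp/sources-preview`.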
On moma:
- reconfigure storage to use somerset
- restart the service
- reactivate puppet and launch it to ensure the configuration is correct
root@moma:/etc/softwareheritage/storage# puppet agent --enable; puppet agent --test
- Vincent Sellier marked the checklist item perform the last buster upgrade as completed
- Vincent Sellier marked the checklist item upgrade somerset as completed
- Vincent Sellier marked the checklist item switch the webapp to belvedere as completed
- Vincent Sellier marked the checklist item perform the bullseye upgrade as completed
- Vincent Sellier marked the checklist item reboot as completed
- Vincent Sellier marked the checklist item restart and enable postgresql as completed
- Vincent Sellier marked the checklist item check the replication with belvedere is ok as completed
- Vincent Sellier marked the checklist item switch back the webapp to somerset as completed
- Author Maintainer
Belvedere
A memory alert is logged on the idrac:
Correctable memory error logging disabled for a memory device at location DIMM_A9. Fri 17 Dec 2021 16:15:39
This DIMM will have to be monitored in the future to check whether it is failing.
- before the upgrade:
% uptime
09:10:41 up 277 days, 18:46, 4 users, load average: 41.18, 38.10, 38.51
- stop the workers:
clush -b -w @swh-workers 'set -e; puppet agent --disable "#3801"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop --no-block swh-worker@*; sleep 300; systemctl kill swh-worker@* -s 9'
- stop the indexers
root@pergamon:/etc/clustershell# clush -b -w @azure-workers 'set -e; puppet agent --disable "#3801"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop --no-block swh-worker@*; sleep 300; systemctl kill swh-worker@* -s 9'
- stop the scheduler runners + tmux
root@saatchi:~# puppet agent --disable '#3801'
root@saatchi:~# systemctl stop swh-scheduler*
- In the tmux:
- the visit_type=deb was not running
- stopped:
swhscheduler@saatchi:~$ queue=oneshot3:swh.loader.git.tasks.UpdateGitRepository lister_uuid=860d41f8-d0c0-4733-a4d8-437c386bc31f
sleep=300
config=/etc/softwareheritage/scheduler/listener-runner.yml
while true; do
  for policy in never_visited_oldest_update_first already_visited_order_by_lag; do
    for visit_type in svn hg git; do
      echo "$(date) scheduling $visit_type origins with policy ${policy}"
      SWH_CONFIG_FILENAME=$config swh scheduler -C $config origin send-to-celery --policy $policy --queue $queue --lister-uuid $lister_uuid $visit_type
    done
    echo "$(date) sleep $sleep"
    sleep $sleep
  done
done
swhscheduler@saatchi:~$ lister_name=gitlab.com
lister_uuid=baf89663-feae-4850-a8ec-3a21e699cc0b
queue="oneshot3:swh.loader.git.tasks.UpdateGitRepository"
visit_type=git
sleep=300
while true; do
  for policy in never_visited_oldest_update_first never_visited_oldest_update_first never_visited_oldest_update_first already_visited_order_by_lag; do
    echo "$(date) scheduling $visit_type origins with policy ${policy} to queue ${queue} for lister ${lister_name}"
    SWH_CONFIG_FILENAME=/etc/softwareheritage/scheduler/listener-runner.yml swh scheduler -C /etc/softwareheritage/scheduler/listener-runner.yml origin send-to-celery --lister-uuid $lister_uuid --queue $queue --policy $policy $visit_type
    echo "$(date) sleep $sleep"
    sleep $sleep
  done
done
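A note on the reason passed to `puppet agent --disable` in this kind of command: in a shell, a word that starts with `#` begins a comment, so an unquoted `#3801` hides the rest of the line, including any commands chained after it with `;`. The small demonstration below uses `echo` in place of the real commands; `unquoted_demo` and `quoted_demo` are hypothetical names used only for this illustration.

```shell
# Unquoted: "#3801; echo second" is parsed as a comment,
# so only the first echo runs and it receives no "#3801" argument.
unquoted_demo() {
    sh -c 'echo disable #3801; echo second'
}

# Quoted: "#3801" is passed as a regular argument
# and the command after the semicolon runs as well.
quoted_demo() {
    sh -c 'echo disable "#3801"; echo second'
}
```

`unquoted_demo` prints only `disable`, whereas `quoted_demo` prints `disable #3801` followed by `second`. Quoting the ticket reference therefore seems the safer choice, especially inside a command string handed to clush.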
On belvedere:
- network configuration updated to comment out the physical interfaces
- (last buster upgrade skipped)
- Upgrade to bullseye
- update the network configuration in /etc/network/interfaces and comment out the physical interface declarations
- puppet disabled
root@belvedere:/etc/network# puppet agent --disable '#3801'
- Upgrade to bullseye performed
- everything is ok after the reboot
- Restart the services
- all the services restarted
- all the scheduler services restarted as before
- an upgrade of buster was performed on the azure workers before restarting the services
everything looks good \o/
- Vincent Sellier marked the checklist item connect to the idrac: https://swh9-adm.inria.fr/ as completed
- Vincent Sellier marked the checklist item stop the loaders and listers workers as completed
- Vincent Sellier marked the checklist item stop the indexers as completed
- Vincent Sellier marked the checklist item stop the scheduler runners + those in the tmux in saatchi as completed
- Vincent Sellier marked the checklist item ensure the provenance experiment is stopped as completed
- Vincent Sellier marked the checklist item stop puppet as completed
- Vincent Sellier marked the checklist item upgrade to bullseye as completed
- Vincent Sellier marked the checklist item reboot as completed
- Vincent Sellier marked the checklist item check everything is going well after the reboot as completed
- Vincent Sellier marked the checklist item restart stopped services as completed
- Vincent Sellier marked the checklist item on belvedere: as completed
- Vincent Sellier removed state:wip label
- Vincent Sellier closed