
Database replication lag keeps growing on somerset

It seems that replication of the softwareheritage database from belvedere to somerset has not been working for at least a week.

This is what we get when we count the number of origins on belvedere:

antoine@guggenheim:~$ psql service=swh
psql (11.5 (Debian 11.5-1+deb10u1), server 11.3 (Debian 11.3-1.pgdg90+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-CHACHA20-POLY1305, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> select count(*) from origin;
  count   
----------
 90429442
(1 row)

The same query on somerset, however, returns the following:

antoine@guggenheim:~$ psql service=swh-replica
psql (11.5 (Debian 11.5-1+deb10u1), server 11.3 (Debian 11.3-1.pgdg90+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

softwareheritage=> select count(*) from origin;
  count   
----------
 90233725
(1 row)

So the replica is currently missing 195717 origins, and this number keeps growing: yesterday it was 171809.
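To repeat this comparison without opening two interactive psql sessions, a small script can run the same `select count(*) from origin` query on both databases and print the difference. This is a minimal sketch, assuming psycopg2 is installed and that the libpq service names `swh` and `swh-replica` used above are defined (e.g. in `~/.pg_service.conf`); the function names `count_origins`, `replication_gap` and `check` are hypothetical helpers, not part of any Software Heritage tool.

```python
from contextlib import closing


def count_origins(conn) -> int:
    """Run `select count(*) from origin` on any DB-API 2.0 connection."""
    cur = conn.cursor()
    try:
        cur.execute("select count(*) from origin")
        return cur.fetchone()[0]
    finally:
        cur.close()


def replication_gap(primary_count: int, replica_count: int) -> int:
    """Number of origins present on the primary but not (yet) on the replica."""
    return primary_count - replica_count


def check(primary_dsn: str = "service=swh",
          replica_dsn: str = "service=swh-replica") -> int:
    """Count origins on both databases and print how many the replica lacks.

    Assumes psycopg2 is installed and both libpq service names exist,
    as in the psql sessions above.
    """
    import psycopg2  # imported lazily so the helpers above work with any DB-API driver

    # closing() makes sure both connections are closed; the read-only
    # transaction psycopg2 opens implicitly is simply rolled back.
    with closing(psycopg2.connect(primary_dsn)) as primary, \
         closing(psycopg2.connect(replica_dsn)) as replica:
        gap = replication_gap(count_origins(primary), count_origins(replica))
    print(f"replica is missing {gap} origins")
    return gap
```

With the counts reported here, `replication_gap(90429442, 90233725)` gives 195717, i.e. 23908 more missing origins than yesterday's 171809. Since the two `count(*)` queries run at slightly different moments on a table of about 90 million rows, the result is an approximation, which is still enough to follow the trend from day to day.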

The broken replication affects the Software Heritage web application, which reads from the database hosted on somerset. For instance, all 'Save code now' requests submitted during the last week are still marked as scheduled even though they were executed successfully: since the newly ingested origins are missing from the replica database, no visit date can be found for them, hence the erroneous status.
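The failure mode can be illustrated with a simplified, hypothetical model of how a request status might be derived from the replica: a request only counts as done once a visit of the origin newer than the request is visible. The function `save_request_status`, its arguments and the `"succeeded"` value are assumptions for illustration, not the actual web application code.

```python
from datetime import datetime
from typing import Optional


def save_request_status(request_date: datetime,
                        last_visit_date: Optional[datetime]) -> str:
    """Hypothetical model of the status shown for a 'Save code now' request.

    `last_visit_date` is the latest visit date found for the origin in the
    replica, or None when the origin is not present there at all.
    """
    if last_visit_date is None or last_visit_date < request_date:
        # The origin, or its new visit, has not reached the replica yet.
        return "scheduled"
    return "succeeded"  # hypothetical status name
```

While replication is stalled, the lookup on somerset keeps returning no visit for newly ingested origins, so this model reports "scheduled" indefinitely even though the visit completed on belvedere.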


Migrated from T2016 on Phabricator.