massmoca lost synchronization with albertina

assigned to @olasd

mentioned in commit swh/infra/puppet/puppet-swh-site@6aa80be7

mentioned in commit swh/infra/ci-cd/swh-charts@d1d3a962

massmoca's database has been reinstalled with pg16, and the replication started.

As usual, I first tried to do that with the whole schema enabled, including indexes, which makes the initial sync take forever.

This morning I cleaned up the replica database, removed all indexes, and launched the initial sync again (it is already further along than the first attempt was after 15 hours of work...)

Could you describe exactly what you did, for the record, or point to another issue if one exists? Thanks a lot.

The initial sync of all tables to massmoca is complete, now the indexes need to be rebuilt.

The indexes have been rebuilt, and the replica is struggling to recover the lag, which hovers around 200 GB...

Looks like the replica is now running properly and has remained up to date with the primary for a week (including the weekend's backup window).

And indeed, the bootstrap steps for the replica don't seem to be documented, so let's summarize:

1. Install the latest postgresql version with reasonable settings (effective_cache_size, shared_buffers, etc.). Some relevant settings for logical replication since pg 16 are:

max_logical_replication_workers = 16	# taken from max_worker_processes
max_sync_workers_per_subscription = 6	# taken from max_logical_replication_workers
max_parallel_apply_workers_per_subscription = 8	# taken from max_logical_replication_workers
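To confirm that the running server has actually picked these values up, pg_settings can be queried on the replica. This is only a sketch; the pending_restart column is included because some of these settings only take effect after a restart.

```sql
-- Run on the replica after editing the configuration.
-- pending_restart = true means the value was changed in the config
-- but the server still needs a restart to apply it.
select name, setting, pending_restart
from pg_settings
where name in (
    'max_worker_processes',
    'max_logical_replication_workers',
    'max_sync_workers_per_subscription',
    'max_parallel_apply_workers_per_subscription'
)
order by name;
```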

2. Create an empty database with the read_replica flavor using swh tooling. I got a bit lost between swh db create, swh db init-admin and swh db init, but eventually I ended up with an appropriate database with an empty schema.

I think I ended up using something along the lines of:

createdb -p 5433 -O swhstorage softwareheritage
swh db init-admin storage -d "user=postgres host=localhost port=5433 dbname=softwareheritage password=xxx"
swh db init storage --flavor read_replica -d "user=swhstorage host=localhost port=5433 dbname=softwareheritage password=$PGPASS"

3. Make sure pgbouncer works with the new database (authentication needs some specific functions in the postgres database).
4. Create the subscription in the new database (which triggers the initial sync, up to max_sync_workers_per_subscription parallel tables):

create subscription softwareheritage_massmoca
  CONNECTION 'host=albertina.internal.softwareheritage.org port=5433 user=postgres dbname=softwareheritage'
  PUBLICATION softwareheritage
  WITH (slot_name = 'softwareheritage_massmoca', create_slot = true, streaming = true, copy_data = true, binary = true);

The primary server will accumulate WALs for the whole duration of every single table sync, so you need to be ready to accommodate that (this is especially problematic if a backup is running, as that increases WAL traffic significantly).
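To watch how much WAL the primary is holding back during the sync, a query along these lines can be run on albertina. It is only a sketch: it lists every logical slot rather than filtering on softwareheritage_massmoca, on the assumption that the per-table sync workers use additional slots of their own that also retain WAL.

```sql
-- Run on the primary (pg_current_wal_lsn() is not available on a server in recovery).
-- retained_wal: WAL the slot still forces the primary to keep on disk.
-- unconfirmed:  WAL the subscriber has not yet confirmed as flushed.
select slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as unconfirmed
from pg_replication_slots
where slot_type = 'logical'
order by slot_name;
```

Comparing retained_wal against the free space on the WAL volume gives an idea of how much headroom is left before the backup window starts.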

5. Monitor the initial sync of tables. The progress indicator here is table size, so it's a bit of a wild guess. Overall, without indexes, the initial sync takes a few days; the content and directory tables take the longest.

select pg_stat_subscription.*, relname from pg_stat_subscription left join pg_class on pg_class.oid = relid;
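pg_stat_subscription only shows the workers that are currently running. For a per-table view of what is left to do, the pg_subscription_rel catalog on the replica records a sync state for each table. A sketch, assuming a single subscription in the database (otherwise filter on srsubid):

```sql
-- Run on the replica. srsubstate codes:
--   i = initialize, d = data is being copied, f = finished table copy,
--   s = synchronized, r = ready (normal replication)
select srsubstate, count(*) as tables
from pg_subscription_rel
group by srsubstate
order by srsubstate;

-- Tables that have not reached the ready state yet.
select srrelid::regclass::text as table_name, srsubstate
from pg_subscription_rel
where srsubstate <> 'r'
order by table_name;
```

The initial sync is over once the second query returns no rows.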

6. Notice that the indexes are making the initial replication take forever, and redo steps 2-4, dropping the index creation script (60-indexes.sql) from swh db init.
7. When every table has finished syncing, create the associated indexes by running the 60-indexes.sql commands manually. Running all the create index commands for the same table in parallel allows postgresql to read the data only once (but, counterintuitively, that only works for create index without concurrently).

You can monitor index creation progress with:

select pid,
       pg_stat_progress_create_index.datname,
       command,
       phase,
       case blocks_total when 0 then 0 else (blocks_done * 100.0) / (blocks_total * 1.0) end as blocks_pc,
       case tuples_total when 0 then 0 else (tuples_done * 100.0) / (tuples_total * 1.0) end as tuples_pc,
       relname,
       query
from pg_stat_progress_create_index
left join pg_class on pg_class.oid = relid
left join pg_stat_activity using (pid);

thanks

closed
