I don't know what made some old contents reappear at the head of the kafka topic. It looks coincident with the kafka compaction of the content topic that happened on 2023-03-20. This caused load issues on the kafka brokers, preventing the main storage from sending some messages.
I've done the following:
added a --check-dst flag to the s3 replayer CLI to do an existence check on S3 before sending a content object
added the SENTRY_DISABLE_LOGGING_EVENTS=true envvar to avoid spamming sentry with "non-existent object" logs
restarted the journal clients (and bumped from 32 to 48 processes)
It looks like the first change quiesced the logs, so the second one isn't really needed. The last change is what's needed for catching up.
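For reference, the --check-dst behavior can be sketched as follows: before copying, probe the destination objstorage and count an already-present object as skipped instead of re-sending it. This is a minimal illustrative sketch; the class and function names (InMemoryObjStorage, replay_object, contains, put) are hypothetical stand-ins, not the actual swh.objstorage.replayer API.

```python
class InMemoryObjStorage:
    """Toy stand-in for an objstorage backend (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def contains(self, obj_id):
        return obj_id in self._objects

    def put(self, obj_id, payload):
        self._objects[obj_id] = payload


def replay_object(obj_id, payload, dst, check_dst, stats):
    """Copy one content object to dst unless it is already there."""
    if check_dst and dst.contains(obj_id):
        # One extra existence request per object, but no redundant upload.
        stats["skipped"] += 1
        return
    dst.put(obj_id, payload)
    stats["copied"] += 1
```

This matches the "N skipped" counters visible in the replayer logs below: objects already present in the destination are counted as skipped rather than failed.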
Mar 22 14:29:29 saam swh[2280152]: INFO:swh.objstorage.replayer.replay:processed 186 content objects in 52.2sec (3.6 obj/sec, 0.5MB/sec) - 0 failed - 14 skipped
Mar 22 14:29:36 saam swh[2214262]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 46.4sec (4.3 obj/sec, 0.2MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:37 saam swh[2280112]: INFO:swh.objstorage.replayer.replay:processed 190 content objects in 53.6sec (3.5 obj/sec, 0.8MB/sec) - 0 failed - 10 skipped
Mar 22 14:29:40 saam swh[2208310]: INFO:swh.objstorage.replayer.replay:processed 185 content objects in 46.6sec (4.0 obj/sec, 0.2MB/sec) - 0 failed - 15 skipped
Mar 22 14:29:41 saam swh[2213546]: INFO:swh.objstorage.replayer.replay:processed 194 content objects in 46.8sec (4.1 obj/sec, 0.1MB/sec) - 0 failed - 6 skipped
Mar 22 14:29:41 saam swh[2280119]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 49.9sec (4.0 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:41 saam swh[2211711]: INFO:swh.objstorage.replayer.replay:processed 198 content objects in 47.5sec (4.2 obj/sec, 0.1MB/sec) - 0 failed - 2 skipped
Mar 22 14:29:42 saam swh[2280147]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 47.1sec (4.3 obj/sec, 0.3MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:42 saam swh[2213714]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 48.7sec (4.1 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:45 saam swh[2213477]: INFO:swh.objstorage.replayer.replay:processed 155 content objects in 40.5sec (3.8 obj/sec, 0.1MB/sec) - 0 failed - 45 skipped
Mar 22 14:29:49 saam swh[2280156]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 45.0sec (4.4 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:51 saam swh[2212770]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 47.9sec (4.2 obj/sec, 0.2MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:51 saam swh[2211572]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 48.7sec (4.1 obj/sec, 0.4MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:53 saam swh[2280185]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 45.9sec (4.4 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:54 saam swh[2211851]: INFO:swh.objstorage.replayer.replay:processed 199 content objects in 50.2sec (4.0 obj/sec, 0.3MB/sec) - 0 failed - 1 skipped
Mar 22 14:29:54 saam swh[2212985]: INFO:swh.objstorage.replayer.replay:processed 199 content objects in 47.3sec (4.2 obj/sec, 0.3MB/sec) - 0 failed - 1 skipped
Only some partitions got old objects re-added. The dashboards show we're catching up again, slowly.
I think it would make sense for --no-check-dst to still check the destination when the object doesn't seem to exist at the source; this would avoid doing two requests per object in the common case.
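The proposed behavior could be sketched like this: fetch from the source first (one request per object in the common case), and only probe the destination on a source miss, to tell an already-replayed object apart from a real failure. Again a hypothetical sketch; the API names (get, contains, put, replay_object) are illustrative, not the real swh.objstorage interface.

```python
class InMemoryObjStorage:
    """Toy stand-in for an objstorage backend (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def contains(self, obj_id):
        return obj_id in self._objects

    def get(self, obj_id):
        # Returns None for a missing object, standing in for a miss.
        return self._objects.get(obj_id)

    def put(self, obj_id, payload):
        self._objects[obj_id] = payload


def replay_object(obj_id, src, dst, stats):
    """Proposed --no-check-dst flow: one source request per object."""
    payload = src.get(obj_id)
    if payload is None:
        # Source miss: only now probe the destination, to distinguish
        # an already-replayed object from a genuinely missing one.
        if dst.contains(obj_id):
            stats["skipped"] += 1
        else:
            stats["failed"] += 1
        return
    dst.put(obj_id, payload)
    stats["copied"] += 1
```

Compared to checking the destination up front for every object, the extra request here only happens on source misses, which should be rare outside of replay-after-compaction situations like this one.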