I don't know what made some old contents reappear at the head of the kafka topic. It looks coincident with the kafka compaction of the content topic that happened on 2023-03-20. This caused load issues on the kafka brokers, preventing the main storage from sending some messages.
I've done the following:
added a --check-dst flag to the s3 replayer CLI to do an existence check on S3 before sending a content object
added the SENTRY_DISABLE_LOGGING_EVENTS=true envvar to avoid spamming sentry with "non-existent object" logs
restarted the journal clients (and bumped from 32 to 48 processes)
It looks like the first change quiesced the logs, so the second one isn't really needed. The last change is what's needed for catching up.
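For reference, the --check-dst behavior can be sketched as follows: before copying, probe the destination objstorage and count an already-present object as skipped instead of re-sending it. This is a minimal illustrative sketch; the class and function names (InMemoryObjStorage, replay_object, contains, put) are hypothetical stand-ins, not the actual swh.objstorage.replayer API.

```python
class InMemoryObjStorage:
    """Toy stand-in for an objstorage backend (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def contains(self, obj_id):
        return obj_id in self._objects

    def put(self, obj_id, payload):
        self._objects[obj_id] = payload


def replay_object(obj_id, payload, dst, check_dst, stats):
    """Copy one content object to dst unless it is already there."""
    if check_dst and dst.contains(obj_id):
        # One extra existence request per object, but no redundant upload.
        stats["skipped"] += 1
        return
    dst.put(obj_id, payload)
    stats["copied"] += 1
```

This matches the "N skipped" counters visible in the replayer logs below: objects already present in the destination are counted as skipped rather than failed.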
Mar 22 14:29:29 saam swh[2280152]: INFO:swh.objstorage.replayer.replay:processed 186 content objects in 52.2sec (3.6 obj/sec, 0.5MB/sec) - 0 failed - 14 skipped
Mar 22 14:29:36 saam swh[2214262]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 46.4sec (4.3 obj/sec, 0.2MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:37 saam swh[2280112]: INFO:swh.objstorage.replayer.replay:processed 190 content objects in 53.6sec (3.5 obj/sec, 0.8MB/sec) - 0 failed - 10 skipped
Mar 22 14:29:40 saam swh[2208310]: INFO:swh.objstorage.replayer.replay:processed 185 content objects in 46.6sec (4.0 obj/sec, 0.2MB/sec) - 0 failed - 15 skipped
Mar 22 14:29:41 saam swh[2213546]: INFO:swh.objstorage.replayer.replay:processed 194 content objects in 46.8sec (4.1 obj/sec, 0.1MB/sec) - 0 failed - 6 skipped
Mar 22 14:29:41 saam swh[2280119]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 49.9sec (4.0 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:41 saam swh[2211711]: INFO:swh.objstorage.replayer.replay:processed 198 content objects in 47.5sec (4.2 obj/sec, 0.1MB/sec) - 0 failed - 2 skipped
Mar 22 14:29:42 saam swh[2280147]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 47.1sec (4.3 obj/sec, 0.3MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:42 saam swh[2213714]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 48.7sec (4.1 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:45 saam swh[2213477]: INFO:swh.objstorage.replayer.replay:processed 155 content objects in 40.5sec (3.8 obj/sec, 0.1MB/sec) - 0 failed - 45 skipped
Mar 22 14:29:49 saam swh[2280156]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 45.0sec (4.4 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:51 saam swh[2212770]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 47.9sec (4.2 obj/sec, 0.2MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:51 saam swh[2211572]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 48.7sec (4.1 obj/sec, 0.4MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:53 saam swh[2280185]: INFO:swh.objstorage.replayer.replay:processed 200 content objects in 45.9sec (4.4 obj/sec, 0.1MB/sec) - 0 failed - 0 skipped
Mar 22 14:29:54 saam swh[2211851]: INFO:swh.objstorage.replayer.replay:processed 199 content objects in 50.2sec (4.0 obj/sec, 0.3MB/sec) - 0 failed - 1 skipped
Mar 22 14:29:54 saam swh[2212985]: INFO:swh.objstorage.replayer.replay:processed 199 content objects in 47.3sec (4.2 obj/sec, 0.3MB/sec) - 0 failed - 1 skipped
Only some partitions got old objects re-added. The dashboards show we're catching up again, slowly.
I think it would make sense for --no-check-dst to still check the destination when the object doesn't seem to exist at the source; this would avoid doing two requests per object in the common case.
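The proposed behavior could be sketched like this: fetch from the source first (one request per object in the common case), and only probe the destination on a source miss, to tell an already-replayed object apart from a real failure. Again a hypothetical sketch; the API names (get, contains, put, replay_object) are illustrative, not the real swh.objstorage interface.

```python
class InMemoryObjStorage:
    """Toy stand-in for an objstorage backend (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def contains(self, obj_id):
        return obj_id in self._objects

    def get(self, obj_id):
        # Returns None for a missing object, standing in for a miss.
        return self._objects.get(obj_id)

    def put(self, obj_id, payload):
        self._objects[obj_id] = payload


def replay_object(obj_id, src, dst, stats):
    """Proposed --no-check-dst flow: one source request per object."""
    payload = src.get(obj_id)
    if payload is None:
        # Source miss: only now probe the destination, to distinguish
        # an already-replayed object from a genuinely missing one.
        if dst.contains(obj_id):
            stats["skipped"] += 1
        else:
            stats["failed"] += 1
        return
    dst.put(obj_id, payload)
    stats["copied"] += 1
```

Compared to checking the destination up front for every object, the extra request here only happens on source misses, which should be rare outside of replay-after-compaction situations like this one.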