extid topic is misconfigured in staging and production
The cleanup policy needs to be set to compact, and the partition count to 64 in staging and 256 in production. The replication factor also needs to be increased to 2 in production.
vsellier@journal0 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 64
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
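As the warning notes, altering topic configuration through kafka-topics.sh is deprecated. A non-deprecated equivalent (not part of the original transcript, just a sketch) would split the change across kafka-configs.sh for the cleanup policy and kafka-topics.sh for the partition count:

# Sketch of the non-deprecated equivalent of the command above:
/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
    --entity-type topics --entity-name swh.journal.objects.extid \
    --add-config cleanup.policy=compact
/opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --alter \
    --topic swh.journal.objects.extid --partitions 64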
The backfill is running in staging (launched with the migrated/migration$1114 and migrated/migration$1115 scripts on storage1.staging, on 2021-08-17 at 11:20 UTC):
swhstorage@storage1:~$ ./backfill.sh | tee output.log
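The content of backfill.sh itself is in the pastes referenced above and is not reproduced here. As a purely hypothetical reconstruction (the config path and object-range options are assumptions), it presumably just drives the storage backfiller over the extid object type:

#!/bin/bash
# Hypothetical sketch of backfill.sh; the real script is in
# migrated/migration$1114 and migrated/migration$1115. Assumed to invoke
# the swh storage backfill subcommand over the full object id range.
exec swh storage --config-file /etc/softwareheritage/storage/backfill.yml \
    backfill extid \
    --start-object 0000000000000000000000000000000000000000 \
    --end-object ffffffffffffffffffffffffffffffffffffffff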
Unfortunately, the replication factor can't be changed directly; the partition assignment must be reconfigured to change it.
This was done before increasing the number of partitions, to limit the number of partition moves to perform.
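The reassign.json file used below was not kept in the ticket. Given the current assignment shown in the tool output (a single replica on broker 3), it presumably looked something like the following, where the second broker id is an assumption:

{"version": 1,
 "partitions": [
   {"topic": "swh.journal.objects.extid", "partition": 0, "replicas": [3, 4]}
 ]}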
kafka1 ~ % /opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper $ZK --reassignment-json-file reassign.json --execute
Warning: --zookeeper is deprecated, and will be removed in a future version of Kafka.
Current partition replica assignment

{"version":1,"partitions":[{"topic":"swh.journal.objects.extid","partition":0,"replicas":[3],"log_dirs":["any"]}]}

Save this to use as the --reassignment-json-file option during rollback
Successfully started partition reassignment for swh.journal.objects.extid-0

kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid    PartitionCount: 1    ReplicationFactor: 2    Configs: max.message.bytes=104857600
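Progress of the reassignment can be checked with the --verify mode of the same tool (this step was not captured in the original transcript):

/opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper $ZK \
    --reassignment-json-file reassign.json --verify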
Increase the number of partitions and change the cleanup policy:
kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --zookeeper $ZK --alter --topic swh.journal.objects.extid --config cleanup.policy=compact --partitions 256
WARNING: Altering topic configuration from this script has been deprecated and may be removed in future releases.
         Going forward, please use kafka-configs.sh for this functionality
Updated config for topic swh.journal.objects.extid.
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!

kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid    PartitionCount: 256    ReplicationFactor: 2    Configs: cleanup.policy=compact,max.message.bytes=104857600
The backfill process was interrupted by a restart of kafka on kafka1 (!).
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 12 identical error(s) suppressed)"}
2021-08-18T09:20:05 ERROR swh.journal.writer.kafka FAIL [swh.storage.journal_writer.getty#producer-1] [thrd:kafka1.internal.softwareheritage.org:9092/bootstrap]: kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)
2021-08-18T09:20:05 INFO swh.journal.writer.kafka Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="kafka1.internal.softwareheritage.org:9092/1: Connect to ipv4#192.168.100.201:9092 failed: Connection refused (after 0ms in state CONNECT, 5 identical error(s) suppressed)"}
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [128] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [130] is unknown (partition_cnt 128): ignoring leader (-1) update
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [132] is unknown (partition_cnt 128): ignoring leader (-1) update
...
2021-08-18T09:20:07 WARNING swh.journal.writer.kafka BROKER [swh.storage.journal_writer.getty#producer-1] [thrd:main]: swh.journal.objects.extid [253] is unknown (partition_cnt 128): ignoring leader (-1) update
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 215, in flush
    raise self.delivery_error("Failed deliveries after flush()")
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(Failed deliveries after flush(), [extid 344a2795951fabbf1f898b1a5fc54c4b57293cd5 (Local: Unknown partition)])
2021-08-18T09:20:07 INFO swh.journal.writer.kafka PARTCNT [swh.storage.journal_writer.getty#producer-1] [thrd:main]: Topic swh.journal.objects.extid partition count changed from 256 to 128
...
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid 6e1a1317c35b971ef88e052a8b1b78d57bc71a2e (No delivery before flush() timeout), extid a5052a247a0af7926b8e33224ecf7ab12c148eb5 (No delivery before flush() timeout), extid 4f5ed974e8691d340724782b01bc9bb63781176f (No delivery before flush() timeout)])
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.13.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 145, in backfill
    dry_run=dry_run,
  File "/usr/lib/python3/dist-packages/swh/storage/backfill.py", line 637, in run
    writer.write_additions(object_type, objects)
  File "/usr/lib/python3/dist-packages/swh/storage/writer.py", line 67, in write_additions
    self.journal.write_additions(object_type, values)
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 249, in write_additions
    self.flush()
  File "/usr/lib/python3/dist-packages/swh/journal/writer/kafka.py", line 212, in flush
    "flush() exceeded timeout (%ss)" % self.flush_timeout,
swh.journal.writer.kafka.KafkaDeliveryError: KafkaDeliveryError(flush() exceeded timeout (120s), [extid b3f5a81891b2be4bf487ff1f8418110fd87d1042 (No delivery before flush() timeout), extid 5c165ffa4bb15bde37d0652cee9e19c5f0cda09b (No delivery before flush() timeout)])
The backfill will be restarted from the last positions (we still need to figure out how to do that without it taking too much time).
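One possible approach (a sketch only, not what was actually done; it assumes the ranges processed are recorded in output.log and that the backfiller accepts a --start-object option to resume from an arbitrary position):

# Hypothetical: pick the last 40-hex-digit object id mentioned in the
# backfill log and restart the backfiller from there.
last=$(grep -oE '[0-9a-f]{40}' output.log | tail -n 1)
swh storage backfill extid --start-object "$last"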
The backfill was stopped as the performance was (much) lower than expected (worked around with D6127).
To allow cleaning up the gigantic partition, and considering that we're going to backfill the whole topic anyway, we can temporarily set a time-based retention policy on the topic with the retention.ms setting (to, e.g., one day = 86400000) and add delete to cleanup.policy.
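The exact invocation is not recorded here; with kafka-configs.sh it would look something like this (the bracket syntax lets the value of cleanup.policy contain a comma):

/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
    --entity-type topics --entity-name swh.journal.objects.extid \
    --add-config 'retention.ms=86400000,cleanup.policy=[compact,delete]'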
...
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Found deletable segments with base offsets [0] due to retention time 86400000ms breach (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Scheduling segments for deletion LogSegment(baseOffset=0, size=2720767, lastModifiedTime=1629815520833, largestTime=1629815520702) (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Incremented log start offset to 20623 due to segment deletion (kafka.log.Log)
...
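Once the topic has been fully backfilled, the temporary retention should presumably be reverted so the topic goes back to compaction only; again a sketch, not a command from this ticket:

/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
    --entity-type topics --entity-name swh.journal.objects.extid \
    --delete-config retention.ms
/opt/kafka/bin/kafka-configs.sh --bootstrap-server $SERVER --alter \
    --entity-type topics --entity-name swh.journal.objects.extid \
    --add-config cleanup.policy=compact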