Unbreak journal clients
Since the migration to confluent-kafka-python, journal clients have repeatedly been getting stuck for no apparent reason. Debug logs don't seem to show anything relevant.
My observations so far:
- a consumer group running confluent-kafka-python may block other consumer groups that don't use it (e.g. if they use python-kafka)
- when the number of consumers is low (<10), having fewer topics or more consumers seems to unstick the consumers (unsure whether that's also true with a higher number of consumers)
- most days, switching the same consumer between confluent-kafka-python and python-kafka results very clearly in "confluent-kafka-python = broken, python-kafka = works", but some days both work, and on some rare days neither does
Consumer logs are full of errors like this, regardless of whether they are stuck or not:
Oct 10 12:05:03 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61043ms, timeout #0)
Oct 10 12:05:03 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:05 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61053ms, timeout #0)
Oct 10 12:05:05 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:14 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61039ms, timeout #0)
Oct 10 12:05:14 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61044ms, timeout #0)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #1)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #2)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #3)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #4)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 8 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61049ms, timeout #0)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #1)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #2)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 3 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:24 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61045ms, timeout #0)
Oct 10 12:05:24 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Though it seems like there are fewer errors about esnode1 when they are not stuck.
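To check the esnode1 impression with numbers rather than by eye, the timeouts could be tallied per broker and request type. The following is only a rough sketch: the `count_timeouts` helper and its regular expression are hypothetical, written against the log format pasted above, and would need adjusting if the format differs.

```python
import re
from collections import Counter

# Hypothetical pattern, derived only from the REQTMOUT lines pasted above:
#   "...]: <broker>:<port>/<id>: Timed out <Something>Request in flight (...)"
# The WARNING summary lines ("Timed out 1 in-flight, ...") are deliberately
# not matched, so each timed-out request is counted once.
REQTMOUT_RE = re.compile(
    r"REQTMOUT .*?\]: (?P<broker>[^\s:]+):\d+/\d+: "
    r"Timed out (?P<request>\w+Request) in flight"
)


def count_timeouts(lines):
    """Count timed-out requests per (broker, request type) in log lines."""
    counts = Counter()
    for line in lines:
        match = REQTMOUT_RE.search(line)
        if match:
            counts[(match.group("broker"), match.group("request"))] += 1
    return counts
```

Running this over logs captured while the consumers are stuck and again while they are not, then comparing the counts for esnode1 against the other brokers, would show whether the difference is real.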
Migrated from T2034 (view on Phabricator)