Most indexers are consuming the journal topics more slowly than messages are produced.
A likely cause is slow storage and/or objstorage (which would make sense: the extrinsic-metadata indexer is very fast, and it is also the only one that does not use the storage and objstorage).
these spikes are highly suspicious. It's clearly not processing 500M visit statuses in 15 seconds. I'm guessing these are statuses that were compacted away before they were consumed.
Or, more likely, some partitions in the consumer group haven't moved for a long time and the related information has been discarded by the kafka consumer group exporter.
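To confirm which partitions are lagging (or have stopped moving entirely), the per-partition offsets can be checked directly with the standard Kafka CLI instead of relying on the consumer group exporter. A minimal sketch follows; the kafka install path, client config file and consumer group name are assumptions to adapt to the actual indexer deployment:

```
# Show current offset, log-end offset and lag for each partition of the group.
# Paths and group name are placeholders for the real indexer journal client group.
/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.journal.softwareheritage.org:9093 \
  --command-config /etc/kafka/consumer.properties \
  --describe --group swh.indexer.journal_client.example
```

Partitions whose current offset has not moved between two runs would match the "discarded by the exporter" hypothesis above.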
After some investigation, the origin_intrinsic_metadata worker issue seems to be related to "network" problems.
Here is a summary of the different issues discovered:
A lot of kafka timeouts
A couple of initial connection timeouts to kafka
Stuck connections to the indexer storage on saam
On the indexer worker, there seem to be intermittent network connectivity problems. They are not directly related to the VPN, since kafka streams the data over the public kafka port; both the kafka consumer and filebeat lose connectivity.
When the high CPU consumption occurs, an strace on the indexer process shows it is stuck on a request to the indexer storage.
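For reference, the observation can be reproduced along these lines; the process selection pattern is only illustrative, the actual process name on the workers may differ:

```
# Attach to the indexer process and trace only network-related syscalls;
# a worker stuck on the indexer storage shows a blocked poll/recvfrom on the HTTP socket.
strace -f -tt -e trace=network -p "$(pgrep -f swh.indexer | head -n1)"
```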
Another point: the indexer storage seems to have too few declared workers.
There are often more than 30 concurrent connections, while the number of workers is only 32.
Half of the workers seem to be consumed by the webapp.
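A quick way to compare the number of live connections against the number of running workers is sketched below; the port and the gunicorn process pattern are assumptions to adapt to the actual indexer storage setup on saam:

```
# Count established TCP connections to the indexer storage backend
# (5007 is an assumed port; adjust to the real one).
ss -tn state established '( sport = :5007 )' | tail -n +2 | wc -l

# Compare with the number of gunicorn worker processes actually running
# (the process pattern is an assumption).
pgrep -c -f 'gunicorn.*swh.indexer.storage'
```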
It also seems the VMs have some difficulty reaching the kafka nodes:
```
root@indexer-worker06:~# time telnet broker1.journal.softwareheritage.org 9093
Trying 128.93.166.48...   <---------------- No response
^C
real	0m4.007s
user	0m0.003s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker2.journal.softwareheritage.org 9093
Trying 128.93.166.49...   <---------------- Ok
Connected to broker2.journal.softwareheritage.org.
Escape character is '^]'.
^C
Connection closed by foreign host.
real	0m1.080s
user	0m0.003s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker3.journal.softwareheritage.org 9093
Trying 128.93.166.50...   <---------------- No response
^C
real	0m3.440s
user	0m0.002s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker4.journal.softwareheritage.org 9093
Trying 128.93.166.51...   <---------------- Ok
Connected to broker4.journal.softwareheritage.org.
Escape character is '^]'.
^C
Connection closed by foreign host.
real	0m2.023s
user	0m0.002s
sys	0m0.001s
```
The current B2ms servers cost around $60 per month.
As we are not using the temporary (tmpfs) storage at all, we could test the Dasv5 series (D2as to D96as v5); the equivalent model (2 vCPUs / 8 GiB RAM) is backed by AMD EPYC™ 7763v processors and costs $62.78 per month.
It will probably be faster for a negligible extra cost.
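If we want to try it, the resize can be done from the Azure CLI; the resource group and VM names below are placeholders:

```
# List the sizes this VM can be resized to, then switch it to the Dasv5 equivalent.
az vm list-vm-resize-options --resource-group swh-resource-group --name indexer-worker05 --output table
az vm resize --resource-group swh-resource-group --name indexer-worker05 --size Standard_D2as_v5
```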
There is still the same network connectivity issue, so it seems it is not related to the available burst credits:
```
May 15 10:21:22 indexer-worker05 swh[632]: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f9a19e63320>: Failed to establish a new connection: [Errno 110] Connection timed out')': /api/5/store/
```
```
vsellier@indexer-worker05 ~ % date; time telnet broker1.journal.softwareheritage.org 9093
Mon May 15 10:22:59 UTC 2023
Trying 128.93.166.48...
^C
telnet broker1.journal.softwareheritage.org 9093  0.00s user 0.00s system 0% cpu 46.351 total
```
It seems more general, as vangogh doesn't successfully ping broker1 either.
Configuring the servers' network interfaces with 'Accelerated Networking': no change
tcptraceroute doesn't show anything useful (all '*')
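For the record, the two checks above were along these lines; the NIC and resource group names are placeholders and the exact invocations used may have differed:

```
# Enable Azure "Accelerated Networking" on the worker's NIC (placeholder names).
az network nic update --resource-group swh-resource-group --name indexer-worker06-nic --accelerated-networking true

# TCP traceroute towards one broker on the kafka TLS port;
# in our case every hop only answered with '*'.
tcptraceroute broker1.journal.softwareheritage.org 9093
```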
More connectivity tests show that the connection is unstable only towards the kafka servers and sentry:
indexer -> kafka: unstable
```
indexer-worker06 ~ % while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 1
Mon May 15 16:44:22 UTC 2023: 1
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 1
Mon May 15 16:44:29 UTC 2023: 1
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 1
```
indexer -> sentry: unstable
```
indexer-worker06 ~ % while true; do nc -w 2 -z sentry.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 1
Mon May 15 16:44:22 UTC 2023: 1
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 1
Mon May 15 16:44:29 UTC 2023: 1
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 1
Mon May 15 16:44:36 UTC 2023: 1
Mon May 15 16:44:37 UTC 2023: 0
Mon May 15 16:44:40 UTC 2023: 1
Mon May 15 16:44:43 UTC 2023: 1
Mon May 15 16:44:44 UTC 2023: 0
Mon May 15 16:44:47 UTC 2023: 1
Mon May 15 16:44:50 UTC 2023: 1
Mon May 15 16:44:51 UTC 2023: 0
Mon May 15 16:44:54 UTC 2023: 1
```
indexer -> archive: stable
```
indexer-worker06 ~ % while true; do nc -w 2 -z archive.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:17 UTC 2023: 0
Mon May 15 16:44:18 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 0
Mon May 15 16:44:20 UTC 2023: 0
Mon May 15 16:44:21 UTC 2023: 0
Mon May 15 16:44:22 UTC 2023: 0
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:24 UTC 2023: 0
Mon May 15 16:44:25 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 0
Mon May 15 16:44:27 UTC 2023: 0
Mon May 15 16:44:28 UTC 2023: 0
Mon May 15 16:44:29 UTC 2023: 0
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:31 UTC 2023: 0
Mon May 15 16:44:32 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 0
```
ipfs gateway (azure) -> kafka: stable
```
root@ipfs:~# while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:41:40 UTC 2023: 0
Mon May 15 16:41:41 UTC 2023: 0
Mon May 15 16:41:42 UTC 2023: 0
Mon May 15 16:41:43 UTC 2023: 0
Mon May 15 16:41:44 UTC 2023: 0
Mon May 15 16:41:45 UTC 2023: 0
Mon May 15 16:41:46 UTC 2023: 0
Mon May 15 16:41:47 UTC 2023: 0
Mon May 15 16:41:49 UTC 2023: 0
Mon May 15 16:41:50 UTC 2023: 0
Mon May 15 16:41:51 UTC 2023: 0
```
Any other host -> kafka: stable
```
➜~» while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:05:01 CEST 2023: 0
Mon May 15 17:05:02 CEST 2023: 0
Mon May 15 17:05:03 CEST 2023: 0
Mon May 15 17:05:04 CEST 2023: 0
Mon May 15 17:05:05 CEST 2023: 0
Mon May 15 17:05:06 CEST 2023: 0
Mon May 15 17:05:07 CEST 2023: 0
```
After a hard restart of a worker (the temporary IP address was recycled by azure), the connection is stable.
the connection to kafka is stable at first, but only for a few minutes:
```
indexer-worker05 ~ % while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:06:55 UTC 2023: 0
Mon May 15 17:06:56 UTC 2023: 0
Mon May 15 17:06:57 UTC 2023: 0
Mon May 15 17:06:58 UTC 2023: 0
Mon May 15 17:06:59 UTC 2023: 0
Mon May 15 17:07:00 UTC 2023: 0
Mon May 15 17:07:01 UTC 2023: 0
Mon May 15 17:07:02 UTC 2023: 0
...
Mon May 15 17:12:28 UTC 2023: 1
Mon May 15 17:12:31 UTC 2023: 1
Mon May 15 17:12:32 UTC 2023: 0
Mon May 15 17:12:35 UTC 2023: 1
Mon May 15 17:12:38 UTC 2023: 1
Mon May 15 17:12:39 UTC 2023: 0
Mon May 15 17:12:42 UTC 2023: 1
Mon May 15 17:12:45 UTC 2023: 1
Mon May 15 17:12:48 UTC 2023: 1
Mon May 15 17:12:51 UTC 2023: 1
Mon May 15 17:12:54 UTC 2023: 1
Mon May 15 17:12:57 UTC 2023: 1
Mon May 15 17:13:00 UTC 2023: 1
```
the connection to sentry is stable
```
indexer-worker05 ~ % while true; do nc -w 2 -z sentry.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:10:44 UTC 2023: 0
Mon May 15 17:10:45 UTC 2023: 0
Mon May 15 17:10:46 UTC 2023: 0
Mon May 15 17:10:47 UTC 2023: 0
Mon May 15 17:10:48 UTC 2023: 0
Mon May 15 17:10:49 UTC 2023: 0
Mon May 15 17:10:50 UTC 2023: 0
Mon May 15 17:10:51 UTC 2023: 0
Mon May 15 17:10:52 UTC 2023: 0
```
It looks like there is some kind of filtering or QoS somewhere between azure and rocq.
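One way to gather more evidence for that hypothesis would be a long-running TCP trace towards the kafka port, to see at which hop the probes start being dropped. This is only a sketch of a possible next step, not something that was run as part of this investigation:

```
# Send 600 TCP probes (roughly ten minutes) towards the broker on port 9093 and
# report per-hop packet loss; sustained loss at a single hop would point at
# filtering/QoS on that segment of the azure <-> rocq path.
mtr --tcp --port 9093 --report --report-cycles 600 broker1.journal.softwareheritage.org
```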