Most indexers are consuming the journal topics more slowly than messages are produced.
A likely cause is slow storage and/or objstorage (which would make sense: the extrinsic-metadata indexer is very fast, and it is also the only one that does not use the storage and objstorage).
these spikes are highly suspicious. It's clearly not processing 500M visit statuses in 15 seconds. I'm guessing these are statuses that were compacted away before they were consumed.
Or, more likely, some partitions in the consumer group haven't moved for a long time and the related information has been discarded by the kafka consumer group exporter.
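To confirm which partitions are lagging (or have stopped moving entirely), the per-partition offsets can be checked directly with the standard Kafka CLI instead of relying on the consumer group exporter. A minimal sketch follows; the kafka install path, client config file and consumer group name are assumptions to adapt to the actual indexer deployment:

```
# Show current offset, log-end offset and lag for each partition of the group.
# Paths and group name are placeholders for the real indexer journal client group.
/opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server broker1.journal.softwareheritage.org:9093 \
  --command-config /etc/kafka/consumer.properties \
  --describe --group swh.indexer.journal_client.example
```

Partitions whose current offset has not moved between two runs would match the "discarded by the exporter" hypothesis above.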
After some investigation, the origin_intrinsic_metadata worker issue seems to be related to "network" problems.
Here is a summary of the different issues discovered:
A lot of kafka timeouts
A couple of initial connection timeouts to kafka
Stuck connections to the indexer storage on saam
On the indexer worker, there seem to be intermittent network connectivity problems. They are not directly related to the VPN, since kafka streams the data over the public kafka port; both the kafka consumer and filebeat lose connectivity.
When the high CPU consumption occurs, an strace on the indexer process shows it is stuck on a request to the indexer storage.
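For reference, the observation can be reproduced along these lines; the process selection pattern is only illustrative, the actual process name on the workers may differ:

```
# Attach to the indexer process and trace only network-related syscalls;
# a worker stuck on the indexer storage shows a blocked poll/recvfrom on the HTTP socket.
strace -f -tt -e trace=network -p "$(pgrep -f swh.indexer | head -n1)"
```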
Another point: the indexer storage seems to have too few declared workers.
There are often more than 30 concurrent connections, while the number of workers is only 32.
Half of the workers seem to be consumed by the webapp.
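A quick way to compare the number of live connections against the number of running workers is sketched below; the port and the gunicorn process pattern are assumptions to adapt to the actual indexer storage setup on saam:

```
# Count established TCP connections to the indexer storage backend
# (5007 is an assumed port; adjust to the real one).
ss -tn state established '( sport = :5007 )' | tail -n +2 | wc -l

# Compare with the number of gunicorn worker processes actually running
# (the process pattern is an assumption).
pgrep -c -f 'gunicorn.*swh.indexer.storage'
```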
It also seems the VMs have some difficulty reaching the kafka nodes:
```
root@indexer-worker06:~# time telnet broker1.journal.softwareheritage.org 9093
Trying 128.93.166.48...   <---------------- No response
^C
real	0m4.007s
user	0m0.003s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker2.journal.softwareheritage.org 9093
Trying 128.93.166.49...   <---------------- Ok
Connected to broker2.journal.softwareheritage.org.
Escape character is '^]'.
^C
Connection closed by foreign host.
real	0m1.080s
user	0m0.003s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker3.journal.softwareheritage.org 9093
Trying 128.93.166.50...   <---------------- No response
^C
real	0m3.440s
user	0m0.002s
sys	0m0.000s
root@indexer-worker06:~# time telnet broker4.journal.softwareheritage.org 9093
Trying 128.93.166.51...   <---------------- Ok
Connected to broker4.journal.softwareheritage.org.
Escape character is '^]'.
^C
Connection closed by foreign host.
real	0m2.023s
user	0m0.002s
sys	0m0.001s
```
The current B2ms servers cost around $60 per month.
As we are not using the temporary (tmpfs) storage at all, we could test the Dasv5 series (D2as to D96as v5); the equivalent model (2 vCPUs / 8 GiB RAM) is backed by AMD EPYC™ 7763v processors and costs $62.78 per month.
It will probably be faster for a negligible extra cost.
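If we want to try it, the resize can be done from the Azure CLI; the resource group and VM names below are placeholders:

```
# List the sizes this VM can be resized to, then switch it to the Dasv5 equivalent.
az vm list-vm-resize-options --resource-group swh-resource-group --name indexer-worker05 --output table
az vm resize --resource-group swh-resource-group --name indexer-worker05 --size Standard_D2as_v5
```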
There is still the same network connectivity issue, so it seems it is not related to the available burst credits:
```
May 15 10:21:22 indexer-worker05 swh[632]: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f9a19e63320>: Failed to establish a new connection: [Errno 110] Connection timed out')': /api/5/store/
```
```
vsellier@indexer-worker05 ~ % date; time telnet broker1.journal.softwareheritage.org 9093
Mon May 15 10:22:59 UTC 2023
Trying 128.93.166.48...
^C
telnet broker1.journal.softwareheritage.org 9093  0.00s user 0.00s system 0% cpu 46.351 total
```
It seems more general, as vangogh doesn't successfully ping broker1 either.
Configuring the servers' network interfaces with 'Accelerated Networking': no change
tcptraceroute doesn't show anything useful (all '*')
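For the record, the two checks above were along these lines; the NIC and resource group names are placeholders and the exact invocations used may have differed:

```
# Enable Azure "Accelerated Networking" on the worker's NIC (placeholder names).
az network nic update --resource-group swh-resource-group --name indexer-worker06-nic --accelerated-networking true

# TCP traceroute towards one broker on the kafka TLS port;
# in our case every hop only answered with '*'.
tcptraceroute broker1.journal.softwareheritage.org 9093
```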
More connectivity tests show that the connection is unstable only towards the kafka servers and sentry:
indexer -> kafka: unstable
```
indexer-worker06 ~ % while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 1
Mon May 15 16:44:22 UTC 2023: 1
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 1
Mon May 15 16:44:29 UTC 2023: 1
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 1
```
indexer -> sentry: unstable
```
indexer-worker06 ~ % while true; do nc -w 2 -z sentry.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 1
Mon May 15 16:44:22 UTC 2023: 1
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 1
Mon May 15 16:44:29 UTC 2023: 1
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 1
Mon May 15 16:44:36 UTC 2023: 1
Mon May 15 16:44:37 UTC 2023: 0
Mon May 15 16:44:40 UTC 2023: 1
Mon May 15 16:44:43 UTC 2023: 1
Mon May 15 16:44:44 UTC 2023: 0
Mon May 15 16:44:47 UTC 2023: 1
Mon May 15 16:44:50 UTC 2023: 1
Mon May 15 16:44:51 UTC 2023: 0
Mon May 15 16:44:54 UTC 2023: 1
```
indexer -> archive: stable
```
indexer-worker06 ~ % while true; do nc -w 2 -z archive.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:44:16 UTC 2023: 0
Mon May 15 16:44:17 UTC 2023: 0
Mon May 15 16:44:18 UTC 2023: 0
Mon May 15 16:44:19 UTC 2023: 0
Mon May 15 16:44:20 UTC 2023: 0
Mon May 15 16:44:21 UTC 2023: 0
Mon May 15 16:44:22 UTC 2023: 0
Mon May 15 16:44:23 UTC 2023: 0
Mon May 15 16:44:24 UTC 2023: 0
Mon May 15 16:44:25 UTC 2023: 0
Mon May 15 16:44:26 UTC 2023: 0
Mon May 15 16:44:27 UTC 2023: 0
Mon May 15 16:44:28 UTC 2023: 0
Mon May 15 16:44:29 UTC 2023: 0
Mon May 15 16:44:30 UTC 2023: 0
Mon May 15 16:44:31 UTC 2023: 0
Mon May 15 16:44:32 UTC 2023: 0
Mon May 15 16:44:33 UTC 2023: 0
```
ipfs gateway (azure) -> kafka: stable
```
root@ipfs:~# while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 16:41:40 UTC 2023: 0
Mon May 15 16:41:41 UTC 2023: 0
Mon May 15 16:41:42 UTC 2023: 0
Mon May 15 16:41:43 UTC 2023: 0
Mon May 15 16:41:44 UTC 2023: 0
Mon May 15 16:41:45 UTC 2023: 0
Mon May 15 16:41:46 UTC 2023: 0
Mon May 15 16:41:47 UTC 2023: 0
Mon May 15 16:41:49 UTC 2023: 0
Mon May 15 16:41:50 UTC 2023: 0
Mon May 15 16:41:51 UTC 2023: 0
```
Any other host -> kafka: stable
```
➜~» while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:05:01 CEST 2023: 0
Mon May 15 17:05:02 CEST 2023: 0
Mon May 15 17:05:03 CEST 2023: 0
Mon May 15 17:05:04 CEST 2023: 0
Mon May 15 17:05:05 CEST 2023: 0
Mon May 15 17:05:06 CEST 2023: 0
Mon May 15 17:05:07 CEST 2023: 0
```
After a hard restart of a worker (the temporary IP address was recycled by azure), the connection is stable.
the connection to kafka is stable at first, but only for a few minutes:
```
indexer-worker05 ~ % while true; do nc -w 2 -z swh-kafka1.inria.fr 9093; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:06:55 UTC 2023: 0
Mon May 15 17:06:56 UTC 2023: 0
Mon May 15 17:06:57 UTC 2023: 0
Mon May 15 17:06:58 UTC 2023: 0
Mon May 15 17:06:59 UTC 2023: 0
Mon May 15 17:07:00 UTC 2023: 0
Mon May 15 17:07:01 UTC 2023: 0
Mon May 15 17:07:02 UTC 2023: 0
...
Mon May 15 17:12:28 UTC 2023: 1
Mon May 15 17:12:31 UTC 2023: 1
Mon May 15 17:12:32 UTC 2023: 0
Mon May 15 17:12:35 UTC 2023: 1
Mon May 15 17:12:38 UTC 2023: 1
Mon May 15 17:12:39 UTC 2023: 0
Mon May 15 17:12:42 UTC 2023: 1
Mon May 15 17:12:45 UTC 2023: 1
Mon May 15 17:12:48 UTC 2023: 1
Mon May 15 17:12:51 UTC 2023: 1
Mon May 15 17:12:54 UTC 2023: 1
Mon May 15 17:12:57 UTC 2023: 1
Mon May 15 17:13:00 UTC 2023: 1
```
the connection to sentry is stable
```
indexer-worker05 ~ % while true; do nc -w 2 -z sentry.softwareheritage.org 443; ret=$?; echo "$(date): $ret"; sleep 1; done
Mon May 15 17:10:44 UTC 2023: 0
Mon May 15 17:10:45 UTC 2023: 0
Mon May 15 17:10:46 UTC 2023: 0
Mon May 15 17:10:47 UTC 2023: 0
Mon May 15 17:10:48 UTC 2023: 0
Mon May 15 17:10:49 UTC 2023: 0
Mon May 15 17:10:50 UTC 2023: 0
Mon May 15 17:10:51 UTC 2023: 0
Mon May 15 17:10:52 UTC 2023: 0
```
It looks like there is some kind of filtering or QoS somewhere between azure and rocq.
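One way to gather more evidence for that hypothesis would be a long-running TCP trace towards the kafka port, to see at which hop the probes start being dropped. This is only a sketch of a possible next step, not something that was run as part of this investigation:

```
# Send 600 TCP probes (roughly ten minutes) towards the broker on port 9093 and
# report per-hop packet loss; sustained loss at a single hop would point at
# filtering/QoS on that segment of the azure <-> rocq path.
mtr --tcp --port 9093 --report --report-cycles 600 broker1.journal.softwareheritage.org
```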