kafka broker crash: "Error while rolling log segment [...] Map failed"
@guillaume has noticed some hangs in the swh-scheduler journal client, which prevent status reporting in the "add forge now" process.
While investigating the latest hang (at or around 2023-06-05 09:30 UTC), we noticed the following crash in the kafka server log on kafka2.internal.softwareheritage.org:
[2023-06-05 09:28:15,315] ERROR Error while rolling log segment for __consumer_offsets-32 in dir /srv/kafka/logdir (kafka.server.LogDirFailureChannel)
java.io.IOException: Map failed
	at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1016)
	at kafka.log.AbstractIndex.$anonfun$resize$1(AbstractIndex.scala:191)
	at kafka.log.AbstractIndex.resize(AbstractIndex.scala:175)
	at kafka.log.AbstractIndex.$anonfun$trimToValidSize$1(AbstractIndex.scala:241)
	at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:241)
	at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:511)
	at kafka.log.LocalLog.$anonfun$roll$9(LocalLog.scala:529)
	at kafka.log.LocalLog.$anonfun$roll$9$adapted(LocalLog.scala:529)
	at scala.Option.foreach(Option.scala:437)
	at kafka.log.LocalLog.$anonfun$roll$2(LocalLog.scala:529)
	at kafka.log.LocalLog.roll(LocalLog.scala:786)
	at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1517)
	at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1503)
	at kafka.log.UnifiedLog.append(UnifiedLog.scala:899)
	at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:740)
	at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1167)
	at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1155)
	at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:947)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
	at scala.collection.mutable.Growable.addAll(Growable.scala:62)
	at scala.collection.mutable.Growable.addAll$(Growable.scala:57)
	at scala.collection.immutable.MapBuilderImpl.addAll(Map.scala:692)
	at scala.collection.immutable.Map$.from(Map.scala:643)
	at scala.collection.immutable.Map$.from(Map.scala:173)
	at scala.collection.MapOps.map(Map.scala:299)
	at scala.collection.MapOps.map$(Map.scala:299)
	at scala.collection.AbstractMap.map(Map.scala:405)
	at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:935)
	at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:593)
	at kafka.coordinator.group.GroupMetadataManager.storeOffsets(GroupMetadataManager.scala:338)
	at kafka.coordinator.group.GroupCoordinator.$anonfun$doCommitOffsets$1(GroupCoordinator.scala:1042)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:225)
	at kafka.coordinator.group.GroupCoordinator.handleCommitOffsets(GroupCoordinator.scala:1021)
	at kafka.server.KafkaApis.handleOffsetCommitRequest(KafkaApis.scala:535)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:183)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.OutOfMemoryError: Map failed
	at java.base/sun.nio.ch.FileChannelImpl.map0(Native Method)
	at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1013)
	... 37 more
This crash then causes the broker to stall (the log directory is reported as failed via LogDirFailureChannel) and replication to restart, which can make clients hang.
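For context: `java.lang.OutOfMemoryError: Map failed` is raised when the JVM's `mmap` call fails, and a common cause (an assumption here, not confirmed from this log alone) is hitting the kernel's per-process limit on memory-mapped areas, since Kafka mmaps an index file for every log segment. A quick sketch of how one might check this limit on the broker host:

```shell
# Kernel limit on the number of memory-mapped areas per process (Linux).
# If a busy broker with many partitions/segments exhausts this limit,
# FileChannelImpl.map fails with "Map failed" as seen in the trace above.
cat /proc/sys/vm/max_map_count

# To compare against the broker's current usage, one could count its
# mappings (hypothetical pgrep pattern; adjust to the actual process):
#   wc -l /proc/"$(pgrep -f kafka.Kafka)"/maps
```

If the mapping count is close to the limit, raising `vm.max_map_count` via sysctl would be one avenue to investigate; insufficient address space or memory overcommit limits are other possible causes of the same error.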