Elasticsearch cluster failure during a rolling restart
During the rolling restart of the cluster, two disk failures crashed esnode1 and prevented the cluster from recovering.
[Copied from a comment] Short term plan:
- Remove old systemlogs indexes older than 1 year to start, but we can go down to 3 months if necessary
- Reactivate shard allocation so that every shard has 1 replica in case of a second node failure
- Launch a long smartctl test on all the disks of each esnode* server
- Contact DELL support to proceed with the replacement of the 2 failing disks (under warranty(?)) [1]
- Try to recover the 16 red indexes if possible; if not, delete them as they are not critical
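The allocation and cleanup steps above can be sketched as shell commands. This is a dry-run sketch that only echoes what would be executed; the index naming pattern (`systemlogs-YYYY.MM.DD`) and the `localhost:9200` endpoint are assumptions, not details from the ticket.

```shell
#!/bin/sh
# Dry-run sketch of the short-term steps: it echoes the commands rather
# than running them. Assumptions (not from the ticket): daily indices
# named systemlogs-YYYY.MM.DD and a cluster reachable on localhost:9200.
ES=localhost:9200

# 1. Re-enable shard allocation (typically disabled during a rolling
#    restart) so every shard gets its replica back.
ALLOC_CMD="curl -XPUT $ES/_cluster/settings -H 'Content-Type: application/json' \
  -d '{\"transient\":{\"cluster.routing.allocation.enable\":\"all\"}}'"
echo "$ALLOC_CMD"

# 2. Compute the deletion cutoff: indices older than 1 year go first
#    (GNU date; the lexicographic order of YYYY.MM.DD matches time order).
CUTOFF=$(date -d '1 year ago' +%Y.%m.%d)
echo "delete systemlogs-* indices with a date suffix below $CUTOFF"
# e.g. curl -XDELETE $ES/systemlogs-<old-date>

# 3. Long SMART self-test on the suspect disks of each esnode* host.
for dev in /dev/sdb /dev/sdc; do
  SMART_CMD="smartctl -t long $dev"
  echo "$SMART_CMD"
done
```

Piping the echoed curl/smartctl lines into `sh` would apply them; keeping the sketch as a dry run avoids touching a degraded cluster by accident.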
Middle term:
- Reconfigure sentry to use its local kafka instance instead of the esnode* kafka cluster (thanks olasd)
- infra/puppet/puppet-swh-site!286, infra/puppet/puppet-swh-site!287: clean up the esnode* kafka/zookeeper instances
- Reclaim the 2 TB disk reserved for the journal (done for esnode1) => #2958 (closed)
- Add a new datadir on elasticsearch using the newly available disk
- Add smartctl monitoring to detect disk failures as soon as possible #2960
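For the "new datadir" item, Elasticsearch accepts multiple data paths in `elasticsearch.yml`. A minimal sketch, assuming hypothetical mount points (the real esnode* paths are not in the ticket):

```yaml
# elasticsearch.yml — sketch only; /srv/elasticsearch/data[12] are
# hypothetical mount points, not the actual esnode* paths.
path.data:
  - /srv/elasticsearch/data1   # existing datadir
  - /srv/elasticsearch/data2   # reclaimed 2 TB journal disk
```

Note that Elasticsearch spreads shards across all listed paths, so removing a path later requires relocating the shards stored on it first.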
[1] sdb serial : K5GJBLTA / sdc serial : K5GV9REA
Migrated from T2888 (view on Phabricator)
Edited by Vincent Sellier