Currently, only 3 of the 4 disks are used and the partition formerly used by kafka is wasted (~2 TB).
Using zfs with the same partitioning scheme on each disk[1] works fine (tested on esnode1).
It can be applied to esnode[2-3] to get a homogeneous configuration and add ~1.8 TB per node.
Interesting reading on how the cluster state is replicated/persisted: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
There seems to be a lot of pressure on the cluster from the shard reallocation. After the first timeout, esnode3 appears to hold all the primary shards while esnode1 retries the recovery again and again until a new timeout occurs.
I will try to temporarily reduce the number of parallel index initializations to relieve some of the pressure on the cluster.
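This kind of throttling goes through the cluster settings API; a rough sketch with illustrative values (the exact settings and values applied at the time are not recorded in this note):

# Hedged sketch: lower the number of concurrent shard recoveries per node (illustrative values)
curl -s -XPUT http://localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 1,
        "cluster.routing.allocation.node_initial_primaries_recoveries": 2
      }}'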
The cause of the problem was high write I/O pressure on esnode1 due to the index copy from esnode3.
The indices.recovery.max_bytes_per_sec value was configured to 500MB. This seems viable for a startup with local indices that only have to be read locally.
When the indices have to be copied from another node, the write I/O pressure seems so high that the server cannot persist the cluster state locally. This results in its exclusion from the cluster, and it has to restart its whole initialization process.
[2021-01-13T22:17:34,597][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [11805ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [2131] unchanged indices
[2021-01-13T22:17:56,318][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [15407ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [0] indices and skipped [2132] unchanged indices
[2021-01-13T22:19:31,436][WARN ][o.e.g.PersistedClusterStateService] [esnode1] writing cluster state took [15387ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [0] indices and skipped [2132] unchanged indices
...
[2021-01-13T19:26:43,791][WARN ][o.e.c.c.LagDetector     ] [esnode3] node [{esnode1}{t4iSb7f1RZmEwpH4O_OoGw}{CpoEqfyFS-yD7fNQJ624Ig}{192.168.100.61}{192.168.100.61:9300}{dilmrt}{ml.machine_memory=33635225600, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] is lagging at cluster state version [444470], although publication of cluster state version [444471] completed [1.5m] ago
...
The cluster finally recovered after limiting indices.recovery.max_bytes_per_sec to 50MB.
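For the record, this kind of change is applied through the cluster settings API; a sketch, assuming it was set as a transient setting:

curl -s -XPUT http://localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.recovery.max_bytes_per_sec": "50mb"}}'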
The good news is that it does not seem to be caused by the new zfs migration, as there are plenty of such exclusions in the previous logs (a small extract):
swh-logging-prod-2020-01-22-1.json.gz:{"type": "server", "timestamp": "2020-01-22T17:00:38,687Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [1m] publication of cluster state version [207982] is still waiting for {esnode2}{jTNROc0CQyCqUpItsfVMOQ}{gK5P8Q62Swa8redwMr_c-A}{192.168.100.62}{192.168.100.62:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ" }
swh-logging-prod-2020-01-22-1.json.gz:{"type": "server", "timestamp": "2020-01-22T17:03:02,353Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [55.7s] publication of cluster state version [207985] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{ZXrh1hKTRXOYsEM_U2lztg}{192.168.100.61}{192.168.100.61:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ" }
swh-logging-prod-2020-05-04-2.json.gz:{"type": "server", "timestamp": "2020-05-04T18:37:22,705Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "after [30s] publication of cluster state version [304674] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{Edz0GInGTR-LpoPV34OtHQ}{192.168.100.61}{192.168.100.61:9300}{dilm}{ml.machine_memory=33688297472, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ" }
swh-logging-prod-2020-12-04-5.log.gz:[2020-12-04T14:21:25,235][WARN ][o.e.c.c.C.CoordinatorPublication] [esnode3] after [43.8s] publication of cluster state version [361407] is still waiting for {esnode1}{KQbk_8_ARbObrsKYvt0dAQ}{ZHWNVbprQiG3Cg9ZtZMYMQ}{192.168.100.61}{192.168.100.61:9300}{dilmrt}{ml.machine_memory=33688309760, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} [SENT_APPLY_COMMIT]
At some point the cluster was configured with a limit of 40mb, and the value was repeatedly forced to 500MB afterwards, as the logs only show updates in the 40mb -> something direction:
, "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [200MB] to [500MB]", "cluster.uuid": "-pJ9DxzdTIGyqhG-p7fh0Q", "node.id": "SZGj6DMmSmC18ejyOh23fQ" }/var/log/elasticsearch/swh-logging-prod-2020-01-22-1.log.gz:[2020-01-22T21:48:41,410][INFO ][o.e.c.s.ClusterSettings ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [200MB]/var/log/elasticsearch/swh-logging-prod-2020-01-22-1.log.gz:[2020-01-22T21:58:42,985][INFO ][o.e.c.s.ClusterSettings ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [200MB] to [500MB]/var/log/elasticsearch/swh-logging-prod-2020-06-30-1.json.gz:{"type": "server", "timestamp": "2020-06-30T06:03:04,337Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }/var/log/elasticsearch/swh-logging-prod-2020-06-30-1.log.gz:[2020-06-30T06:03:04,337][INFO ][o.e.c.s.ClusterSettings ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]/var/log/elasticsearch/swh-logging-prod-2020-12-04-5.json.gz:{"type": "server", "timestamp": "2020-12-04T14:14:04,227Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }/var/log/elasticsearch/swh-logging-prod-2020-12-04-5.log.gz:[2020-12-04T14:14:04,227][INFO ][o.e.c.s.ClusterSettings ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]/var/log/elasticsearch/swh-logging-prod-2020-12-14-1.json.gz:{"type": "server", "timestamp": "2020-12-14T12:50:16,482Z", "level": "INFO", "component": "o.e.c.s.ClusterSettings", "cluster.name": "swh-logging-prod", "node.name": "esnode3", "message": "updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]" }/var/log/elasticsearch/swh-logging-prod-2020-12-14-1.log.gz:[2020-12-14T12:50:16,482][INFO ][o.e.c.s.ClusterSettings ] [esnode3] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [500MB]
Keeping esnode2 in the cluster with no shards allocated allowed esnode3 to hold all the primary shards of the indices, so the cluster did not turn red when esnode1 was excluded from the cluster.
Now that the cluster has stabilized, esnode2 can be stopped and migrated to zfs.
kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab  2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab  2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 / ext4 errors=remount-ro 0 1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320 /boot/efi vfat umask=0077 0 1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka ext4 defaults 0 2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none swap sw 0 0
-/dev/md0 /srv/elasticsearch xfs defaults,noatime 0 0
Removing the old raid:
root@esnode2:~# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
       Update Time : Wed May 23 08:21:35 2018
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0
        Chunk Size : 512K
Consistency Policy : none
              Name : esnode2:0  (local to host esnode2)
              UUID : 9bdaa5d1:de7dbae8:6c57c7f9:55875c0b
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
root@esnode2:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
There is no configuration defined in the /etc/mdadm/mdadm.conf file, so nothing to clean up there.
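Not shown in the transcript: the leftover linux_raid_member signatures on the former members can also be cleared explicitly (the sfdisk run below warns about them); a sketch, assuming sdb, sdc and sdd were the members:

mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd
# or, alternatively, remove all remaining signatures:
wipefs -a /dev/sdb /dev/sdc /dev/sdd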
Replicate the sda partitioning to the other disks:
root@esnode2:/etc/mdadm# sfdisk --wipe /dev/sdb
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdb
Checking that no-one is using this disk right now ... OK

The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: 8606F89B-F66E-4B25-AD6E-D7C8BF9EF2F6).
The old linux_raid_member signature may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.
/dev/sdb1: Created a new partition 1 of type 'EFI System' and of size 512 MiB.
/dev/sdb2: Created a new partition 2 of type 'Linux filesystem' and of size 37.3 GiB.
/dev/sdb3: Created a new partition 3 of type 'Linux swap' and of size 29.8 GiB.
/dev/sdb4: Created a new partition 4 of type 'Linux filesystem' and of size 1.8 TiB.
/dev/sdb5: Done.

New situation:
Disklabel type: gpt
Disk identifier: 8606F89B-F66E-4B25-AD6E-D7C8BF9EF2F6

Device         Start        End    Sectors  Size Type
/dev/sdb1       2048    1050623    1048576  512M EFI System
/dev/sdb2    1050624   79175679   78125056 37.3G Linux filesystem
/dev/sdb3   79175680  141676543   62500864 29.8G Linux swap
/dev/sdb4  141676544 3907028991 3765352448  1.8T Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdc
...
root@esnode2:/etc/mdadm# sfdisk -d /dev/sda | sfdisk -f /dev/sdd
...
zfs creation
root@esnode2:~# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
root@esnode2:~# zpool status elasticsearch-data
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                STATE     READ WRITE CKSUM
	elasticsearch-data  ONLINE       0     0     0
	  sda4              ONLINE       0     0     0
	  sdb4              ONLINE       0     0     0
	  sdc4              ONLINE       0     0     0
	  sdd4              ONLINE       0     0     0

errors: No known data errors
root@esnode2:~# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   141K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
root@esnode2:~# chown elasticsearch: /srv/elasticsearch/nodes
restart elasticsearch and remove the allocation exclusion
root@esnode2:~# puppet agent --enable
root@esnode2:~# systemctl enable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable elasticsearch
Created symlink /etc/systemd/system/multi-user.target.wants/elasticsearch.service -> /usr/lib/systemd/system/elasticsearch.service.
root@esnode2:~# systemctl start elasticsearch
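The allocation exclusion itself is removed through the cluster settings API; a sketch, assuming the exclusion had been set as a transient setting on the node name:

curl -s -XPUT http://localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'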
After a reboot, the message "Failed to start Import ZFS pools by cache file" is displayed on the server console and the pool is not mounted. It seems this can be caused by using /dev/sd* disk names directly.
The pool was recreated with the WWN[1] ids, which solved the issue. Once esnode3 is migrated, I will check whether the pool on esnode1 needs to be changed as well.
root@esnode2:~# zpool create -f elasticsearch-data -m /srv/elasticsearch/nodes -O atime=off -O relatime=on $(ls /dev/disk/by-id/wwn-*part4)
root@esnode2:~# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-data     7T   156K  7.00T        -         -     0%     0%  1.00x    ONLINE  -
root@esnode2:~# zpool status
  pool: elasticsearch-data
 state: ONLINE
  scan: none requested
config:

	NAME                            STATE     READ WRITE CKSUM
	elasticsearch-data              ONLINE       0     0     0
	  wwn-0x5000cca25ecbedbb-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2106-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc2c6d-part4  ONLINE       0     0     0
	  wwn-0x5000cca25ecc4292-part4  ONLINE       0     0     0

errors: No known data errors
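If the pool on esnode1 also needs to be switched to the stable WWN ids, it should be doable without recreating it, by exporting it and re-importing it from /dev/disk/by-id (sketch; elasticsearch must be stopped while the pool is exported):

zpool export elasticsearch-data
zpool import -d /dev/disk/by-id elasticsearch-data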