Install the new bare metal server(s) for the production ELK cluster
Installation procedure: https://docs.softwareheritage.org/sysadm/server-architecture/howto-install-new-physical-server.html
Inventory:
- esnode8: https://inventory.internal.admin.swh.network/dcim/devices/288/
- esnode9: https://inventory.internal.admin.swh.network/dcim/devices/289/
Environment: production
Summary:
- esnode8
  - Management address (DNS): 128.93.134.55 (N/A)
  - VLAN configuration: VLAN440
  - Internal IP(s): 192.168.100.65
  - Internal DNS name(s): esnode8.internal.softwareheritage.org
- esnode9
  - Management address (DNS): 128.93.134.56 (N/A)
  - VLAN configuration: VLAN440
  - Internal IP(s): 192.168.100.66
  - Internal DNS name(s): esnode9.internal.softwareheritage.org
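Once the nodes are declared, a quick sanity check of the internal DNS records from any internal host should return the IPs listed above (a minimal sketch, nothing more):

dig +short esnode8.internal.softwareheritage.org   # expected: 192.168.100.65
dig +short esnode9.internal.softwareheritage.org   # expected: 192.168.100.66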
For each node:
- Declare the servers in the inventory
- Add the management info in the credential store
- Install the OS
- (if needed) Add puppet configuration
- Run puppet so it installs the required software with the es profile
- Adapt the zfs configuration accordingly (if puppet can't do it already)
- Create a swap at least the size of the machine's memory (see the sketch after this checklist)
- Update firewall rules with the new machine's ip (e.g. swh_$environment_kube_workers, ...)
- (other actions if needed, drop unneeded actions)
- Tag(s) grafana
- Update passwords in credential store
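For the swap item above, a minimal sketch of one possible approach (a swap file sized to the machine's RAM; the file name and placement are assumptions, and it should not live on the ZFS pool):

# Hypothetical sketch: create a swap file the size of the machine's RAM.
RAM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
fallocate -l "${RAM_KB}K" /swapfile   # assumes / has enough space and is not on ZFS
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab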
Activity
- Antoine R. Dumont added activity::Deployment label
- Antoine R. Dumont changed the description
- Owner
We need to ask for the access port configuration. It's usually done when we request the installation in a ticket to the DSI; this time we installed the servers ourselves, so the request was not made.
- Antoine R. Dumont mentioned in commit ipxe@c3014cbf
- Author Owner
[stand-by] We need to request the access port configuration from the DSI; Vince will do it tomorrow.
- Antoine R. Dumont changed the description
- Antoine R. Dumont mentioned in commit c727d12b
- Antoine R. Dumont mentioned in commit ipxe@60105032
- Antoine R. Dumont mentioned in commit ipxe@18cc6460
- Author Owner
The esnode8.yaml and esnode9.yaml configuration files have been pushed to the ipxe repository, by the way. So if someone takes over, what remains is to build the ISO and start the install.
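For reference, building a bootable ISO with an embedded per-host boot script from upstream iPXE sources usually looks like the sketch below; the actual targets/wrappers used by our ipxe repository (and the exact script name) may differ, so treat this as a generic illustration only:

# Generic sketch, assuming upstream iPXE sources and a per-host embedded boot script.
git clone https://github.com/ipxe/ipxe.git
cd ipxe/src
make bin/ipxe.iso EMBED=/path/to/esnode8.ipxe   # embeds the boot script into the ISO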
- Antoine R. Dumont mentioned in commit ipxe@dcf7a9cd
- Antoine R. Dumont mentioned in commit ipxe@ac4a31ad
- Antoine R. Dumont mentioned in commit ipxe@38e6945b
- Antoine R. Dumont mentioned in commit ipxe@306b101a
- Author Owner
The OS got installed on both machines as bullseye, though (only noticed after the install).
I updated the template to default to bookworm, so the next install should use bookworm.
I've dist-upgraded the machines to bookworm.
What remains is to configure the machines so they can join the ES cluster.
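For the record, the in-place bullseye to bookworm upgrade boils down to the standard Debian procedure, roughly as follows (a minimal sketch, assuming plain Debian APT sources):

# Minimal sketch: in-place bullseye -> bookworm upgrade.
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt update
apt full-upgrade -y
reboot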
- Antoine R. Dumont mentioned in commit swh/infra/puppet/puppet-swh-site@6d3d64d6
- Antoine R. Dumont mentioned in commit swh/infra/puppet/puppet-swh-site@de6a01ac
- Antoine R. Dumont marked the checklist item (if needed) Add puppet configuration as completed
- Antoine R. Dumont marked the checklist item Run puppet so it installs the required software with the es profile as completed
- Antoine R. Dumont marked the checklist item Adapt the zfs configuration accordingly (if puppet can't do it already) as completed
- Antoine R. Dumont marked the checklist item Install the OS as completed
- Antoine R. Dumont changed the description
- Author Owner
A few bumps in the road, but esnode8 and 9 are joining the ES cluster.
The puppet certificate had to be generated from pergamon (as we currently do when we need to renew the puppet certificates of a bookworm machine).
I've added a script:
/usr/local/bin/puppet5-generate-certificate.sh $HOST_FQDN $HOST_IP
to ease the generation/signing/copy of the certificate files. It assumes the ssl directory tree is already present on the $HOST_FQDN machine (after a faulty puppet agent run, for example, it will be there). It's a basic script for now.
The first puppet agent run failed: the elasticsearch sources.list had to be adapted slightly so that the first puppet agent run actually goes through without issues.
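Roughly, the script does something along these lines (a simplified sketch of the assumed behaviour, not the actual script; the ssl paths and the --dns_alt_names usage are assumptions based on a Debian puppet 5 setup):

#!/usr/bin/env bash
# Simplified sketch of puppet5-generate-certificate.sh (assumed behaviour, not the real script).
set -euo pipefail
HOST_FQDN=$1
HOST_IP=$2
SSL_DIR=/var/lib/puppet/ssl   # assumed Debian puppet 5 ssldir

# Generate and sign the certificate on the puppet 5 CA (run on pergamon).
puppet cert generate "$HOST_FQDN" --dns_alt_names "$HOST_FQDN,$HOST_IP"

# Copy the signed certificate, private key and CA certificate into the agent's existing ssl tree.
scp "$SSL_DIR/ca/signed/$HOST_FQDN.pem" "root@$HOST_FQDN:$SSL_DIR/certs/$HOST_FQDN.pem"
scp "$SSL_DIR/private_keys/$HOST_FQDN.pem" "root@$HOST_FQDN:$SSL_DIR/private_keys/$HOST_FQDN.pem"
scp "$SSL_DIR/certs/ca.pem" "root@$HOST_FQDN:$SSL_DIR/certs/ca.pem"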
The zpools were prepared manually [1] [2]. Once the first puppet agent run is done, we have the necessary tools for that [1].
Some extra elasticsearch configuration had to be adapted so the service starts correctly (details are in the linked commit); it was complaining about xpack settings not being set.
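The exact settings are in the linked commit; purely as an illustration, the xpack complaint typically goes away with something along these lines (the value below is an assumption, given the cluster is reachable over plain http):

# Illustrative only: the actual settings live in the linked puppet commit.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
xpack.security.enabled: false
EOF
systemctl restart elasticsearch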
Some notes:
- megacli is no longer packaged in bookworm, so it failed to get installed
- [repeat] we can't just run the puppet agent for the first time on a bookworm machine while the puppet master remains at puppet 5
- we could probably evolve the elasticsearch role (swh-site) to use profile::zfs::common (like we do for the rancher metal nodes), so we can declare the zpool and avoid creating it manually
[1]
root@esnode8:~# zpool create elasticsearch nvme-MO003200KYDNC_S70NNT0X604237 nvme-MO003200KYDNC_S70NNT0X604239 nvme-MO003200KYDNC_S70NNT0X604240 nvme-MO003200KYDNC_S70NNT0X604242
root@esnode8:~# zpool status
  pool: elasticsearch
 state: ONLINE
config:
        NAME                                 STATE     READ WRITE CKSUM
        elasticsearch-data                   ONLINE       0     0     0
          nvme-MO003200KYDNC_S70NNT0X604237  ONLINE       0     0     0
          nvme-MO003200KYDNC_S70NNT0X604239  ONLINE       0     0     0
          nvme-MO003200KYDNC_S70NNT0X604240  ONLINE       0     0     0
          nvme-MO003200KYDNC_S70NNT0X604242  ONLINE       0     0     0
errors: No known data errors
root@esnode8:~# zpool list -v
NAME                                  SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch                        11.6T   600K  11.6T        -         -     0%     0%  1.00x    ONLINE  -
  nvme-MO003200KYDNC_S70NNT0X604237  2.91T    66K  2.91T        -         -     0%  0.00%      -    ONLINE
  nvme-MO003200KYDNC_S70NNT0X604239  2.91T   174K  2.91T        -         -     0%  0.00%      -    ONLINE
  nvme-MO003200KYDNC_S70NNT0X604240  2.91T   180K  2.91T        -         -     0%  0.00%      -    ONLINE
  nvme-MO003200KYDNC_S70NNT0X604242  2.91T   180K  2.91T        -         -     0%  0.00%      -    ONLINE
root@esnode8:~# zfs create -o mountpoint=/srv/elasticsearch -o atime=off elasticsearch/data
root@esnode8:~# mount | grep elastic
elasticsearch/data on /srv/elasticsearch type zfs (rw,noatime,xattr,noacl)
root@esnode8:~# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
elasticsearch        214K  11.5T    24K  none
elasticsearch/data    24K  11.5T    24K  /srv/elasticsearch
[2]
root@esnode9:~# zpool create elasticsearch nvme-MO003200KYDNC_S70NNT0X604313 nvme-MO003200KYDNC_S70NNT0X604315 nvme-MO003200KYDNC_S70NNT0X604316 nvme-MO003200KYDNC_S70NNT0X604246
root@esnode9:~# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
elasticsearch   118K  11.5T    24K  /elasticsearch
root@esnode9:~# zfs create -o mountpoint=/srv/elasticsearch -o atime=off elasticsearch/data
root@esnode9:~# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
elasticsearch        186K  11.5T    24K  /elasticsearch
elasticsearch/data    24K  11.5T    24K  /srv/elasticsearch
root@esnode9:~# mount | grep elastic
elasticsearch on /elasticsearch type zfs (rw,xattr,noacl)
elasticsearch/data on /srv/elasticsearch type zfs (rw,noatime,xattr,noacl)
root@esnode9:~# zfs set mountpoint=none elasticsearch
root@esnode9:~# mount | grep elastic
elasticsearch/data on /srv/elasticsearch type zfs (rw,noatime,xattr,noacl)
- Author Owner
- Create a swap at least the size of the machine's memory
  esnode7, which also has 64G of RAM (like esnode8-9), has no swap; esnode[1-3] have 32G of swap, which matches their RAM.
- Update firewall rules with the new machine's ip (e.g. swh_$environment_kube_workers, ...)
  No need, it seems: we can already query them from our network.
root@pergamon:~# curl http://esnode8.internal.softwareheritage.org:9200
{
  "name" : "esnode8",
  "cluster_name" : "swh-logging-prod",
  "cluster_uuid" : "-pJ9DxzdTIGyqhG-p7fh0Q",
  "version" : {
    "number" : "8.15.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "253e8544a65ad44581194068936f2a5d57c2c051",
    "build_date" : "2024-09-02T22:04:47.310170297Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
root@pergamon:~# curl http://esnode9.internal.softwareheritage.org:9200
{
  "name" : "esnode9",
  "cluster_name" : "swh-logging-prod",
  "cluster_uuid" : "-pJ9DxzdTIGyqhG-p7fh0Q",
  "version" : {
    "number" : "8.15.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "253e8544a65ad44581194068936f2a5d57c2c051",
    "build_date" : "2024-09-02T22:04:47.310170297Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
- Antoine R. Dumont marked the checklist item Update firewall rules with the new machine's ip (e.g. swh_$environment_kube_workers, ...) as completed
- Antoine R. Dumont changed the description
- Author Owner
The new esnode8 and esnode9 have fully joined the cluster.
Let's close this.
root@esnode1:~# server=http://esnode1.internal.softwareheritage.org:9200; date; curl -s $server/_cat/allocation?v\&s=node; echo; curl -s $server/_cluster/health | jq
Tue Jan 21 13:45:48 UTC 2025
shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host           ip             node    node.role
   638                0                 0.0                 5.3tb        5.3tb     5.3tb      1.5tb      6.8tb           77 192.168.100.61 192.168.100.61 esnode1 cdfhilmrstw
   629                0                 0.0                 5.5tb        5.5tb     5.5tb      1.3tb      6.8tb           80 192.168.100.62 192.168.100.62 esnode2 cdfhilmrstw
   636                0                 0.0                 5.3tb        5.3tb     5.4tb      1.4tb      6.8tb           78 192.168.100.63 192.168.100.63 esnode3 cdfhilmrstw
   644                0                 0.0                 5.2tb        5.2tb     5.2tb      8.6tb     13.8tb           37 192.168.100.64 192.168.100.64 esnode7 cdfhilmrstw
   644                0                 0.0                 5.2tb        5.2tb     5.2tb      6.2tb     11.4tb           45 192.168.100.65 192.168.100.65 esnode8 cdfhilmrstw
   641                0                 0.0                 5.2tb        5.2tb     5.3tb      6.1tb     11.4tb           46 192.168.100.66 192.168.100.66 esnode9 cdfhilmrstw

{
  "cluster_name": "swh-logging-prod",
  "status": "green",                        <- "green, green, super green!"
  "timed_out": false,
  "number_of_nodes": 6,                     <- 4 + the 2 new nodes, 6, good!
  "number_of_data_nodes": 6,
  "active_primary_shards": 1916,
  "active_shards": 3832,
  "relocating_shards": 0,                   <- No shards to reassign or
  "initializing_shards": 0,                 <- shuffle around, everything
  "unassigned_shards": 0,                   <- looks good
  "delayed_unassigned_shards": 0,           <-
  "number_of_pending_tasks": 0,             <-
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
- Antoine R. Dumont closed