Huge slowdowns on louvre since 2018-08-20
The louvre hypervisor has seen tremendous slowdowns since 2018-08-20. Some VMs completely froze for minutes at a time and had to be migrated to beaubourg.
Migrated from T1173 (view on Phabricator)
- Phabricator Migration user marked this issue as related to #1069 (closed)
- Phabricator Migration user marked this issue as related to #1166
- François Tigeot added System administration priority:Normal labels
- Author
At least three important changes were made on 2018-08-20:
- Uffizi has been morphed from a QEMU VM into an LXC container
- VM storage has been migrated to a Ceph backend
- The Proxmox PVE Linux kernel has been updated
- Author
One of the impacted VMs is logstash0:
[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018] Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
Some puppet manifests no longer behave as expected on this machine, possibly due to some form of disk corruption.
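For reference, a quick way to look for this kind of stall inside an affected guest is sketched below; it assumes the sysstat package is installed and only shows generic commands, not logstash0's actual output.

```sh
# Hung-task and SCSI abort messages in the kernel log
dmesg -T | grep -E 'blocked for more than|abort'

# Per-device utilization and service times (3 samples, 5 seconds apart)
iostat -x 5 3

# The "wa" column shows the share of CPU time spent waiting on I/O
vmstat 5 3
```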
- Author
logstash0 migrated to beaubourg as well (complete shutdown and restart included).
- Owner
> logstash0 migrated to beaubourg as well (complete shutdown and restart included).
And the latest puppet agent --test run now successfully writes the password changes it had tried to apply time and again [1]
- Author
Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:
- Most VMs suffer from I/O wait issues since August 20, 2018
- Ceph nodes are not limited by network bandwidth and only sustain ~120 Mb/s of peak traffic
- Ceph nodes suffer from I/O wait themselves (see the measurement sketch below)

The last point is not very surprising since their storage mostly consists of rotating disk drives.
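A rough sketch of how these numbers can be checked on a Ceph node; the pool name rbd is an assumption, and the benchmark adds real load, so it should only be run outside peak hours.

```sh
# Per-disk latency and utilization on an OSD node (rotating drives show up here)
iostat -x 5 3

# Commit/apply latency as reported by Ceph for each OSD
ceph osd perf

# Synthetic 60-second write benchmark against a pool (pool name "rbd" is a placeholder)
rados bench -p rbd 60 write --no-cleanup
rados -p rbd cleanup
```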
- François Tigeot added state:wip label
- Author
munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.
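For the record, moving a disk image between storages can be done from the Proxmox CLI along these lines; the VM id and storage names below are placeholders, not the actual munin0 configuration.

```sh
# Move the VM's first SCSI disk from the Ceph-backed storage to local SSD storage,
# deleting the source image afterwards (id 105 and "local-ssd" are placeholders)
qm move_disk 105 scsi0 local-ssd --delete 1
```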
- Author
The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.
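The migrations themselves are plain hypervisor-to-hypervisor moves, roughly as sketched below; the VM id is a placeholder and the Ceph-backed disk stays on the same shared storage.

```sh
# Live-migrate a VM to beaubourg; only the running state moves,
# the Ceph-backed disk image stays where it is (id 107 is a placeholder)
qm migrate 107 beaubourg --online
```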
- Stefano Zacchiroli added priority:UnbreakNow! label and removed priority:Normal label
- Author
Kibana0 migrated from Ceph to local SSD storage on louvre. It was the last VM running on louvre that was still experiencing visible I/O wait.
- Author
Moma storage migrated from Ceph to SSD storage on beaubourg. CPU and memory sizes were way overkill and have been cut in half. If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.
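Halving the resources boils down to something like the following on the Proxmox side; the VM id and the example values are placeholders, not moma's real figures.

```sh
# Shrink the VM's allocated cores and memory (example values only);
# the change typically takes effect after a full stop/start of the VM
qm set 110 --cores 4 --memory 8192
```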
- Phabricator Migration user marked this issue as related to #1176 (closed)
- Author
PCID option removed on some VMs in order to migrate them to orsay. The current plan is to completely replace louvre with a more recent and reliable machine for the hypervisor functions.
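Dropping the PCID requirement amounts to adjusting the virtual CPU definition, roughly as sketched below; the VM id and the kvm64 model are placeholders, and the exact flags syntax depends on the qemu-server version.

```sh
# Reset the virtual CPU to a model that does not require the pcid flag so the
# VM can start on orsay (id 112 and "kvm64" are placeholders; on recent
# qemu-server versions the flag can also be disabled explicitly with
# "--cpu kvm64,flags=-pcid")
qm set 112 --cpu kvm64
```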
- Stefano Zacchiroli added priority:High label and removed priority:UnbreakNow! label
- Phabricator Migration user marked this issue as related to #1392 (closed)
- Nicolas Dandrimont added state:wontfix label and removed state:wip label
- Nicolas Dandrimont closed this issue
- Owner
"louvre" is not a hypervisor any longer.
- François Tigeot mentioned in issue #1526 (closed)