Huge slowdowns on louvre since 2018-08-20
The louvre hypervisor has seen tremendous slowdowns since 2018-08-20. Some VMs completely froze for minutes at a time and had to be migrated to beaubourg.
Migrated from T1173 (view on Phabricator)
- Phabricator Migration user marked this issue as related to #1069 (closed)
- Phabricator Migration user marked this issue as related to #1166
- François Tigeot added System administration priority:Normal labels
- Author
At least three important changes were made on 2018-08-20:
- Uffizi has been morphed from a QEMU VM into an LXC container
- VM storage has been migrated to a Ceph backend
- The Proxmox PVE Linux kernel has been updated
- Author
One of the impacted VMs is logstash0:
[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018] Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
Some puppet manifests no longer behave as expected on this machine, possibly due to some form of disk corruption.
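For reference, a quick way to look for this kind of stall inside an affected guest is sketched below; it assumes the sysstat package is installed and only shows generic commands, not logstash0's actual output.

```sh
# Hung-task and SCSI abort messages in the kernel log
dmesg -T | grep -E 'blocked for more than|abort'

# Per-device utilization and service times (3 samples, 5 seconds apart)
iostat -x 5 3

# The "wa" column shows the share of CPU time spent waiting on I/O
vmstat 5 3
```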
- Author
logstash0 migrated to beaubourg as well (complete shutdown and restart included).
- Owner
> logstash0 migrated to beaubourg as well (complete shutdown and restart included).
And the latest puppet agent --test run now successfully writes the password changes it had tried to apply time and again [1]
- Author
Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:
- Most VMs suffer from I/O wait issues since August 20, 2018
- Ceph nodes are not limited by network bandwidth and only sustain ~120 Mb/s of peak traffic
- Ceph nodes suffer from I/O wait themselves (see the measurement sketch below)

The last point is not very surprising since their storage mostly consists of rotating disk drives.
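A rough sketch of how these numbers can be checked on a Ceph node; the pool name rbd is an assumption, and the benchmark adds real load, so it should only be run outside peak hours.

```sh
# Per-disk latency and utilization on an OSD node (rotating drives show up here)
iostat -x 5 3

# Commit/apply latency as reported by Ceph for each OSD
ceph osd perf

# Synthetic 60-second write benchmark against a pool (pool name "rbd" is a placeholder)
rados bench -p rbd 60 write --no-cleanup
rados -p rbd cleanup
```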
- François Tigeot added state:wip label
- Author
munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.
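For the record, moving a disk image between storages can be done from the Proxmox CLI along these lines; the VM id and storage names below are placeholders, not the actual munin0 configuration.

```sh
# Move the VM's first SCSI disk from the Ceph-backed storage to local SSD storage,
# deleting the source image afterwards (id 105 and "local-ssd" are placeholders)
qm move_disk 105 scsi0 local-ssd --delete 1
```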
- Author
The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.
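The migrations themselves are plain hypervisor-to-hypervisor moves, roughly as sketched below; the VM id is a placeholder and the Ceph-backed disk stays on the same shared storage.

```sh
# Live-migrate a VM to beaubourg; only the running state moves,
# the Ceph-backed disk image stays where it is (id 107 is a placeholder)
qm migrate 107 beaubourg --online
```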
- Stefano Zacchiroli added priority:UnbreakNow! label and removed priority:Normal label
- Author
Kibana0 migrated from Ceph to local SSD storage on louvre. It was the last VM running on louvre that was still experiencing visible I/O wait.
- Author
Moma storage migrated from Ceph to SSD storage on beaubourg. CPU and memory sizes were way overkill and have been cut in half. If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.
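Halving the resources boils down to something like the following on the Proxmox side; the VM id and the example values are placeholders, not moma's real figures.

```sh
# Shrink the VM's allocated cores and memory (example values only);
# the change typically takes effect after a full stop/start of the VM
qm set 110 --cores 4 --memory 8192
```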
- Phabricator Migration user marked this issue as related to #1176 (closed)
- Author
PCID option removed on some VMs in order to migrate them to orsay. The current plan is to completely replace louvre with a more recent and reliable machine for the hypervisor functions.
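Dropping the PCID requirement amounts to adjusting the virtual CPU definition, roughly as sketched below; the VM id and the kvm64 model are placeholders, and the exact flags syntax depends on the qemu-server version.

```sh
# Reset the virtual CPU to a model that does not require the pcid flag so the
# VM can start on orsay (id 112 and "kvm64" are placeholders; on recent
# qemu-server versions the flag can also be disabled explicitly with
# "--cpu kvm64,flags=-pcid")
qm set 112 --cpu kvm64
```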
- Stefano Zacchiroli added priority:High label and removed priority:UnbreakNow! label
- Phabricator Migration user marked this issue as related to #1392 (closed)
- Nicolas Dandrimont added state:wontfix label and removed state:wip label
- Nicolas Dandrimont closed this issue
- Owner
"louvre" is not a hypervisor any longer.
- François Tigeot mentioned in issue #1526 (closed)