Create an inventory of useful Munin metrics
Activity
- Phabricator Migration user marked this issue as related to #1408 (closed)
- François Tigeot added Metrics/monitoring Sprint 2018 12 priority:Normal labels
- Author
Disk
- I/Os per device
- Disk usage in percent
- Utilization per device
Is this real? It could be useful to see whether a storage subsystem is overloaded.
- Disk usage in absolute human values.
percentages are meaningless if we resize filesystems
Networking
- eth0 traffic
Database
- Postgres replication lag
- Postgres database size
- Postgres oldest query + oldest transaction
- Postgres scan types (sequential / indexed)
- Postgres wal segments
- Postgres nb. of transactions
System
- CPU usage
- load average
- Memory usage
- Pending packages
- Swap in/out
- Uptime
RabbitMQ
- Consumers
- Memory used by queue
- Unacknowledged messages
- Nb. of connections
Softwareheritage (prado)
- Almost everything
- Most importantly Software Heritage Objects
- François Tigeot added state:wip label
- vlorentz added priority:High label and removed priority:Normal label
- Maintainer
| Munin metric | Comment | Prometheus metric combination | Prometheus comment |
|---|---|---|---|
| Disk | | | |
| I/Os per device | | node_disk_reads_completed_total; node_disk_writes_completed_total | Add derivative to get IOPS |
| Disk usage in percent (space) | | (node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes) / node_filesystem_size_bytes | avail = available to non-root, free = available to root (tune2fs -m / reserved-blocks-percentage) |
| Disk usage in percent (inodes) | | (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files | |
| Utilization per device | is this real ? it could be useful to see if a storage subsystem is overloaded | node_disk_io_time_seconds_total | total time spent in seconds doing IO on the specified device; AFAICT the derivative of this counter is what munin calls "utilization per device" |
| | | node_disk_io_time_weighted_seconds_total | counts the number of seconds spent doing IO multiplied by the number of concurrent IO requests; maybe more relevant ? Docs: https://www.kernel.org/doc/Documentation/iostats.txt |
| Disk usage in absolute human values. | percentages are meaningless if we resize filesystems | node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes | avail = available to non-root, free = available to root |
| Networking | | | |
| eth0 traffic | | node_network_receive_bytes_total; node_network_transmit_bytes_total | derivative for bytes per second |
| | | node_network_receive_packets_total; node_network_transmit_packets_total | derivative for packets per second |
| | | node_network_receive_errs_total; node_network_transmit_errs_total | alert if non-zero |
| Database | | | |
| | | | implemented with prometheus-sql-exporter |
| Postgres replication lag | | sql_pg_stat_replication{col=~'(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)'} | replace commas with pipes... |
| Postgres database size | | sql_pg_stat_database{col="dbsize"} | |
| Postgres oldest transaction | | sql_pg_stat_activity{col="max_tx_duration"} | |
| Postgres oldest query | | ? | |
| Postgres scan types (sequential / indexed) | | sql_pg_stat_user_tables; sql_pg_statio_user_tables | |
| Postgres wal segments | | sql_archive_ready; sql_pg_stat_archiver | use derivative of sql_pg_stat_archiver values to get archival rates |
| Postgres nb. of transactions | | sql_txid | derivative to get tps |
| System | | | |
| CPU usage | | node_cpu_seconds_total | use derivative for CPU usage |
| load average | | node_load{1,5,15} | |
| Memory usage | | node_memory* | |
| Pending packages | | XXX | needs to be implemented with the textfile collector (see /usr/share/doc/prometheus-node-exporter/examples/text_collector_examples/apt.sh) |
| Swap in/out | | node_vmstat_pswpin; node_vmstat_pswpout | unit ?? probably absolute number of pages |
| Uptime | | time() - node_boot_time_seconds | |
| RabbitMQ | | | |
| | | | use https://github.com/kbudde/rabbitmq_exporter or https://github.com/deadtrickster/prometheus_rabbitmq_exporter |
| Consumers | | | |
| Memory used by queue | | | |
| Unacknowledged messages | | | |
| Nb. of connections | | | |
| Softwareheritage (prado) | | | |
| Almost everything | | | integrate to sql-exporter configuration |
| Most importantly Software Heritage Objects | | | |
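Several rows above say "derivative" of a `_total` counter, and the disk usage rows use a (size - avail) / size ratio. As a rough illustration of what those two computations mean, here is a minimal Python sketch. The function names `per_second_rate` and `usage_ratio` and the sample values are hypothetical, and this is a simplification, not how Prometheus itself computes `rate()` (which also extrapolates to the window boundaries).

```python
from typing import Sequence, Tuple

# One scrape of a counter: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase of a monotonic counter over the samples.

    A drop in the value is treated as a counter reset (e.g. after a reboot):
    the post-reset value is counted as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


def usage_ratio(size_bytes: float, avail_bytes: float) -> float:
    """Fraction of a filesystem in use: (size - avail) / size."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    return (size_bytes - avail_bytes) / size_bytes


if __name__ == "__main__":
    # Hypothetical node_disk_reads_completed_total samples, 15 s apart.
    reads = [(0.0, 1000.0), (15.0, 1300.0), (30.0, 1600.0)]
    print(per_second_rate(reads))      # read IOPS over the window
    print(usage_ratio(100.0, 25.0))    # fraction of space in use
```

Passing the `avail` value gives usage as seen by non-root users; passing the `free` value gives usage as seen by root, matching the avail/free distinction noted in the table.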
- Author
Already marked as done on 2018-12-19.
- François Tigeot assigned to @ftigeot
- François Tigeot removed state:wip label
- François Tigeot closed