Proxmox Hypervisor Monitoring with Telegraf and InfluxDB
Published: 2021-05-05, Revised: 2024-02-13
TL;DR This post describes how to install Telegraf on Proxmox to collect sensor readings, SMART data, and metrics in InfluxDB 2.0.
Motivation Critical infrastructure needs monitoring. For the Proxmox hypervisor, I wanted to monitor:
- resource usage such as disk/CPU/memory
- HDD health (RAID, S.M.A.R.T.)
- system temperature, fan RPM, and other sensors
InfluxDB is well suited for this purpose, and it can be connected directly to Grafana. The Proxmox interface already offers the option to connect to a metric server such as InfluxDB; however, it will only send the standard metrics that are available in the dashboard.
To include SMART monitoring and sensor readings, Telegraf must be installed on the Proxmox host.
There are some instructions available on how to do this, but I found no single source that covers all required steps.
This post covers:
- installation of Telegraf and dependencies
- configuration of smart, sensors and metrics collectors in Telegraf
Not included here is the setup of InfluxDB 2.0 itself. I installed it in a separate LXC container running Debian, following the default instructions from the docs.
Also, I run InfluxDB 2.0 behind a Nginx reverse proxy, which makes the interface available through HTTPS with Let's Encrypt SSL certs in a local subdomain. The instructions below are the same, regardless of whether InfluxDB is available through an IP or a domain name.
For the sake of completeness, my Nginx config for InfluxDB 2.0 is below:
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name influx.local.mytld.com;
ssl_certificate /etc/nginx/ssl/wildcard.local.mytld.com.fullchain;
ssl_certificate_key /etc/nginx/ssl/wildcard.local.mytld.com.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!MEDIUM:!LOW:!aNULL:!NULL:!SHA;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
location / {
proxy_pass http://localhost:8086;
proxy_redirect off;
proxy_http_version 1.1;
proxy_max_temp_file_size 10m;
proxy_connect_timeout 20;
proxy_send_timeout 20;
proxy_read_timeout 20;
proxy_set_header Host $host;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection keep-alive;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_set_header X-Original-For $proxy_add_x_forwarded_for;
proxy_set_header X-Original-Proto https;
proxy_cache_bypass $http_upgrade;
}
}
When I go to https://influx.local.mytld.com, the InfluxDB 2.0 frontend opens.
Note that all metric collectors must be configured to use port 443 instead of 8086, and they must also have the current SSL certs available.
Redirect metric collection
The first step is to redirect Proxmox metric collection to a local Socket that can be consumed by Telegraf.
The setting file can be found at:
/etc/pve/status.cfg
If you have anything set in the Proxmox web interface under Datacenter > Metric Server, it will be stored in this file.
Edit the file (e.g. nano /etc/pve/status.cfg) and replace the contents with the following lines:
influxdb: InfluxDB
server 127.0.0.1
port 8089
You can select any name here for the metric server; I used InfluxDB.
Using these settings, Proxmox will send metrics internally to port 8089 on localhost, which we will connect to from Telegraf in the next step.
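Under the hood, Proxmox emits these metrics as InfluxDB line protocol over UDP, which is exactly what Telegraf's socket_listener consumes in the next step. As a minimal illustration, here is a Python sketch that decomposes one such record (the sample line and its field names are illustrative, not the exact Proxmox schema):

```python
def parse_line_protocol(line):
    """Split an InfluxDB line-protocol record into measurement, tags, fields, timestamp.
    Escaping rules are ignored; this is only enough for well-behaved metric lines."""
    head, field_str, ts = line.rsplit(" ", 2)
    measurement, *tag_pairs = head.split(",")
    tags = dict(p.split("=", 1) for p in tag_pairs)
    fields = {}
    for pair in field_str.split(","):
        k, v = pair.split("=", 1)
        fields[k] = float(v.rstrip("i"))  # drop the integer suffix if present
    return measurement, tags, fields, int(ts)

# Illustrative sample only -- not the exact schema Proxmox sends:
sample = "system,object=lxc,vmid=101,host=pve cpu=0.03,mem=524288000 1620000000000000000"
m, tags, fields, ts = parse_line_protocol(sample)
print(m, tags["vmid"], fields["cpu"])  # system 101 0.03
```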
Install Telegraf
I do not like making modifications to the Proxmox host for several reasons, but this is unavoidable* if you want to directly collect SMART data and sensor readings. The commands below are from the Telegraf docs (check there for changes first).
wget -qO- https://repos.influxdata.com/influxdb.key | sudo tee /etc/apt/trusted.gpg.d/influxdb.asc >/dev/null
source /etc/os-release
echo "deb https://repos.influxdata.com/${ID} ${VERSION_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
sudo apt-get update && sudo apt-get install telegraf
* Not exactly. You could use PCI passthrough to forward all sensors to a VM. This would be the cleanest approach, but also the most laborious.
Configure telegraf plugins
A sample telegraf.conf is available that contains all plugins.
Make a backup and create a new, empty telegraf.conf.
cp /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.bak
rm /etc/telegraf/telegraf.conf
nano /etc/telegraf/telegraf.conf
Use the following configuration settings as a template.
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = ""
omit_hostname = false
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]
urls = ["https://influx.local.tld.com"]
token = "your influxdb-2.0-token"
organization = "your business name"
bucket = "your_bucket"
# Gather metrics from Proxmox based on what is in /etc/pve/status.cfg
[[inputs.socket_listener]]
service_address = "udp://:8089"
[[inputs.smart]]
## Optionally specify the path to the smartctl executable
path_smartctl = "/usr/sbin/smartctl"
path_nvme = "/usr/sbin/nvme"
use_sudo = true
devices = [
"/dev/bus/0 -d megaraid,8",
"/dev/bus/0 -d megaraid,9",
"/dev/bus/0 -d megaraid,10",
"/dev/bus/0 -d megaraid,11"]
[[inputs.sensors]]
## Remove numbers from field names.
## If true, a field name like 'temp1_input' will be changed to 'temp_input'.
# remove_numbers = true
## Timeout is the maximum amount of time that the sensors command can run.
# timeout = "5s"
[[inputs.apcupsd]]
# A list of running apcupsd servers to connect to.
# If not provided, defaults to tcp://127.0.0.1:3551
servers = ["tcp://127.0.0.1:3551"]
## Timeout for dialing server.
timeout = "5s"
[[outputs.influxdb_v2]]
In InfluxDB 2.0, add a bucket and an organization, then create a token. This will be used by Telegraf to authenticate and write metrics.
Replace https://influx.local.tld.com with your InfluxDB 2.0 domain or IP/port.
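Before starting Telegraf, you can sanity-check the token and bucket by assembling a test write against the InfluxDB 2.x write API (/api/v2/write with a Token header). A hedged Python sketch using the placeholder values from the config above; actually sending the request with urllib.request is left out:

```python
import urllib.parse

def build_write_request(base_url, org, bucket, token, line):
    """Assemble URL, headers, and body for one InfluxDB 2.x line-protocol write."""
    query = urllib.parse.urlencode({"org": org, "bucket": bucket, "precision": "ns"})
    url = base_url.rstrip("/") + "/api/v2/write?" + query
    headers = {
        "Authorization": "Token " + token,
        "Content-Type": "text/plain; charset=utf-8",
    }
    return url, headers, line.encode()

# Placeholder credentials, matching the telegraf.conf template above:
url, headers, body = build_write_request(
    "https://influx.local.tld.com", "your business name", "your_bucket",
    "your influxdb-2.0-token", "test_measurement value=1")
print(url)
```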
[[inputs.socket_listener]]
This is the metrics socket that Telegraf will listen on in order to collect the Proxmox dashboard metrics (resources etc.).
[[inputs.smart]]
This is the SMART plugin of Telegraf.
If you have NVMe devices, install nvme-cli:
apt install nvme-cli
nvme list
Otherwise, remove the path_nvme entry.
To allow the Telegraf user to access smartctl, we need to install sudo and add an entry to the sudoers file via visudo.
apt-get install sudo
sudo visudo
Add:
# Cmnd alias specification
Cmnd_Alias SMARTCTL = /usr/sbin/smartctl
telegraf ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session
These instructions come from Telegraf Issue #8690.[2]
You will also need to update the device list to capture SMART data from. I have an LSI MegaRAID MR9260-4i with two RAID 1 arrays, 2x Samsung SSD and 2x WD HDD, that are directly attached to the host.
This information can be shown with (e.g.):
cat /proc/scsi/scsi
> Host: scsi0 Channel: 02 Id: 00 Lun: 00
> Vendor: LSI Model: MR9260-4i Rev: 2.13
> Type: Direct-Access ANSI SCSI revision: 05
> Host: scsi0 Channel: 02 Id: 01 Lun: 00
> Vendor: LSI Model: MR9260-4i Rev: 2.13
> Type: Direct-Access ANSI SCSI revision: 05
Use smartctl to test which settings work for you:
smartctl --scan
> /dev/sda -d scsi # /dev/sda, SCSI device
> /dev/sdb -d scsi # /dev/sdb, SCSI device
> /dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
> /dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
> /dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
> /dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
I ignored /dev/sda and /dev/sdb and only selected megaraid devices 8 to 11.
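Assembling the devices list from the scan output can also be scripted. A small Python sketch (the helper name and the restriction to megaraid entries are my own):

```python
def megaraid_devices(scan_output, ids=None):
    """Turn `smartctl --scan` lines into [[inputs.smart]] device strings,
    keeping only megaraid entries (optionally restricted to specific disk ids)."""
    devices = []
    for line in scan_output.splitlines():
        spec = line.split("#", 1)[0].strip()  # drop the trailing comment
        if "megaraid," not in spec:
            continue
        disk_id = int(spec.rsplit(",", 1)[1])
        if ids is None or disk_id in ids:
            devices.append(spec)
    return devices

# Shortened sample of the scan output shown above:
scan = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
"""
print(megaraid_devices(scan))  # ['/dev/bus/0 -d megaraid,8', '/dev/bus/0 -d megaraid,9']
```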
Test a single device, for example:
smartctl -H /dev/bus/0 -d sat+megaraid,8
These commands vary, depending on the hardware config.
The final commands are then entered into the device list of the [[inputs.smart]] section in telegraf.conf.
devices = [
"/dev/bus/0 -d megaraid,8",
"/dev/bus/0 -d megaraid,9",
"/dev/bus/0 -d megaraid,10",
"/dev/bus/0 -d megaraid,11"]
If you are using an HBA (e.g. for ZFS), you can directly enter paths to the drives.
devices = [
"/dev/sdc --all",
"/dev/sdd --all",
"/dev/sde --all",
"/dev/sdf --all",
"/dev/sdg --all",
"/dev/sdh --all"]
In this case, I prefer to use symlinks from /dev/disk/by-id/ to avoid shifting drive letters.
Example
ls /dev/disk/by-id
devices = [
"/dev/disk/by-id/ata-WDC_WD80EFAX-68KNBN0_VAG9DU9L --all",
"/dev/disk/by-id/ata-WDC_WD80EFBX-68AZZN0_VRHZHPAK --all",
"/dev/disk/by-id/ata-WDC_WD80EFBX-68AZZN0_VRJ4GAHK --all",
"/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_R6GRD6YY --all",
"/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_R6GRS29Y --all",
"/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_R6GX82ZY --all",
"/dev/disk/by-id/ata-WDC_WDS500G1R0A-68A4W0_21270C441210 --all",
"/dev/disk/by-id/ata-WDC_WDS500G1R0A-68A4W0_21270C441916 --all",
"/dev/bus/0 -d sat+megaraid,9 --all",
"/dev/bus/0 -d sat+megaraid,11 --all"
]
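The stability argument for by-id paths is easy to see: each entry is just a symlink to whatever /dev/sdX currently backs that serial number, so resolving it always yields the live device node. A Python sketch using a throwaway symlink as a stand-in for a real by-id entry (the serial in the name is hypothetical):

```python
import os
import tempfile

def resolve_by_id(path):
    """A by-id entry is a symlink; realpath shows which /dev/sdX backs it right now."""
    return os.path.realpath(path)

# Throwaway symlink as a stand-in for a real /dev/disk/by-id entry:
tmp = tempfile.mkdtemp()
target = os.path.join(tmp, "sdc")
open(target, "w").close()
link = os.path.join(tmp, "ata-WDC_EXAMPLE_SERIAL")  # hypothetical serial
os.symlink(target, link)
print(resolve_by_id(link))
```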
Finally, if you want to collect all SMART attributes (e.g. Total_LBAs_Written):
[[inputs.smart]]
attributes = true
[[inputs.sensors]]
In order to monitor sensors, you need lm-sensors.[3] This may already be installed on Proxmox.
apt-get install lm-sensors watch
You may need to run sensors-detect first, to detect available sensors:
sudo sensors-detect
Check sensors with:
watch -n 1 sensors
> nct6776-isa-0a30
> Adapter: ISA adapter
> Vcore: +1.46 V (min = +1.02 V, max = +1.69 V)
> in1: +1.87 V (min = +1.55 V, max = +2.02 V)
> AVCC: +3.39 V (min = +2.98 V, max = +3.63 V)
> +3.3V: +3.38 V (min = +2.98 V, max = +3.63 V)
> in4: +1.50 V (min = +0.97 V, max = +1.65 V)
> in5: +1.28 V (min = +1.07 V, max = +1.39 V)
> in6: +1.46 V (min = +0.89 V, max = +1.23 V) ALARM
> 3VSB: +3.36 V (min = +2.98 V, max = +3.63 V)
> Vbat: +3.15 V (min = +2.70 V, max = +3.63 V)
> fan1: 0 RPM (min = 712 RPM) ALARM
> fan2: 3006 RPM (min = 712 RPM)
> fan3: 898 RPM (min = 712 RPM)
> fan4: 5152 RPM (min = 712 RPM)
> fan5: 5232 RPM (min = 712 RPM)
> SYSTIN: +44.0°C (high = +85.0°C, hyst = +80.0°C) sensor = thermistor
> CPUTIN: +30.0°C (high = +85.0°C, hyst = +80.0°C) sensor = thermistor
> AUXTIN: +2.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
> PECI Agent 0: +0.0°C (high = +80.0°C, hyst = +75.0°C)
> (crit = +100.0°C)
> PCH_CHIP_TEMP: +0.0°C
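If you just want a quick overview of which readings are out of bounds before wiring everything into InfluxDB, the textual sensors output can be filtered for ALARM flags. A minimal Python sketch (the sample lines are shortened from the output above):

```python
def alarm_lines(sensors_output):
    """Return the names of sensor readings flagged ALARM in textual `sensors` output."""
    return [line.split(":")[0].strip()
            for line in sensors_output.splitlines() if "ALARM" in line]

# Shortened sample of the sensors output shown above:
sample = """\
fan1: 0 RPM (min = 712 RPM) ALARM
fan2: 3006 RPM (min = 712 RPM)
in6: +1.46 V (min = +0.89 V, max = +1.23 V) ALARM
"""
print(alarm_lines(sample))  # ['fan1', 'in6']
```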
[[inputs.apcupsd]]
If you have a UPS, such as one from APC, you need to set up the apcupsd daemon first, before receiving metrics with Telegraf.
apt-get update
apt-get install apcupsd
# verify the USB connection
lsusb
Output
> Bus 003 Device 002: ID 8087:8000 Intel Corp.
> Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Bus 001 Device 002: ID 8087:8008 Intel Corp.
> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
> Bus 002 Device 004: ID 0557:2419 ATEN International Co., Ltd
> Bus 002 Device 003: ID 0557:7000 ATEN International Co., Ltd Hub
> Bus 002 Device 002: ID 051d:0003 American Power Conversion UPS <-- this
> Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
nano /etc/apcupsd/apcupsd.conf
Example config
UPSNAME SRT1000XLI
UPSCABLE usb
UPSTYPE usb
DEVICE
POLLTIME 60
Afterwards, restart apcupsd and verify the output:
systemctl restart apcupsd
systemctl status apcupsd.service
/sbin/apcaccess
Then add the corresponding Telegraf plugin for local polling.
[[inputs.apcupsd]]
# A list of running apcupsd servers to connect to.
# If not provided, defaults to tcp://127.0.0.1:3551
servers = ["tcp://127.0.0.1:3551"]
## Timeout for dialing server.
timeout = "5s"
[[inputs.zfs]]
There is a specific Telegraf plugin available for collecting ZFS stats.
ZFS section for telegraf.conf:
[[inputs.zfs]]
## ZFS kstat path. Ignored on FreeBSD
## If not specified, then default is:
# kstatPath = "/proc/spl/kstat/zfs"
## By default, telegraf gathers all zfs stats
## Override the stats list using the kstatMetrics array:
## For FreeBSD, the default is:
# kstatMetrics = ["arcstats", "zfetchstats", "vdev_cache_stats"]
## For Linux, the default is:
# kstatMetrics = ["abdstats", "arcstats", "dnodestats", "dbufcachestats",
# "dmu_tx", "fm", "vdev_mirror_stats", "zfetchstats", "zil"]
## By default, don't gather zpool stats
poolMetrics = true
## By default, don't gather dataset stats
datasetMetrics = true
Test
Test the Telegraf configuration with these commands:
telegraf --debug
sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf --test | grep smart
At this stage, I saw socket connection errors. You can test whether restarting the pvestatd service fixes these:
systemctl restart pvestatd
I still saw socket connection errors in tail --follow /var/log/syslog, but they were gone after a complete reboot of Proxmox.
If you later change telegraf.conf, reload Telegraf to apply the changes:
systemctl reload telegraf
Configure Dashboard
Now it is time to head over to your InfluxDB 2.0 instance.
If you visualize data with Grafana, there is not much to do here. But I found the new 2.0 interface already suited to my needs, without requiring Grafana.
Create a Dashboard and then add Proxmox metrics through the Data Explorer.
- smart_device: S.M.A.R.T. collection data
- sensors: lm-sensors data
- all else: Proxmox metric collection
It requires a bit of time to get used to the syntax, but I did not find it terribly complicated. The metrics from Proxmox are largely cryptic, but make sense after careful investigation.
For example, to show the disk read/write performance for each LXC container, use system > diskread/diskwrite, select the LXCs to monitor, and then choose derivative as the aggregate function to render the increase of disk r/w in separate time buckets.
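The effect of the derivative aggregate can be illustrated outside of Flux: it turns a cumulative counter into a per-interval rate. A Python sketch with made-up diskwrite numbers:

```python
def per_second_rate(samples):
    """Mimic Flux derivative(unit: 1s): per-second increase between
    consecutive (timestamp_seconds, cumulative_value) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates

# Cumulative diskwrite bytes sampled every 10 s (illustrative numbers):
samples = [(0, 1_000_000), (10, 1_500_000), (20, 1_500_000), (30, 2_100_000)]
print(per_second_rate(samples))  # [50000.0, 0.0, 60000.0]
```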
Here is an example (of type "Graph") to monitor HDD temperatures, with the corresponding InfluxDB 2.0 query below.
InfluxDB 2.0 Query
from(bucket: "your_bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_device")
|> filter(fn: (r) => r["_field"] == "temp_c")
|> filter(fn: (r) => r["serial_no"] == "R6GRD6YY" or
r["serial_no"] == "R6GX82ZY" or
r["serial_no"] == "VRHZHPAK" or
r["serial_no"] == "VRJ4GAHK" or
r["serial_no"] == "VAG9DU9L" or
r["serial_no"] == "R6GRS29Y" or
r["serial_no"] == "S3YJNF0JC37927V" or
r["serial_no"] == "S3YJNF0JC31937V" or
r["serial_no"] == "21270C441210" or
r["serial_no"] == "21270C441916")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> yield(name: "mean")
I used the HDD serial numbers, so I can directly identify each physical drive.
More Flux query examples:
Disk Wear (SSD) - extracted from extended SMART attributes (Single Stat)
Notes:
- attributes = true must be set in telegraf.conf
- most of these extended attribute names are vendor-specific; for instance, Samsung Evo SSDs report Total_LBAs_Written, while Western Digital SSDs show Host_Writes_GiB
- it makes sense to convert the values further to TBW; this requires defining a function and providing additional information such as the sector size[4]
- replace the example serial_no values below with your disk serial numbers
- note that these queries can be used directly in Grafana, too
Samsung SSDs:
BYTES_PER_GB=1073741824.0
BYTES_PER_TB=1099511627776.0
LBA_SIZE=512.0
total_lba_written = from(bucket: "your_bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_attribute")
|> filter(fn: (r) => r["_field"] == "raw_value")
|> filter(fn: (r) => r["serial_no"] == "S3YJNF1JC37347V")
|> filter(fn: (r) => r["name"] == "Total_LBAs_Written")
|> keep(columns:["_time", "_value", "serial_no", "model"])
|> last()
|> toFloat()
|> map(fn: (r) => ({
r with
_value: r._value * LBA_SIZE
})
)
|> map(fn: (r) => ({
r with
_value: r._value / BYTES_PER_TB
})
)
|> yield(name: "TBW (from LBAs written)")
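The two map() steps above are plain unit arithmetic; a short Python snippet makes the conversion easy to verify (note that the 2^40 constant technically yields TiB, and the 512-byte sector size is an assumption to be checked against smartctl output for your drive):

```python
LBA_SIZE = 512.0                # assumed sector size; verify with smartctl -a
BYTES_PER_TB = 1099511627776.0  # 2**40, i.e. technically TiB

def lbas_to_tbw(total_lbas_written):
    """Same arithmetic as the two map() steps above: sectors -> bytes -> TB written."""
    return total_lbas_written * LBA_SIZE / BYTES_PER_TB

# e.g. 50 billion 512-byte sectors written:
print(round(lbas_to_tbw(50_000_000_000), 2))  # 23.28
```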
Western Digital SSDs:
host_writes_gib = from(bucket: "your_bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_attribute")
|> filter(fn: (r) => r["_field"] == "raw_value")
|> filter(fn: (r) => r["serial_no"] == "21270C411910")
|> filter(fn: (r) => r["name"] == "Host_Writes_GiB")
|> keep(columns:["_time", "_value", "serial_no", "model"])
|> last()
|> toFloat()
|> map(fn: (r) => ({
r with
_value: r._value * BYTES_PER_GB
})
)
|> map(fn: (r) => ({
r with
_value: r._value / BYTES_PER_TB
})
)
|> yield(name: "TBW (from Host Writes)")
Disk Error Rate (Graph)
read_error_rate = from(bucket: "monkey")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_device")
|> filter(fn: (r) => r["_field"] == "read_error_rate")
|> filter(fn: (r) => r["serial_no"] == "R6GX82ZY" or
r["serial_no"] == "VRHZHPAK" or
r["serial_no"] == "VRJ4GAHK" or
r["serial_no"] == "VAG9DU9L" or
r["serial_no"] == "R6GRS29Y" or
r["serial_no"] == "R6GRS79Y")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> keep(columns:["_time", "_value", "serial_no"])
|> yield(name: "Read Error Rate (HDD)")
seek_error_rate = from(bucket: "monkey")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_device")
|> filter(fn: (r) => r["_field"] == "seek_error_rate")
|> filter(fn: (r) => r["serial_no"] == "R6GX82ZY" or
r["serial_no"] == "VRHZHPAK" or
r["serial_no"] == "VRJ4GAHK" or
r["serial_no"] == "VAG9DU9L" or
r["serial_no"] == "R6GRS29Y" or
r["serial_no"] == "R6GRS79Y")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> keep(columns:["_time", "_value", "serial_no"])
|> yield(name: "Seek Error Rate (HDD)")
udma_crc_errors = from(bucket: "monkey")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_device")
|> filter(fn: (r) => r["_field"] == "udma_crc_errors")
|> filter(fn: (r) => r["serial_no"] == "R6GRL6YY" or
r["serial_no"] == "R6GX84ZY" or
r["serial_no"] == "VRHZHPXK" or
r["serial_no"] == "VRJ0GAHK" or
r["serial_no"] == "VAG9DU9L" or
r["serial_no"] == "R6GRS79Y")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> keep(columns:["_time", "_value", "serial_no"])
|> yield(name: "UDMA CRC Errors")
This is really only basic visualization; anything fancier should be done in Grafana.
A final step would be to configure alerts in InfluxDB 2.0, to get notified when, for example, temperatures exceed a certain threshold, disks fill up, or the RAID health suddenly changes.
Changelog
2022-01-14
- Add additional Flux examples (Disk Error, Disk Wear)
- Add ZFS plugin example
- Add smart extended attribute collection
2022-01-03 Minor Update:
- Updated Telegraf install instructions
- Added example to monitor HDD Temperatures
- Added Telegraf Smart Config for HBA attached SCSI
- Add APC plugin instructions
1. Main source of steps from a blog post from Shift Systems ↩
2. Instructions for updating sudoers from Telegraf Issue #8690 ↩
3. Instructions to install lm-sensors from a Reddit post ↩
4. Convert Total_LBAs_Written to TBW: StackExchange ↩