Proxmox Hypervisor Monitoring with Telegraf and InfluxDB#


hifi


TL;DR This post describes how to install Telegraf on Proxmox to collect sensor readings, SMART data, and metrics in InfluxDB 2.0.


Motivation Critical infrastructure needs monitoring. For the Proxmox hypervisor, I wanted to monitor sensor readings, SMART data, and the standard metrics shown in the dashboard.

InfluxDB is well suited for this purpose. It can be directly connected to Grafana. The proxmox interface already offers the option to connect to a metric server such as InfluxDB. However, it will only send standard metrics that are available in the dashboard.

To include SMART monitoring and sensor readings, Telegraf must be installed on the Proxmox host.

Some instructions for this are available, but I found no single source that covers all required steps.

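This post covers redirecting the Proxmox metric collection, installing Telegraf, configuring the Telegraf plugins (InfluxDB output, socket listener, SMART, and sensors), testing the setup, and building a dashboard in InfluxDB 2.0.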

Not included here is the setup of InfluxDB 2.0 itself. I installed it in a separate LXC container running Debian, following the default instructions from the docs.

Also, I run InfluxDB 2.0 behind an Nginx reverse proxy, which makes the interface available through HTTPS with Let's Encrypt SSL certs on a local subdomain. The instructions below are the same, regardless of whether InfluxDB is available through an IP or a domain name.

For the sake of completeness, here is my Nginx config for InfluxDB 2.0:
server {
        listen 443 ssl http2;
        listen [::]:443 ssl http2;
        server_name influx.local.mytld.com;

        ssl_certificate     /etc/nginx/ssl/wildcard.local.mytld.com.fullchain;
        ssl_certificate_key /etc/nginx/ssl/wildcard.local.mytld.com.key;

        ssl_protocols           TLSv1.2 TLSv1.3;
        ssl_ciphers HIGH:!MEDIUM:!LOW:!aNULL:!NULL:!SHA;
        ssl_prefer_server_ciphers on;
        ssl_session_cache shared:SSL:10m;

        location / {
                proxy_pass http://localhost:8086;
                proxy_redirect off;
                proxy_http_version 1.1;
                proxy_max_temp_file_size 10m;
                proxy_connect_timeout 20;
                proxy_send_timeout 20;
                proxy_read_timeout 20;
                proxy_set_header Host $host;
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection keep-alive;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
                proxy_set_header X-Original-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Original-Proto https;
                proxy_cache_bypass $http_upgrade;
        }

}

When I go to https://influx.local.mytld.com, the InfluxDB 2.0 frontend opens.

Note that all metric collectors must be configured to use port 443 instead of 8086, and they must also have the current SSL certs available.


Redirect metric collection#

The first step is to redirect Proxmox metric collection to a local socket that Telegraf can consume.

The setting file can be found at:

/etc/pve/status.cfg

If you have anything set in the Proxmox web interface under Datacenter > Metric Server, it will be stored in this file.

Edit the file (e.g. nano /etc/pve/status.cfg) and replace its contents with the following lines:

influxdb: InfluxDB
   server 127.0.0.1
   port 8089

You can select any name here for the metric server; I used InfluxDB.

Using these settings, Proxmox will send metrics to UDP port 8089 on localhost, where Telegraf will listen for them once it is configured in the next step.
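Before Telegraf takes over that port, it can be useful to confirm that Proxmox really sends datagrams to it. The following is a minimal Python sketch and not part of the original setup: it binds a UDP socket and returns the lines of the first datagram it receives. The helper names open_listener and read_lines are made up for this illustration, and the sketch assumes that nothing else (such as an already running Telegraf) is bound to port 8089.

```python
import socket


def open_listener(host="127.0.0.1", port=8089, timeout=15.0):
    """Bind a UDP socket on the port that status.cfg points to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)  # give up instead of blocking forever
    sock.bind((host, port))
    return sock


def read_lines(sock, bufsize=65535):
    """Wait for one datagram and return its text lines.

    Raises socket.timeout if nothing arrives within the timeout.
    """
    data, _sender = sock.recvfrom(bufsize)
    return data.decode("utf-8", errors="replace").splitlines()
```

On the Proxmox host, calling `read_lines(open_listener())` should return InfluxDB line protocol once pvestatd has sent its next update. A `socket.timeout` suggests that nothing is arriving on the port.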

Install Telegraf#

I do not like making modifications to the proxmox host for several reasons, but this is unavoidable if you want to directly collect smart data and sensor readings.

wget -qO- https://repos.influxdata.com/influxdb.key | apt-key add -
echo "deb https://repos.influxdata.com/debian buster stable" | tee /etc/apt/sources.list.d/influxdb.list
apt update && apt install telegraf

Configure telegraf plugins#

A sample telegraf.conf is available that contains all plugins.

Make a backup and create a new, empty telegraf.conf.

cp /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.bak
rm /etc/telegraf/telegraf.conf
nano /etc/telegraf/telegraf.conf

Use the following configuration settings as a template.

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]
  urls = ["https://influx.local.tld.com"]
  token = "your influxdb-2.0-token"
  organization = "your business name"
  bucket = "your_bucket"

# Gather metrics from Proxmox based on what is in /etc/pve/status.cfg
[[inputs.socket_listener]]
  service_address = "udp://:8089"

[[inputs.smart]]
    ## Optionally specify the path to the smartctl executable
    path_smartctl = "/usr/sbin/smartctl"
    path_nvme = "/usr/sbin/nvme"
    use_sudo = true
    devices = [ 
        "/dev/bus/0 -d megaraid,8",
        "/dev/bus/0 -d megaraid,9",
        "/dev/bus/0 -d megaraid,10",
        "/dev/bus/0 -d megaraid,11"]

[[inputs.sensors]]
    ## Remove numbers from field names.
    ## If true, a field name like 'temp1_input' will be changed to 'temp_input'.
    # remove_numbers = true

    ## Timeout is the maximum amount of time that the sensors command can run.
    # timeout = "5s"    

[[outputs.influxdb_v2]]#

In InfluxDB 2.0, add a bucket and an organization, then create a token. Telegraf will use this token to authenticate and write metrics.

Replace ["https://influx.local.tld.com"] with your InfluxDB 2.0 domain or IP/Port.
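Before pointing Telegraf at that URL, a quick reachability check from the Proxmox host can save some debugging. The sketch below is an illustration under assumptions, not part of the original setup: it queries the /health endpoint that InfluxDB 2.x exposes and checks for the status "pass". The function names are hypothetical. Because urllib validates the server certificate against the system CA store, a certificate problem behind the reverse proxy should surface here as an exception.

```python
import json
import urllib.request


def health_url(base_url):
    """Build the URL of the InfluxDB 2.x /health endpoint."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url, timeout=5.0):
    """Return True if InfluxDB reports the status "pass".

    Network and TLS errors are not caught, so they stay visible.
    """
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
        return json.load(resp).get("status") == "pass"
```

For example, `is_healthy("https://influx.local.tld.com")` would be expected to return True when the proxy and InfluxDB are both up.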

[[inputs.socket_listener]]#

This is the UDP socket on which Telegraf listens for the metrics that Proxmox sends (the same data shown in the Proxmox dashboard, such as resource usage).

[[inputs.smart]]#

This is Telegraf's SMART plugin.

If you have nvme devices, install nvme-cli:

apt install nvme-cli
nvme list

Otherwise, remove the path_nvme line from the configuration.

To allow the telegraf user to run smartctl, we need to install sudo and add an entry to the sudoers file with visudo.

apt-get install sudo
sudo visudo

Add:

# Cmnd alias specification
Cmnd_Alias SMARTCTL = /usr/sbin/smartctl
telegraf  ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session

These instructions come from Telegraf Issue #8690. [2]

You will also need to update the list of devices to collect SMART data from. I have an LSI MegaRAID MR9260-4i with two RAID 1 arrays, 2x Samsung SSD and 2x WD HDD, that are directly mounted on the host.

This information can be shown with (e.g.):

cat /proc/scsi/scsi

> Host: scsi0 Channel: 02 Id: 00 Lun: 00
>   Vendor: LSI      Model: MR9260-4i        Rev: 2.13
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> Host: scsi0 Channel: 02 Id: 01 Lun: 00
>   Vendor: LSI      Model: MR9260-4i        Rev: 2.13
>   Type:   Direct-Access                    ANSI  SCSI revision: 05

Use smartctl to test which settings work for you:

smartctl --scan

> /dev/sda -d scsi # /dev/sda, SCSI device
> /dev/sdb -d scsi # /dev/sdb, SCSI device
> /dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
> /dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
> /dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
> /dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device

I ignored /dev/sda and /dev/sdb and only selected megaraid devices 8 to 11.

Test sample output:

smartctl -H /dev/bus/0 -d sat+megaraid,8

The exact device strings vary with the hardware configuration.

The final device strings are then entered into the devices list in the [[inputs.smart]] section of telegraf.conf.

devices = [ 
    "/dev/bus/0 -d megaraid,8",
    "/dev/bus/0 -d megaraid,9",
    "/dev/bus/0 -d megaraid,10",
    "/dev/bus/0 -d megaraid,11"]
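Selecting the MegaRAID entries from the scan output by hand works fine for four disks. For larger arrays, the selection can be sketched in Python as below. This is a hedged illustration: the function names are invented, and the filter on the "megaraid" device type is an assumption that matches my hardware but may not match yours.

```python
import json


def devices_from_scan(scan_output, type_prefix="megaraid"):
    """Pick the device strings of one device type from `smartctl --scan` output."""
    devices = []
    for line in scan_output.splitlines():
        # Drop the trailing "# ..." comment that smartctl appends.
        spec = line.split("#", 1)[0].split()
        # Expect exactly: <device path> -d <device type>
        if len(spec) == 3 and spec[1] == "-d" and spec[2].startswith(type_prefix):
            devices.append(" ".join(spec))
    return devices


def as_telegraf_setting(devices):
    """Render the list as a `devices = [...]` line for telegraf.conf."""
    # json.dumps yields double-quoted strings, which are also valid TOML
    # for plain ASCII entries like these.
    return "devices = " + json.dumps(devices)
```

Feeding the scan output shown above into `devices_from_scan` skips /dev/sda and /dev/sdb and keeps only the four megaraid entries.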

[[inputs.sensors]]#

In order to monitor sensors, you need lm-sensors. [3]

This may already be installed on proxmox.

apt-get install lm-sensors
sensors-detect

Check sensors with:

watch -n 1 sensors

> nct6776-isa-0a30
> Adapter: ISA adapter
> Vcore:          +1.46 V  (min =  +1.02 V, max =  +1.69 V)
> in1:            +1.87 V  (min =  +1.55 V, max =  +2.02 V)
> AVCC:           +3.39 V  (min =  +2.98 V, max =  +3.63 V)
> +3.3V:          +3.38 V  (min =  +2.98 V, max =  +3.63 V)
> in4:            +1.50 V  (min =  +0.97 V, max =  +1.65 V)
> in5:            +1.28 V  (min =  +1.07 V, max =  +1.39 V)
> in6:            +1.46 V  (min =  +0.89 V, max =  +1.23 V)  ALARM
> 3VSB:           +3.36 V  (min =  +2.98 V, max =  +3.63 V)
> Vbat:           +3.15 V  (min =  +2.70 V, max =  +3.63 V)
> fan1:             0 RPM  (min =  712 RPM)  ALARM
> fan2:          3006 RPM  (min =  712 RPM)
> fan3:           898 RPM  (min =  712 RPM)
> fan4:          5152 RPM  (min =  712 RPM)
> fan5:          5232 RPM  (min =  712 RPM)
> SYSTIN:         +44.0°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
> CPUTIN:         +30.0°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
> AUXTIN:          +2.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
> PECI Agent 0:    +0.0°C  (high = +80.0°C, hyst = +75.0°C)
>                          (crit = +100.0°C)
> PCH_CHIP_TEMP:   +0.0°C
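In the sample output, in6 and fan1 are flagged with ALARM. Such flags can come from an unconnected fan header or from default limits that do not fit the board, so it is worth knowing which readings are affected before building alerts on them. A small Python sketch, with a hypothetical function name and assuming the plain text format shown above, could list them:

```python
def alarmed_sensors(sensors_output):
    """Return the labels of all readings that `sensors` flags with ALARM."""
    labels = []
    for line in sensors_output.splitlines():
        # Readings look like "<label>: <value> (...)  ALARM".
        label, sep, rest = line.partition(":")
        if sep and "ALARM" in rest:
            labels.append(label.strip())
    return labels
```

The input would be the captured output of the sensors command, for example via `subprocess.run(["sensors"], capture_output=True, text=True).stdout`.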

Test#

Test the Telegraf configuration with these commands:

telegraf --debug
sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf  --test | grep smart

At this stage, I saw socket connection errors.

You can test if restarting the pvestatd service fixes these.

systemctl restart pvestatd

I still saw socket connection errors in tail --follow /var/log/syslog, but they were gone after a complete reboot of Proxmox.

Configure Dashboard#

Now it is time to head over to your InfluxDB 2.0.

If you visualize data with Grafana, there is not much to do here. But I found the new 2.0 interface already suited to my needs, without requiring Grafana.

Create a Dashboard and then add Proxmox metrics through the Data Explorer.

It requires a bit of time to get used to the syntax, but I did not find this terribly complicated. The metrics from Proxmox are largely cryptic, but make sense after careful investigation.

For example, to show the disk read/write performance for each LXC container, select system > diskread/diskwrite, choose the LXCs to monitor, and then select derivative as the aggregate function. This renders the increase in disk reads and writes per time bucket.
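The same selection can also be written as a Flux query in the script editor. The Python helper below only assembles such a query string. It is a sketch under assumptions: the function name is invented, the measurement and field names are taken from the selection path described above, and nonNegative: true is my own choice to hide counter resets, not necessarily what the query builder generates.

```python
def disk_io_query(bucket, fields=("diskread", "diskwrite"), start="-1h"):
    """Assemble a Flux query for the per-second rate of the disk I/O counters."""
    field_filter = " or ".join(f'r._field == "{name}"' for name in fields)
    return "\n".join([
        f'from(bucket: "{bucket}")',
        f"  |> range(start: {start})",
        '  |> filter(fn: (r) => r._measurement == "system")',
        f"  |> filter(fn: (r) => {field_filter})",
        # derivative() turns the growing counters into a rate per unit.
        "  |> derivative(unit: 1s, nonNegative: true)",
    ])
```

Printing `disk_io_query("your_bucket")` gives a query that can be pasted into the Data Explorer. Restricting it to individual containers would need an additional filter on whichever tag identifies them in your data.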

This is really only basic visualization; anything fancier should be done in Grafana.

A final step would be to configure alerts in InfluxDB 2.0, to get notified when (e.g.) temperatures exceed a certain threshold, disks fill up, or the RAID health suddenly changes.


  1. Main source of steps from a blog post from Shift systems 

  2. Instructions for updating sudoers from Telegraf Issue #8690 

  3. Instructions to install lm-sensors from a Reddit post