Improving My Monitoring and Alerting


Shortly after getting my basic alerting script marvin written, I got the feeling that my monitorix/marvin system was a bit too hacky for my liking and decided to upgrade to a more proper ecosystem. After some research, I settled on using Prometheus as my metrics aggregator, Grafana for visualizations, and integrating Prometheus’ Alertmanager with a webhook configuration to report alerts to my Matrix rooms. The work spanned about a month of on-and-off focus, so I wanted to get some documentation written on what was involved in this setup before I forget too much.

All of this was implemented to target services running on a single Arch Linux server, so YMMV if you are using this as reference for your own setup. I’ll opt for Arch and AUR packages over compiled binaries and containers wherever possible, as they are easier to keep in sync with a bleeding-edge system, but you should follow the official documentation for installing these packages if you are running a different operating system.

  1. Prometheus
  2. Prometheus Node Exporter
  3. systemd Metrics
  4. nginx VTS Metrics
  5. Grafana
  6. worldPing
  7. Alerting to Matrix Rooms

Prometheus

Prometheus appeared to be the best monitoring toolkit for my use case, largely because it is open source, has no commercial offerings, and includes first-class alerting integration. They have a great write-up in their docs comparing their platform to alternative solutions that helped sell me on using it. Ultimately, what makes me happy with this choice is Prometheus’ widespread use and the ease of creating exporters – for those reasons, I was able to get metrics on everything I wanted to track with very little friction.

There is an Arch package for Prometheus that you should be able to install and get running without issue. Install it with pacman -S prometheus, and start/enable the daemon with systemctl enable prometheus && systemctl start prometheus. At this point you should be able to see the Prometheus dashboard in a web browser by pointing to port 9090 of your server. However, to get anything useful out of Prometheus you will need to set up exporters for Prometheus to scrape and aggregate metrics from.
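
For reference, the whole installation boils down to a few commands (run as root or with sudo):

# Install Prometheus from the official repositories
pacman -S prometheus

# Enable the daemon at boot and start it now
systemctl enable prometheus
systemctl start prometheus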

Prometheus Node Exporter

The easiest way to get system metrics into your Prometheus instance is through their node exporter. Once again, there is an Arch package for this that can be installed by running pacman -S prometheus-node-exporter. As with the prometheus package, you’ll want to enable and start the daemon after installing.

At this point you should be able to see your machine’s metrics being exported by querying port 9100 on your server. Note that Prometheus is designed to scrape metrics over HTTP, so the easiest way to check that your exporters are working is by pointing a web browser to the appropriate port.
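
As a quick sanity check from the command line, you can also curl the exporter directly (the node exporter serves its metrics under /metrics by default):

# Fetch the first few metrics exposed by the node exporter
curl -s localhost:9100/metrics | head -n 20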

Now that your node exporter is running, you need to tell Prometheus where to scrape these metrics from. On Arch, the Prometheus configuration file is located at /etc/prometheus/prometheus.yml by default. Open that file up in your text editor of choice, and look for the scrape_configs entry. After a fresh installation, you should see a single entry for the prometheus job. You will want to add a job for your node, so that your new scrape_configs will look something like the below snippet.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'localhost'
    static_configs:
      - targets: ['localhost:9100']

I chose to name the job localhost based on the example I saw on the Arch wiki, but you can choose something more descriptive of your system if you wish. If you are setting up multiple nodes to scrape, you will want to describe the nodes more explicitly in your job naming scheme.

Once your Prometheus configuration file is updated, you will need to restart the service by running systemctl restart prometheus. You should now be able to write PromQL queries against your node from the Prometheus dashboard (exposed at port 9090 by default). As a test, go to that dashboard, enter node_memory_MemAvailable_bytes, and execute the query; you should see a graph of your system’s available memory. The Prometheus dashboard has great auto-completion in the query textbox, so try typing node_ and look through the metrics from your node exporter to get an idea of what is now available to you.
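
Raw byte counts are not always the easiest thing to reason about, so it is worth combining metrics into more useful expressions. As a quick sketch, the query below turns the memory metrics into a used-memory percentage (the same expression shows up again in the alerting rules later in this post):

((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes) * 100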

systemd Metrics

If you wish to monitor systemd service statuses, you can enable collection of these metrics through your Prometheus node exporter. There might be other ways to accomplish this, but I chose to add the systemd collector configuration through command line flags in the service’s unit file.

To make edits to the package’s systemd unit file, you will need to invoke systemctl edit prometheus-node-exporter. This should open up a text editor with a blank file where you can write in your changes to the unit file. At a minimum, you will need to update the ExecStart line to include the --collector.systemd flag. In my case, the package’s default unit file had the following ExecStart call:

ExecStart=/usr/bin/prometheus-node-exporter $NODE_EXPORTER_ARGS

So I needed to write the following in my override file:

[Service]
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter --collector.systemd $NODE_EXPORTER_ARGS

That empty second line is not accidental; you must include the blank ExecStart= to clear the packaged command before your override will take effect. If you wish, you can also monitor specific services instead of all of your system’s services by specifying a whitelist.

ExecStart=/usr/bin/prometheus-node-exporter --collector.systemd --collector.systemd.unit-whitelist="(nginx|sshd|etc).service" $NODE_EXPORTER_ARGS

After making your override file, a call to systemctl restart prometheus-node-exporter should allow you to write PromQL queries against systemd services.
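
For example, the following query (which the alerting rules later in this post build on) returns 1 while the nginx unit is active and 0 otherwise:

node_systemd_unit_state{name="nginx.service",state="active"}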

nginx VTS Metrics

I use nginx as a reverse proxy for every public-facing site on my server, so tracking the traffic being routed to each site was one of the most important metrics I wanted to visualize. Luckily, nginx-module-vts handles everything needed to get these metrics – from monitoring the virtual host traffic to providing an exporter for Prometheus to scrape. There is a lot of support online for nginx-vts-exporter as a means of exporting virtual host traffic from nginx-module-vts to Prometheus, but I found the included exporter to be more than enough for what I wanted to accomplish.

There is an AUR package for compiling nginx-module-vts for the nginx-mainline Arch package that, at the time of writing, works wonderfully. As with any AUR package, you should take a look at the included PKGBUILD file before installing. Once you are familiar with what it does, clone the repository and install the package with makepkg -sci (or use whatever method you normally use to install AUR packages).

After a successful installation, you should be able to find the module at /usr/lib/nginx/modules/ngx_http_vhost_traffic_status_module.so. Make sure that file exists, then open up your nginx configuration file (likely at /etc/nginx/nginx.conf). Adding the below changes to your nginx configuration should handle loading of the dynamic module and exposing localhost:8080/status for viewing nginx virtual traffic statistics.

load_module "/usr/lib/nginx/modules/ngx_http_vhost_traffic_status_module.so";
...
http {
  ...
  vhost_traffic_status_zone;
  ...
  server {
    server_name 127.0.0.1;
    location /status {
      vhost_traffic_status_display;
      vhost_traffic_status_display_format html;
      allow 127.0.0.1;
      deny all;
    }
    listen 127.0.0.1:8080;
  }
}

Be sure to change the allow directive or add additional addresses to suit your needs. In my case, the statistics are only going to be accessible by the local machine for Prometheus scraping, so this example only exposes the statistics to that machine. You will want to be as restrictive as possible here to avoid accidentally exposing your metrics to anyone who shouldn’t see them.

At this point, you will need to restart your nginx daemon to start serving your new status server. Before restarting the service, run nginx -t to test your configuration and make sure the dynamic module loads correctly. Once nginx is back up, you should be able to get the VTS traffic status response from your server; try running curl localhost:8080/status/format/prometheus and make sure you get an appropriate response.
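
Putting those steps together, the check-and-restart sequence looks something like this:

# Validate the new configuration, including the dynamic module load
nginx -t

# Restart nginx to begin serving the status endpoint
systemctl restart nginx

# Confirm the exporter output that Prometheus will scrape
curl localhost:8080/status/format/prometheus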

After you are confident that your newly configured vhost traffic exporter is working correctly, you will need to configure Prometheus to scrape the metrics. Open up your Prometheus configuration file at /etc/prometheus/prometheus.yml and add the following entry to your scrape_configs.

scrape_configs:
  - job_name: 'nginx-vts'
    metrics_path: '/status/format/prometheus'
    static_configs:
      - targets: ['localhost:8080']

After restarting your prometheus daemon, you should now be able to write PromQL queries from your Prometheus dashboard against your new metrics. Autocomplete is again your friend here for seeing what metrics are available to you. Try running nginx_vts_server_requests_total{code="2xx"} to see all the 200 range responses from your server by hostname, or start typing nginx_ to get a list of suggestions.
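
Counters like nginx_vts_server_requests_total are most useful as rates. As a rough sketch, a query along these lines shows the per-second rate of successful requests over the last five minutes, grouped by virtual host (the host label is what my version of the module exposes; yours may differ slightly):

sum by (host) (rate(nginx_vts_server_requests_total{code="2xx"}[5m]))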

Grafana

Grafana provides a great interface for visualizing Prometheus metrics and integrates easily with the Prometheus setup configured above. On Arch, Grafana can be installed through the grafana package, and it will start serving its web interface on port 3000 after you start and enable the grafana service. Once you get the service up and running, you’ll need to log in with the user admin and password admin, after which you’ll be prompted to set up a proper account.
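
The installation mirrors the Prometheus setup:

# Install Grafana from the official repositories
pacman -S grafana

# Enable the service at boot and start serving the UI on port 3000
systemctl enable grafana
systemctl start grafana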

Grafana is interacted with almost entirely through the web UI, and I find it intuitive enough that I don’t feel it is necessary to document how to use it. I do, however, want to mention a few things that I found useful in getting started with Grafana.

Start with Existing Dashboards

If you don’t want to spend a ton of time getting your basic visualizations set up, I’d recommend starting with dashboards that other people have created and shared online. I started off with a dashboard created to visualize Prometheus node exporter metrics and another for nginx-mod-vts metrics. Once I had them imported, I deleted or tweaked whatever didn’t work and restructured the rest into what I felt was necessary and helpful.

Test Queries on the Prometheus Dashboard

I find it easier to prototype my PromQL queries in the Prometheus dashboard before setting them up in Grafana. Being able to see the actual results of a query is helpful for understanding how it will be visualized over time.

Rely on the Editor

Just like with the Prometheus dashboard, Grafana will provide helpful suggestions as you type in your queries (maybe even better suggestions than the Prometheus dashboard). You should also preview your graphs frequently, and let Grafana tell you when something is wrong. The graph editor is good at telling you when your query is broken, so listen to it.

worldPing

There were a few ways to monitor sites and alert on downtime that I considered during this transition. My previous solution was marvin, but I didn’t feel like maintaining him as a one-off script. There is also prometheus-blackbox-exporter, which can monitor sites and alert on them for you and is a good step up from marvin’s capabilities.

I ended up settling on worldPing, a Grafana plugin that monitors your sites over HTTP/S and DNS, and by sending ICMP Echo packets to your domains. They provide 1 million requests per month in their free tier, which is enough to check HTTPS and DNS and send Echo packets from a few different locations every couple of minutes for the production domains I manage. This also has the benefit of not requiring me to manage another server to alert me of my central server’s outages, which I would have had to do with the other solutions.

Grafana provides a pretty neat plugin management CLI – you should be able to install worldPing by running grafana-cli plugins install raintank-worldping-app and restarting the grafana service. After installing, you should have a new set of worldPing dashboards available through the Grafana UI that you can configure to your liking. I set mine up to monitor my two production domains in a way that stays under their free tier limit, and to email me about any outages.
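
In terms of commands, that amounts to:

# Install the worldPing plugin through Grafana's plugin CLI
grafana-cli plugins install raintank-worldping-app

# Restart Grafana so it picks up the new plugin
systemctl restart grafana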


Alerting to Matrix Rooms

If you are as cheap and paranoid as I am, you may be running a Matrix/synapse node with rooms that you would like to send alerts to. Prometheus’ Alertmanager tool provides an interface to alert on the values of PromQL queries and send alerts as JSON to a custom webhook, so I went ahead and set up a few alerts to cover the bases that my previous monitoring tool covered: daemons in a failed state and CPU/RAM usage above a threshold.

To get started, you’ll need to install the alertmanager package with pacman, which should give you a file to configure at /etc/alertmanager/alertmanager.yml. I currently have this configured to send requests to a webhook I manage at http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U. This address will be explained in further detail later, but assuming you are following this document, you will want a configuration that looks like the below. If you are not setting up Alertmanager to send alerts to a custom webhook, refer to their documentation and tailor the example to your use case.

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U'
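
If your installation includes Alertmanager’s amtool utility, you can verify that the file parses before starting the service:

# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml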

Now we will need to get some actual alerts configured. I created rules to alert if CPU/RAM usage is above a 90% threshold for 2 minutes, and if any of the daemons I set up to track are in a failed state for 30 seconds. I put these into a file at /etc/prometheus/alerting_rules.yml:

groups:
- name: systemd
  rules:
  - alert: Nginx Inactive
    expr: node_systemd_unit_state{name="nginx.service",state="active"} != 1
    for: 30s
    labels:
      severity: critical
    annotations:
      description: Nginx service is inactive
  
  - alert: SSH Inactive
    expr: node_systemd_unit_state{name="sshd.service",state="active"} != 1
    for: 30s 
    labels:
      severity: critical
    annotations:
      description: SSH service is inactive
  
  - alert: Plex Inactive
    expr: node_systemd_unit_state{name="plexmediaserver.service",state="active"} != 1
    for: 30s
    labels:
      severity: critical
    annotations:
      description: Plex media server is inactive

- name: resources 
  rules:
  - alert: CPU Resources Above Threshold
    expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Average CPU usage is exceeding 90%
  
  - alert: Memory Resources Above Threshold
    expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Memory usage is exceeding 90%

I then referenced this rules file in my Prometheus configuration at /etc/prometheus/prometheus.yml and pointed Prometheus at the local Alertmanager instance.

rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - localhost:9093

With those configurations in place, you should be able to start/enable the alertmanager service, restart the prometheus service, and then be ready to start alerting to your webhook at http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U. The problem is that there is nothing listening at that address yet.
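
That boils down to:

# Enable and start Alertmanager, then reload the updated Prometheus configuration
systemctl enable alertmanager
systemctl start alertmanager
systemctl restart prometheus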

To manage alerting to my Matrix rooms, I decided to use their Go-NEB bot, a newer iteration of their Matrix-NEB bot written in Go instead of Python. I already had a script in place that would have worked with a little tweaking, but I decided to use Go-NEB for its Giphy and RSS services, which I did not want to write myself. It also seems easily extensible if I wish to add any extra services in the future.

There are a few ways to run this bot – I ran into issues getting the services to handle requests correctly through their containerized solution, so I ultimately set up the bot to bootstrap its configuration through a YAML file and manage it with a systemd unit file. They provide a sample configuration file in their repository, but there were a couple of typos that gave me trouble (and they haven’t merged my PR to correct them yet), so I’ll include a redacted version of my configuration for people to reference.

clients:
  - UserID: "@marvin:matrix.devinadooley.com"
    AccessToken: "MARVIN_USER_ACCESS_TOKEN"
    HomeserverURL: "https://matrix.devinadooley.com"
    Sync: true
    AutoJoinRooms: true
    DisplayName: "Marvin"

services:
  - ID: "alertmanager_service"
    Type: "alertmanager"
    UserID: "@marvin:matrix.devinadooley.com"
    Config:
      webhook_url: "http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U"
      rooms:
        "!ROOM_ID:matrix.devinadooley.com":
          text_template: "{{range .Alerts -}} [{{ .Status }}] {{index .Labels \"alertname\" }}: {{index .Annotations \"description\"}} {{ end -}}"
          html_template: "{{range .Alerts -}}  {{ $severity := index .Labels \"severity\" }}    {{ if eq .Status \"firing\" }}      {{ if eq $severity \"critical\"}}        <font color='red'><b>[FIRING - CRITICAL]</b></font>      {{ else if eq $severity \"warning\"}}        <font color='orange'><b>[FIRING - WARNING]</b></font>      {{ else }}        <b>[FIRING - {{ $severity }}]</b>      {{ end }}    {{ else }}      <font color='green'><b>[RESOLVED]</b></font>    {{ end }}  {{ index .Labels \"alertname\"}} : {{ index .Annotations \"description\"}}   <a href=\"{{ .GeneratorURL }}\">source</a><br/>{{end -}}"
          msg_type: "m.text"

This is likely a bit overwhelming, so let me break down what this configuration specifies to the program.

When Go-NEB is given a configuration file, it parses it and loads all of the specified user credentials and services into an in-memory SQLite database (rather than creating the persistent SQLite database that it manages over its JSON HTTP API when run without a configuration file). This configuration tells Go-NEB to create a service called alertmanager_service of type alertmanager and to forward alerts sent to the correct URL through the @marvin user to the given room.

The specified webhook_url ends in the base64 encoding of the service ID – that is, the encoding of the string “alertmanager_service”. According to the comments in the sample configuration, the webhook_url is informational and does not change your actual configuration; I believe all services are served under http://BASE_URL/services/hooks/$ENCODED_SERVICE_ID. That is why we configured our Alertmanager instance to send alerts to this URL earlier.
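
If you need to derive the path for a different service ID, standard base64 of the ID gets you there (note that the trailing = padding does not appear in the hook URL):

# Encode the service ID used in the webhook path
echo -n 'alertmanager_service' | base64
# prints: YWxlcnRtYW5hZ2VyX3NlcnZpY2U=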

Getting a user access token can be tricky. I found the token through the Riot.im webapp interface under the user settings, but found that it would not persist as long as I needed. It turns out these tokens are invalidated on logout through the Riot client, so for the access token to persist you need to log in, retrieve the access token corresponding to that login session, then close your browser without logging out. This means that any future login under that user will invalidate the token, so make sure to configure as much as you need to while you are logged into that session.
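
Alternatively, the Matrix client-server API lets you request a token directly from your homeserver, which avoids the browser dance entirely. A sketch of that request (the password placeholder is yours to fill in; the homeserver URL matches the one in the Go-NEB configuration above):

# Log in as the bot user; the access_token field of the JSON response is what Go-NEB needs
curl -X POST https://matrix.devinadooley.com/_matrix/client/r0/login \
  -d '{"type": "m.login.password", "user": "marvin", "password": "MARVIN_PASSWORD"}'

A token obtained this way belongs to its own login session, so it should stick around for as long as you never log that session out.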

After configuring the user, you need to update the ROOM_ID value (it can be retrieved through the Riot interface under the room settings) and adjust the text_template and html_template (they use Go’s template syntax) to suit your needs.

You can run this bot any way you wish, though I do so through systemd. I have the following unit file written at /etc/systemd/system/go-neb.service, which runs the bot whose code is located at /var/automation/go-neb. This also assumes you have compiled the go-neb binary in the WorkingDirectory.

[Unit]
Description=Go-NEB Matrix Bot
After=network.target

[Service]
Type=simple
Environment="BIND_ADDRESS=:4050"
Environment="DATABASE_TYPE=sqlite3"
Environment="BASE_URL=https://localhost:4050"
Environment="CONFIG_FILE=config.yaml"
ExecStart=/var/automation/go-neb/go-neb
WorkingDirectory=/var/automation/go-neb
Restart=on-failure

[Install]
WantedBy=default.target

At this point, you should be able to test your alerting easily by bringing down one of your tracked services.
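
For example, stopping one of the tracked units should produce a critical alert in your Matrix room within a couple of minutes (given the 30 second for duration plus the scrape, evaluation, and group_wait delays), followed by a resolved message once the unit comes back up:

# Trigger the Plex Inactive alert, wait for it to fire and be delivered, then clear it
systemctl stop plexmediaserver
sleep 180
systemctl start plexmediaserver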