Marvin


I recently set up a Matrix server to give my friends and me a place to collaborate on work. I quickly realized it is the perfect place to output reports and alerts about my services, so I wrote a Python script to report on site outages, excessive server load, and systemd services that are in a failed state.

I like to personify bots as much as possible, so I named it Marvin after the paranoid android in The Hitchhiker’s Guide to the Galaxy series. He’s not the most efficient or well-structured bot, but he does a great job at the few things I’ve tasked him with so far. All of the code is public on my GitHub, but I’ll go over some of the trickier parts of the service in this post to explain my reasoning and point out areas for improvement for people who wish to implement something similar.

I wrote Marvin in Python because the only libraries available for communicating with Matrix rooms were for JavaScript and Python, and I didn’t want to waste time formatting requests to the HTTP API by hand. It was a nice change of pace from all the Go code I’ve been writing lately. I’m still productive in Python, but I’m thankful that I don’t have to deal with it too much. Even though I mainly write JavaScript and Ruby at work these days, the number of strange conventions in Python can still give me a headache, and I prefer a better type system.

If you wish to jump to the section relevant to your issue or use case, feel free. I’ve tried to format the code so that each snippet can be copied independently of the snippets in other sections.

  1. Communicating With a Matrix Server
  2. Managing Environment
  3. Scheduling Jobs
  4. Monitoring Site Outages
  5. Monitoring CPU and RAM Usage
  6. Monitoring Systemd Services
  7. Daemonization
  8. Closing Thoughts

Communicating With a Matrix Server

Matrix provides a simple interface to send messages to rooms. I created a user for Marvin manually through my Riot desktop client, and created a room for him to post messages into. The relevant code to report to a list of rooms is simple enough that I don’t think I will need to explain it too much.

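from matrix_client.client import MatrixClient

# HOST, USERNAME, PASSWORD and ROOMS are loaded from the environment file
client = MatrixClient(HOST)
token = client.login(username=USERNAME, password=PASSWORD)
rooms = [client.join_room(room) for room in ROOMS]


def message_rooms(message):
    # Send the same plain-text message to every joined room
    for room in rooms:
        room.send_text(message)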

The capitalized variables are the ones I load from the environment file. This message_rooms function is what I use to send all of the error strings to the server; the scheduled jobs format their messages and then call it.
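If plain text turns out to be too limiting, a variation along these lines may be worth trying. It is only a sketch: message_rooms_html is a hypothetical helper that is not part of Marvin, and it assumes the room objects expose the send_html method that matrix_client provides on Room. I have not confirmed how every client renders the result.

```python
import html


def message_rooms_html(rooms, title, details):
    """Send a bold title plus preformatted details to each room.

    `rooms` is assumed to be a list of matrix_client Room objects, or
    anything else that exposes send_html(html, body=...).
    """
    # Escape the text so characters like < and > cannot break the markup
    formatted = "<strong>{}</strong><pre>{}</pre>".format(
        html.escape(title), html.escape(details)
    )
    # Plain-text fallback for clients that do not render HTML
    fallback = title + "\n" + details
    for room in rooms:
        room.send_html(formatted, body=fallback)
    return formatted
```

Passing the list of rooms in explicitly, rather than reading a module-level variable, makes the helper easy to try out with stand-in objects before pointing it at a real server.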

Managing Environment

I found the python-dotenv library to be a good way to manage environment variables. I keep an example .env file in the repository so I don’t forget the format.

USERNAME=user
PASSWORD=pass
HOST=https://example.matrix.org  # Can also be a local address like http://localhost:8008
ROOMS=["#room-alias:host.com"]   # A JSON list of room aliases to post alerts in
SITES=["https://site1.com", "https://site2.org"] # Sites to request (urllib needs the scheme)
SERVICES=["sshd", "nginx"]       # Systemd services to monitor
CPU_PERCENT_THRESHOLD=90
MEM_PERCENT_THRESHOLD=90
TIMEOUT_THRESHOLD=5              # Seconds until a site request times out and reports failure

From the file that starts up the service, I load all the environment variables. I had to rely on the json package to parse the arrays from the environment file into lists.

import json
import os
from dotenv import load_dotenv

load_dotenv()
USERNAME = os.getenv("USERNAME")
PASSWORD = os.getenv("PASSWORD")
HOST = os.getenv("HOST")
ROOMS = json.loads(os.getenv("ROOMS"))
SITES = json.loads(os.getenv("SITES"))
SERVICES = json.loads(os.getenv("SERVICES"))
CPU_PERCENT_THRESHOLD = float(os.getenv("CPU_PERCENT_THRESHOLD"))
MEM_PERCENT_THRESHOLD = float(os.getenv("MEM_PERCENT_THRESHOLD"))
TIMEOUT_THRESHOLD = float(os.getenv("TIMEOUT_THRESHOLD"))
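One rough edge in the loading code is that os.getenv returns None for a missing variable, and json.loads(None) then fails with a TypeError that does not name the variable. The sketch below shows one way to get a clearer error. The load_json_list helper is hypothetical and not part of Marvin.

```python
import json
import os


def load_json_list(name):
    """Read a JSON list from the environment variable `name`."""
    raw = os.getenv(name)
    if raw is None:
        raise RuntimeError("Missing environment variable: " + name)
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as e:
        raise RuntimeError(name + " is not valid JSON: " + str(e))
    if not isinstance(value, list):
        raise RuntimeError(name + " must be a JSON list")
    return value
```

With this in place, a line such as ROOMS = load_json_list("ROOMS") would fail at startup with a message that points at the misconfigured variable.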

The last step that I consider essential to setting up a Python development environment is scripting the virtual environment usage into the Makefile.

venv: venv/bin/activate
venv/bin/activate: requirements.txt
	test -d venv || python -m venv venv
	venv/bin/pip install -r requirements.txt
	touch venv/bin/activate

start: venv
	venv/bin/python marvin.py

If I wish to include more packages in the environment, I activate the virtual environment manually with . venv/bin/activate, pip install the package I need, and update the requirements.txt file with pip freeze > requirements.txt.

Scheduling Jobs

The job scheduling is one of the hackier parts of the code. So far, it works well enough that I haven’t gone looking for anything better. I’ve created a wrapper function for each of the jobs, and I have them scheduled at intervals that I found appropriate after a bit of experimentation. Each job runs in its own thread, since the jobs can overlap in time. A loop calls schedule.run_pending() to start any jobs that are due, then sleeps for a second with time.sleep(1).

import schedule
import threading
import time

# monitor_sites, monitor_services and monitor_system are provided later
def run_threaded(job):
    # Run each job in its own thread so a slow job does not delay the others
    job_thread = threading.Thread(target=job)
    job_thread.start()

schedule.every(1).minutes.do(run_threaded, monitor_sites)
schedule.every(15).seconds.do(run_threaded, monitor_services)
schedule.every(15).seconds.do(run_threaded, monitor_system)

while True:
    schedule.run_pending()
    time.sleep(1)
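Because every run gets a fresh thread, a job that takes longer than its interval could end up running twice at once. The sketch below shows one way to guard against that by skipping a run while the previous one is still alive. The run_threaded_once name and the module-level bookkeeping are my own assumptions here and are not part of Marvin.

```python
import threading

_running = {}  # job name -> thread from the most recent run
_running_lock = threading.Lock()


def run_threaded_once(job):
    """Start `job` in a thread unless its previous run is still alive.

    Returns the new thread, or None when the run was skipped.
    """
    with _running_lock:
        previous = _running.get(job.__name__)
        if previous is not None and previous.is_alive():
            return None
        # Daemon threads do not keep the process alive on shutdown
        thread = threading.Thread(target=job, daemon=True)
        _running[job.__name__] = thread
        thread.start()
        return thread
```

It would slot into the schedule the same way as the original wrapper, for example schedule.every(1).minutes.do(run_threaded_once, monitor_sites).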

Monitoring Site Outages

The requests library that I enjoy using was giving me some trouble when making subsequent HTTPS requests from the server through DNS and back to the server. As is usually the case, it worked fine from my machine, but on the server itself it was causing timeouts. I attempted to manage connections better, but ultimately ended up switching over to the urllib library, which is a shame as requests gives much better error handling around SSL issues.

The code to monitor the sites makes HTTP requests to the sites configured in the .env file. Any non-200 response will cause Marvin to queue up an error message. I handle the exception classes from the urllib.error module separately to try and give as much detail as possible in his chat message.

Each type of error has a generic header that is specific to the site, and all error messages are stored in a dictionary. The structure of the sites dictionary is:

{
  URL: {
    error_key: error_message
  }
}

This way, for each URL we can account for many types of errors without always causing a change that will trigger Marvin to send a message. As long as we only check for changes in the error_key values for each URL, we can keep relevant error_message values logging to the journal, but not spam the chat messages.
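To make the idea concrete, here is a small sketch of the comparison for a single URL. The errors_changed helper and the sample values are only illustrative.

```python
def errors_changed(previous, latest):
    """Return True when the set of error keys differs between two checks."""
    return set(previous) != set(latest)


before = {"Error thrown connecting to https://site1.com": "Reason: timed out"}
# Same kind of error with different details: log it, but stay quiet in chat
same_kind = {"Error thrown connecting to https://site1.com": "Reason: connection refused"}
# Errors cleared: this is a change, so the recovery gets announced
resolved = {}

print(errors_changed(before, same_kind))  # False
print(errors_changed(before, resolved))   # True
```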

import copy
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("marvin")

# sites should be a dictionary of the form {url: {error_key: error_message}}
#   error_keys are strings indicating errors encountered, and
#   error_messages provide details on the most recent error.
#   All errors are cleared on a successful request.
def check_sites(sites, timeout_threshold):
    latest_sites = copy.deepcopy(sites)
    for site in latest_sites.keys():
        try:
            # Pass the timeout so a hanging site is reported instead of blocking the job
            with urllib.request.urlopen(site, timeout=timeout_threshold) as r:
                if r.getcode() != 200:
                    error_key = site + ' returned non-200 status code'
                    error_message = 'Status code: ' + str(r.getcode())
                    latest_sites[site][error_key] = error_message
                    logger.error(latest_sites[site])
                else:
                    latest_sites[site] = {}

        # HTTPError is a subclass of URLError, so it has to be caught first
        except urllib.error.HTTPError as e:
            error_key = 'Error thrown connecting to ' + site
            error_message = 'Status code: ' + str(e.code) + '\n'
            error_message += 'Headers: ' + str(e.headers) + '\n'
            error_message += 'Reason: ' + str(e.reason)
            latest_sites[site][error_key] = error_message
            logger.exception(str(e))

        except urllib.error.URLError as e:
            error_key = 'Error thrown connecting to ' + site
            # e.reason can be an exception object rather than a string
            error_message = 'Reason: ' + str(e.reason)
            latest_sites[site][error_key] = error_message
            logger.exception(str(e))

        except Exception as e:
            error_key = 'Encountered an unexpected error connecting to ' + site
            error_message = str(e)
            latest_sites[site][error_key] = error_message
            logger.exception(str(e))

    return latest_sites

From my main service file, marvin.py, I detect changes in the dictionary that check_sites returns. If a change is detected, Marvin outputs a message with the latest error_message for each error_key. This monitor_sites function is the one called by the job scheduler, and the remaining jobs described below follow the same pattern.

import logging
from jobs import sites

logger = logging.getLogger("marvin")

site_statuses = {site: {} for site in SITES}
def monitor_sites():
    global site_statuses

    updated_sites = sites.check_sites(site_statuses, TIMEOUT_THRESHOLD)
    for site in site_statuses:
        site_errors = sorted(list(site_statuses[site].keys()))
        updated_errors = sorted(list(updated_sites[site].keys()))
        if site_errors != updated_errors:
            if updated_sites[site] == {} :
                message_rooms('Request errors to ' + site + ' have been resolved')
                logger.info('Request errors to ' + site + ' have been resolved')
            else:
                message = ""
                for error_key in updated_sites[site].keys():
                    message += error_key + '\n' + updated_sites[site][error_key] + '\n\n'

                message_rooms(message)

    site_statuses = updated_sites

Monitoring CPU and RAM Usage

The psutil library provides a simple way to estimate CPU and RAM usage percentages at a given point in time. The only thing that needs explanation here is the interval argument to psutil.cpu_percent: with interval=1 the call blocks for one second and reports usage over that window. I did not experiment with it for long, so you may wish to change it depending on your system or need for accuracy.

import psutil

def check_cpu_usage(threshold):
    usage = psutil.cpu_percent(interval=1)
    if usage >= threshold:
        return usage

    return 0

def check_memory_usage(threshold):
    usage = psutil.virtual_memory().percent
    if usage >= threshold:
        return usage

    return 0

Just like before, we create a wrapper for these methods that can be queued up by our job scheduler. All error logging is handled by this method, as there is nothing I found to be specific or unstable in the job itself that would require logging.

import logging
from jobs import system

logger = logging.getLogger("marvin")

cpu_alerting = False
mem_alerting = False
def monitor_system():
    global cpu_alerting
    global mem_alerting

    cpu_usage = system.check_cpu_usage(CPU_PERCENT_THRESHOLD) 
    if cpu_usage and not cpu_alerting:
        message_rooms('CPU usage is above threshold')
        logger.warning('CPU usage is above threshold')
    elif not cpu_usage and cpu_alerting:
        message_rooms('CPU usage is back to normal')
        logger.info('CPU usage is back to normal')

    mem_usage = system.check_memory_usage(MEM_PERCENT_THRESHOLD)
    if mem_usage and not mem_alerting:
        message_rooms('Memory usage is above threshold')
        logger.warning('Memory usage is above threshold')
    elif not mem_usage and mem_alerting:
        message_rooms('Memory usage is back to normal')
        logger.info('Memory usage is back to normal')

    cpu_alerting = bool(cpu_usage)
    mem_alerting = bool(mem_usage)
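The CPU and memory branches follow the same pattern: only speak up when the alerting state flips. As a sketch of how that could be factored out, the hypothetical transition_message helper below returns the text for a state change, or None when nothing changed. It is not part of Marvin.

```python
def transition_message(name, was_alerting, is_alerting):
    """Return the message for a state change, or None when nothing changed."""
    if is_alerting and not was_alerting:
        return name + ' usage is above threshold'
    if was_alerting and not is_alerting:
        return name + ' usage is back to normal'
    return None


print(transition_message('CPU', False, True))     # CPU usage is above threshold
print(transition_message('Memory', True, False))  # Memory usage is back to normal
print(transition_message('CPU', True, True))      # None
```

The caller would still be responsible for sending the message and updating the stored state afterwards.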

Monitoring Systemd Services

I found a neat trick online for getting systemctl to return an exit code, so I just ended up calling it directly using os.system. It’s sufficient for reporting whether a service is running or not, but in the future I may wish to parse the status message to provide better information.

import os
import copy

# services is a dictionary mapping service names to booleans
#   True means the service is NOT active: systemctl is-active exits
#   with a non-zero status for anything other than an active unit
def check_services(services):
    latest_services = copy.deepcopy(services)

    for service in latest_services.keys():
        latest_services[service] = bool(os.system('systemctl is-active --quiet ' + service))

    return latest_services
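An alternative worth considering is subprocess.run with an argument list, which avoids passing the service name through a shell. The version below is a sketch rather than what Marvin runs. The is_inactive helper and the run parameter are assumptions I added so the check can be tried without systemd available.

```python
import subprocess


def is_inactive(service, run=subprocess.run):
    """Return True when systemctl reports the unit as not active."""
    # An argument list avoids shell interpretation of the service name
    result = run(['systemctl', 'is-active', '--quiet', service])
    return result.returncode != 0


def check_services(services, run=subprocess.run):
    # Same convention as above: True means the service is NOT active
    return {service: is_inactive(service, run) for service in services}
```

Keeping the same True-means-down convention means the monitoring wrapper would not need to change.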

Calling the job and checking for changes works much like the usage checks, except that the states are stored in a dictionary mapping each service to a boolean (True meaning the service is down) rather than in separate boolean variables.

import logging
from jobs import services

logger = logging.getLogger("marvin")

service_statuses = {service: False for service in SERVICES}
def monitor_services():
    global service_statuses

    updated_services = services.check_services(service_statuses)
    for service in service_statuses:
        if service_statuses[service] != updated_services[service]:
            if updated_services[service]:
                message_rooms(service + ' is not active')
            else:
                message_rooms(service + ' is once again active')

    service_statuses = updated_services

Daemonization

This is one place where having a well-configured environment and Makefile can come in handy. To get this to work, I cloned Marvin’s repository to /var/automation/marvin, and configured the .env file to monitor the sites and services I wanted. Since my Matrix server is running on the same machine as Marvin, I supplied the Matrix host as localhost:8008, bypassing the reverse proxy I have set up for HTTPS with nginx.

I’m on a real systemd kick lately, so I configured Marvin to run as a daemon by creating the following unit file at /etc/systemd/system/marvin.service.

[Unit]
Description=A Paranoid System Monitor
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/make start
WorkingDirectory=/var/automation/marvin
Restart=on-failure

[Install]
WantedBy=default.target

Once the unit file was created I reloaded the systemd unit files with systemctl daemon-reload, then enabled and started the daemon with systemctl enable marvin && systemctl start marvin.

Closing Thoughts

That was all it took to get some basic monitoring automated. My uptime has apparently been pretty good, as he hasn’t detected any outages that I didn’t cause in my testing. The major issue I still have with Marvin is his lack of emojis and Markdown formatting in his messages. I’ve tried to get them to work, but something about the way the messages are sent keeps them from rendering correctly in my Riot client.

He could also use a little bit of snark and some callbacks to the character who inspired him. It has been so long since I read the books that nothing came to me as I was creating him.

All said, here is a screenshot of the final product. It still makes me a bit giddy to see him actually post messages.