
Monitor if your systems are up to date with Datadog

03 March 2023

Keeping your systems up to date, especially those that are constantly exposed to the Internet, is very important from a security standpoint. The current cloud-native state of the art is to use containers and orchestrators such as Docker and Kubernetes. Another option is to use virtual machine images built with Packer and deployed with Terraform. However, when you inherit multiple legacy systems that were configured by hand, serve customers, and are tightly coupled, without integration or end-to-end tests, things get more complicated. To ensure that we patch our systems often enough, we need a list of their current state. At first I used a spreadsheet generated with Ansible, but that proved ineffective because it had to be run manually. A much better solution is to use a monitoring service such as Datadog. In this post, I will explain how to create a custom Datadog check in Python.

Repository for this post is available at: https://github.com/ppabis/datadog-system-state.

Boilerplate for a Datadog check

Let's start with a simple file named SystemState.py with the following contents:

import os, shutil, socket
from datetime import datetime

from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.subprocess_output import get_subprocess_output

class SystemState(AgentCheck):
    def check(self, instance):
        pass

This simple script does not report anything yet. It imports the libraries needed for the Datadog integration (namely the AgentCheck class) and some other utilities we will use later to implement our functions. The SystemState class extends the basic AgentCheck and overrides its check method, which the Datadog Agent runs periodically. In this context, self holds the global check configuration and all the methods for interacting with Datadog, while instance holds the configuration of the current instance of the check - more on that later.

To test the check locally and get code suggestions, we can install the Datadog libraries with pip, or follow the official guide from Datadog, which uses pipx. For the pip method, install these two packages (a venv is recommended but not required):

$ pip install datadog-checks-base
$ pip install datadog-checks-dev
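
With these packages installed, you can smoke-test the check outside the Agent. Below is a minimal sketch (the file name and harness are mine, not part of the repository): when not running inside the Agent, datadog-checks-base falls back to stub modules, so calling check() does not actually send anything to Datadog.

# local_test.py - run the check once outside the Agent (illustrative sketch)
from SystemState import SystemState

# AgentCheck takes the check name, the init_config and a list of instances
check = SystemState("systemstate", {}, [{}])
check.check({})  # a no-op with the boilerplate above; useful once metrics exist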

Implementing functions for our check

We plan to report three pieces of information: the number of currently upgradable packages, the number of days since the last reboot, and the major version of the OS.

Currently upgradable packages

To detect whether any packages can be upgraded on Debian-based systems, we can use the apt command. We first list the packages that can be upgraded and then count the number of lines in the output. In bash we do it like this:

$ apt -qq list --upgradable 2>/dev/null | wc -l

We query the apt cache, use -qq to suppress the extra Listing... line, redirect stderr to /dev/null to ignore any warnings, and count the lines with wc -l. On RedHat-based systems, we can use the dnf command:

$ dnf -q check-update 2>/dev/null | grep -v ^$ | wc -l

Here we also need grep -v ^$ to remove empty lines from the output, as dnf adds an empty line at the top (as of RockyLinux 8.7). Test how these commands behave on your systems beforehand so that the check is implemented correctly.

So let's convert this to Python now. We will use get_subprocess_output, provided by Datadog, to run the command and capture its output, and Python built-ins for the remaining transformations. To detect the available package manager, we use shutil.which, which returns the path to the executable if it exists and None otherwise. The function will be a new method of the SystemState class.

class SystemState(AgentCheck):
    def get_upgradable_packages(self):

        if shutil.which("apt") is not None:
            cmd = [ "apt", "-qq", "list", "--upgradable" ] # Debian

        elif shutil.which("dnf") is not None:
            cmd = [ "dnf", "-q", "check-update" ] # RedHat

        else:
            return -1 # Not supported

        out, _, _ = get_subprocess_output( cmd, self.log, raise_on_empty_output=False )

        # Count the non-empty lines (grep -v ^$ | wc -l)
        return len([ line for line in out.splitlines() if line ])

Days since last reboot

Rebooting is a common routine for cloud-native workloads such as Kubernetes Pods - they can be replaced multiple times a day, and microservices are designed to be loosely coupled. For software installed directly on long-lived machines, however, this is often not the case.

To get the boot date and time we can use the who -b or uptime command. The output can then be parsed with datetime, and the difference between that moment and now gives the number of days since the last reboot. Plain uptime reports the number of days directly (if larger than 1). However, we will use uptime -s in this example, as it works consistently across Linux distributions, its output does not depend on the locale, and it works even in Docker (although you need to install the procps or procps-ng package first). Because uptime -s reports the boot time in a known, predictable format, we can easily parse it in Python and subtract it from the current date.

class SystemState(AgentCheck):
    def get_days_since_last_reboot(self):
        out, _, _ = get_subprocess_output( [ "uptime", "-s" ], self.log )
        boot_time = datetime.strptime( out.replace("\n", ""), "%Y-%m-%d %H:%M:%S" )
        return ( datetime.now() - boot_time ).days

Major version of the OS

To get the major version of the OS, we can use the lsb_release command if it's available. Otherwise, we check whether the /etc/lsb-release or /etc/os-release file exists and parse it.

class SystemState(AgentCheck):

    def get_os_major_version(self):
        version = "-1"

        # Using lsb_release binary if present
        if shutil.which( "lsb_release" ) is not None:
            out, _, _ = get_subprocess_output( [ "lsb_release", "-rs" ], self.log )
            version = out

        # Using /etc/lsb-release file if present
        elif os.path.isfile( "/etc/lsb-release" ):
            with open( "/etc/lsb-release", "r" ) as f:
                for line in f:
                    if line.startswith( "DISTRIB_RELEASE=" ):
                        version = line.split("=")[1]
                        break

        # Using /etc/os-release file if present
        elif os.path.isfile( "/etc/os-release" ):
            with open( "/etc/os-release", "r" ) as f:
                for line in f:
                    if line.startswith( "VERSION_ID=" ):
                        version = line.split("=")[1]
                        break

        # Strip whitespace (lines read from the files keep their trailing
        # newline), then surrounding quotes, then take the part before the dot
        return int( version.strip().strip('"').split(".")[0] )
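
As a quick sanity check of the parsing (the sample value below is mine), note that a line read from the file still carries its trailing newline, which is why we strip whitespace before stripping the quotes:

line = 'VERSION_ID="8.7"\n'                              # sample /etc/os-release line
version = line.split("=")[1]                             # '"8.7"\n'
print( int( version.strip().strip('"').split(".")[0] ) ) # prints 8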

Putting it together

Now we can use each of the functions we implemented to report the data we want, via self.gauge. Through instance we can pass custom parameters such as tags - they will be configured later in the YAML file.

import os, shutil, socket
from datetime import datetime
from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.subprocess_output import get_subprocess_output

__version__ = "0.1.0"

class SystemState(AgentCheck):
    #
    # [...] functions from above
    #
    def check(self, instance):
        tags = instance.get('tags', [])
        tags.append(f"hostname:{socket.gethostname()}") # Report hostname as tag

        self.gauge( 'systemstate.upgradable_packages', self.get_upgradable_packages(), tags=tags )
        self.gauge( 'systemstate.days_since_last_reboot', self.get_days_since_last_reboot(), tags=tags )
        self.gauge( 'systemstate.os_major_version', self.get_os_major_version(), tags=tags )

The data will be sent to Datadog under names starting with systemstate.

Using the check

To use the check with the Datadog Agent, place it in the /etc/datadog-agent/checks.d directory and restart the service. Each check also requires a minimal configuration file in /etc/datadog-agent/conf.d/. It can be either a single YAML file, SystemState.yaml, or a directory, SystemState.d, containing multiple YAML files; both must be named the same as the check script. Let's use a single file for this example. In SystemState.yaml put the following:

init_config:

instances:
  - min_collection_interval: 3600
    tags:
      - env:prod
      - os:debian

instances takes a list of dictionaries. tags is our custom argument mentioned earlier, while min_collection_interval is predefined by Datadog. The check will run every hour (3600 seconds), and each metric will be tagged with the tags listed for its instance.
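
If you prefer the SystemState.d directory layout mentioned above, the conventional approach is to put the same content in a conf.yaml file inside that directory:

$ sudo mkdir /etc/datadog-agent/conf.d/SystemState.d
$ sudo mv SystemState.yaml /etc/datadog-agent/conf.d/SystemState.d/conf.yaml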

Change ownership of those files to dd-agent and restart the Datadog Agent for the changes to take effect.

$ sudo chown dd-agent:dd-agent /etc/datadog-agent/checks.d/SystemState.py
$ sudo chown dd-agent:dd-agent /etc/datadog-agent/conf.d/SystemState.yaml
$ sudo systemctl restart datadog-agent
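
Instead of waiting for the next collection interval, you can also run the check once by hand with the Agent's built-in check runner to verify that it works:

$ sudo -u dd-agent datadog-agent check SystemState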

After a while, you should see the new metrics in Datadog. Create monitors that raise red alerts or yellow warnings when the values deviate from the desired state, for example when the number of upgradable packages exceeds 20 or the major release of Debian is older than 10. That way, even though patching itself remains manual, the state of all servers in your infrastructure is tracked automatically. For example, use this Terraform resource to create a monitor:

resource "datadog_monitor" "upgradable-packages" {
  type = "query alert"
  name = "SystemState - Upgradable packages on Production {{ hostname.name }}"
  require_full_window = false

  query = "avg(last_1h):avg:systemstate.upgradable_packages{env:prod} by {hostname} > 20"

  monitor_thresholds {
    critical = 20
    warning = 10
  }

  message = <<-EOT
    {{#is_alert}}
    More than 20 packages are upgradable on {{ hostname.name }}!
    {{/is_alert}}
    {{#is_warning}}
    More than 10 packages are upgradable on {{ hostname.name }}!
    {{/is_warning}}
    {{#is_recovery}}
    {{ hostname.name }} has permissible amount of upgradable packages.
    {{/is_recovery}}
    EOT

}

In the next post we will explore how to report the same metrics to CloudWatch.