Monitoring¶

The monitoring system's WebUIs are reachable at https://monitoring.customer-internal-domain.internal-tld. They are usually only accessible via your company's VPN. You can log in using your employee username and password.

Granting employees access to the monitoring system¶

Access to the monitoring system is managed via HTTP basic auth. In order to add a new user to the basic auth, edit the nginx_htpasswd_present Ansible variable in playbook-infrastructure-company/plays/utilities/prometheus-server/main.yml and apply the prometheus-server Ansible play.

Please also refer to the documentation on how to open the firewall so that specific employees can connect to the monitoring system over the WireGuard VPN.

prometheus-exporters (agent)¶

Prometheus-exporters are small programs installed on the servers to be monitored. They collect metrics when they receive a request to export them. Metrics are collected from various sources: system resources such as CPU, RAM and disk, or running programs such as nginx, postgresql and others.

Theory of operation¶

Each prometheus-exporter listens for HTTP requests on 127.0.0.1:specific-exporter-port. Metrics can commonly be fetched from the path /metrics.

In addition to the various prometheus-exporters running on each server, there is a lighttpd webserver installed on every monitoring client server. On Debian 11 and above, lighttpd is installed via apt. On older Debian versions, it is compiled to provide a version recent enough to support proxy pass functionality. The lighttpd server listens for incoming requests on the server-internal VPN. Requests may only originate from the monitoring server, which is enforced by the monitoring client server's firewall. The lighttpd server then proxy passes the requests from the monitoring server to the specific prometheus-exporter using a specific path. Hence, a request from the monitoring server collecting metrics from one of its clients looks like this:

(1) monitoring server: GET https://monitoring-client/node
(2) monitoring client: lighttpd proxy pass /node -> localhost:10900
(3) monitoring client: prometheus-node-exporter answers the request on :10900

For lighttpd to be able to answer requests over HTTPS, each monitoring client has a letsencrypt certificate for its own hostname in /etc/letsencrypt/live/{{ inventory_hostname }}/, which is renewed by Ansible every 2.5 months.

The association of which prometheus-exporter belongs to which port on lighttpd can be found in the role-prometheus-exporters at https://git.blunix.com/ansible-roles/role-prometheus-exporters/-/blob/master/defaults/main.yml.

Upon requests, the exporters will provide data in the following structure:

root@monitoring-server ~ # curl https://monitoring-client:10999/node
# HELP apt_reboot_required is a reboot required by apt
# TYPE apt_reboot_required gauge
apt_reboot_required 0
# HELP apt_security_upgrades number of available apt security upgrades
# TYPE apt_security_upgrades gauge
apt_security_upgrades 0
[...]
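The output above follows the Prometheus text exposition format: comment lines starting with # carry HELP and TYPE information, and every other line is a sample consisting of a metric name (optionally with labels) and a value. As a minimal sketch of how such output can be read programmatically, the following Python function turns it into a dictionary. The function name parse_metrics is made up for this illustration, and the sketch deliberately ignores optional timestamps.

```python
def parse_metrics(text):
    """Parse simple Prometheus text exposition lines into a dict.

    Handles samples of the form `name value` or `name{labels} value`.
    Comment lines (# HELP / # TYPE) and non-sample lines are skipped.
    Optional timestamps after the value are NOT supported in this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; everything before
        # it is the metric name plus the optional label set.
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # not a sample line, for example "[...]"
    return samples


sample = """\
# HELP apt_reboot_required is a reboot required by apt
# TYPE apt_reboot_required gauge
apt_reboot_required 0
# HELP apt_security_upgrades number of available apt security upgrades
# TYPE apt_security_upgrades gauge
apt_security_upgrades 0
[...]
"""

if __name__ == "__main__":
    for name, value in parse_metrics(sample).items():
        print(name, value)
```

This is only meant for quick debugging on the command line. Prometheus itself does the real parsing when it scrapes the exporter.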

This data is then collected and stored by prometheus-server (commonly for two weeks) and optionally kept for long-term use (three years) by prometheus-downsampling. Prometheus-server also evaluates rules against those metrics to create alerts, which are handed to prometheus-alertmanager to notify humans. The data of prometheus-server and prometheus-downsampling can be loaded into grafana to create a better overview using graphs and charts.

Prometheus node exporter¶

The one prometheus-exporter installed on all servers is the prometheus-node-exporter. It monitors all the basics of a common Debian installation, like CPU, RAM, disk space and so on. All other exporters are optional and may be installed to monitor specific programs such as nginx or postgresql.

The metrics configured to be exported by the prometheus-node-exporter can be found in the Ansible dictionary prometheus_exporter_node_systemd_service_arguments, which is included in the Blunix Ansible role prometheus-exporters defaults/main.yml file.
This dictionary can of course be overridden in the playbook-infrastructure-company. Changes to the dictionary are however commonly not required. The description of each argument contained in the dictionary can be found here.

Prometheus node exporter - textfile collector¶

The prometheus-node-exporter comes with one small and very helpful gimmick - it can read textfiles from /var/lib/prometheus-node-textfile-collector/logs/, which in turn contain metrics. This path is a default of the Blunix Ansible role, not of prometheus itself. These textfiles are commonly created by using cron jobs / systemd timers located in /var/lib/prometheus-node-textfile-collector/scripts/. For example:

root@monitoring-server ~ # cat /var/lib/prometheus-node-textfile-collector/logs/cronjob_prometheus_node_textfile_collector_apt.sh.prom
# HELP cronjob_duration Elapsed real (wall clock) time used by the process, in seconds
# TYPE cronjob_duration gauge
cronjob_duration{name="prometheus_node_textfile_collector_apt.sh"} 2
# HELP cronjob_memory Average total (data+stack+text) memory use of the process, in Kilobytes
# TYPE cronjob_memory gauge
cronjob_memory{name="prometheus_node_textfile_collector_apt.sh"} 0
[...]

The cron job / systemd timer can run at any interval; it does not have to match the 30-second default scrape interval of prometheus-server. On every scrape, all files with the .prom extension in the /var/lib/prometheus-node-textfile-collector/logs/ directory are read. Malformed files are reported through the node_textfile_scrape_error metric, which is also exported by prometheus-node-exporter, while all correctly formatted files are still scraped as expected.
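Because the directory is read on every scrape, a script that is still writing its output can be caught with a half-written file. A common way to avoid this is to write to a temporary file in the same directory and rename it into place. The following Python sketch shows that pattern. The function name write_prom_file and the metric my_custom_check_ok are hypothetical, and the file mode 0644 assumes that prometheus-node-exporter runs as a different user than the script.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Write metric lines to <directory>/<filename> atomically."""
    if not filename.endswith(".prom"):
        raise ValueError("the textfile collector only reads *.prom files")
    # Create the temporary file in the target directory so that the final
    # rename stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates the file with mode 0600; make it world-readable
        # (assumption: the exporter runs as another user).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        # Remove the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Demo in a throwaway directory; on a monitoring client the target would
    # be /var/lib/prometheus-node-textfile-collector/logs/.
    demo_dir = tempfile.mkdtemp()
    write_prom_file(demo_dir, "my_custom_check.prom", [
        "# HELP my_custom_check_ok 0 if the check passed, 1 otherwise",
        "# TYPE my_custom_check_ok gauge",
        "my_custom_check_ok 0",
    ])
    print(open(os.path.join(demo_dir, "my_custom_check.prom")).read())
```

The temporary file ends in .tmp, so it is ignored by the collector until it has been renamed to its final .prom name.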

Other prometheus exporters¶

Depending on the services running on a specific server, other prometheus-exporters may be required. Some commonly required and publicly available prometheus-exporters are already implemented in role-prometheus-exporters. Their configuration options can be found here.
In many cases it is enough to just enable the exporter in Ansible, like so: prometheus_exporter_elasticsearch: True. For others, additional variables may have to be defined (see the defaults/main.yml file of the role-prometheus-exporters for configurable options).

A mostly complete list of exporters maintained by the Prometheus developers or community members can be found here. Searching GitHub may also be helpful. If you would like an exporter from this list to be implemented in the Blunix role-prometheus-exporters, please open an issue in this repository.

Custom prometheus exporters¶

In order to provide custom metrics about custom programs to prometheus-server, there are two options:

Easy: Add a custom BASH / Python / other script to /var/lib/prometheus-node-textfile-collector/scripts/ and set up a cron job or systemd timer for it.
Complex: Write a custom prometheus exporter, for example in Python or Golang.

Adding a custom prometheus-node-exporter textfile collector script¶

(1) Create a list of dictionaries prometheus_exporter_node_textfile_templates_custom in playbook-infrastructure-company/inventory/{group_vars,host_vars}/{group_name,host_name}.yml. Copy the structure from prometheus_exporter_node_textfile_templates_defaults, which can be found here.

(2) Place the script in playbook-infrastructure-company/plays/utilities/prometheus-exporters/templates/var/lib/prometheus-node-textfile-collector/scripts/{{ name-of-your-script }}.j2 and make sure to add its path to prometheus_exporter_node_textfile_templates_custom.

playbook-infrastructure-company/inventory/{group_vars,host_vars}/{group_name,host_name}.yml should then look like this:

prometheus_exporter_node_textfile_templates_custom:
    # Name of the script
  - name: my-custom-textfile-exporter-script.sh
    # Relative path below playbook-infrastructure-company/plays/utilities/prometheus-exporters/
    src: "templates/var/lib/prometheus-node-textfile-collector/scripts/my-custom-textfile-exporter-script.sh.j2"
    # Run script as
    user: root
    # Cron settings when to run the script
    cron:
      minute: "0"
      hour: "0"

Example scripts can be found here.

(3) Apply the changes using Ansible.

Writing a custom prometheus-exporter¶

Documentation on how to write prometheus-exporters can be found here.

(1) Write a program that collects data from your specific source and is able to listen on a given port on localhost. Upon GET on localhost:port/metrics, your program should return data in the following format:

# HELP metric_name description of the metric
# TYPE metric_name gauge
metric_name 12345

If a metric only needs to express good or bad, use 0 for good and 1 for bad.

(2) Write an Ansible play (or role) that installs your exporter (ask Blunix GmbH for help when required).

(3) Install the exporter on the given server(s) using Ansible (ask Blunix GmbH for help when required).

(4) Add the custom port of your exporter to the lighttpd configuration on the server(s). Copy the list of dictionaries prometheus_exporters_ports from the Blunix role prometheus-exporters defaults/main.yml to playbook-infrastructure-company/inventory/group_vars/all.yml, where a custom port and path can be added for the new exporter. Custom ports should start from 12000.

(5) Apply the changes to lighttpd by running the prometheus-exporter play on the server(s).

(6) To instruct prometheus-server to collect metrics from https://monitoring-client/your-exporter-path on the server(s), add or edit the dictionary prometheus_scrape_configs in playbook-infrastructure-company/inventory/{group_vars,host_vars}/{group_name,host_name}.yml. An example for this dictionary can be found in the Blunix role prometheus-exporters molecule/default/playbook.yml file.

(7) Add custom alert rules, described here.

(8) Apply the prometheus-server play via Ansible.
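As an illustration of step (1), the following is a minimal sketch of an exporter written with only the Python standard library. It listens on localhost and answers GET /metrics with the format shown in step (1). The names collect_metrics, MetricsHandler and make_server are hypothetical, the value 12345 is a placeholder for real measurements, and port 12000 merely follows the recommendation from step (4).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Return the metrics payload; replace the constant with real data."""
    value = 12345  # placeholder measurement
    return (
        "# HELP metric_name description of the metric\n"
        "# TYPE metric_name gauge\n"
        f"metric_name {value}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect_metrics().encode("utf-8")
        self.send_response(200)
        # Content type of the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real exporter would log requests


def make_server(port=12000):
    # Bind to localhost only; lighttpd is what exposes the exporter
    # to the monitoring server.
    return HTTPServer(("127.0.0.1", port), MetricsHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

For anything beyond a handful of metrics, a client library is usually the more maintainable choice; this sketch only shows how little is needed to satisfy the protocol.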

prometheus-server (agent data aggregation and evaluation)¶

The prometheus-server sends an HTTPS request to each monitoring client server every 30 seconds. The collected metrics are then saved in Prometheus' own time series database, where they are kept for 14 days.

The prometheus-server provides a WebUI which allows administrators to query metrics, either their current values or over a given timeframe. It can be found at https://monitoring.customer-internal-domain.internal-tld. The WebUI is mostly designed for debugging purposes. For more usable graphs and visualizations, use grafana (described below).
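Queries entered in the WebUI can also be sent to the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The following Python sketch builds such a request and extracts the values from the JSON reply. It assumes that the API is reachable below the same base URL and behind the same HTTP basic auth as the WebUI, which may differ in your installation. The function names are made up for this example.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def extract_values(response):
    """Return (labels, value) pairs from an instant-vector API response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    # Each series carries its labels in "metric" and a [timestamp, "value"]
    # pair in "value"; the value is transmitted as a string.
    return [
        (series["metric"], float(series["value"][1]))
        for series in response["data"]["result"]
    ]


def query(base_url, promql, username, password):
    """Run an instant query using HTTP basic auth (assumed setup)."""
    request = urllib.request.Request(build_query_url(base_url, promql))
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_values(json.load(reply))


if __name__ == "__main__":
    # Placeholder credentials; replace with your own before running.
    base = "https://monitoring.customer-internal-domain.internal-tld"
    for labels, value in query(base, "apt_security_upgrades", "user", "secret"):
        print(labels.get("instance"), value)
```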

prometheus-server alerts¶

To create new alerts for specific metrics, add new alert rules to the dictionary prometheus_server_rules_custom in playbook-infrastructure-company/plays/utilities/prometheus-server/main.yml. An example can be found in the Blunix role prometheus-exporters molecule/default/playbook.yml file.
Place the rules in playbook-infrastructure-company/plays/utilities/prometheus-server/templates/etc/prometheus/rules.d/my-custom-exporter.yml.j2. Example rules can be found in the Blunix role prometheus-exporters templates/etc/prometheus-server/rules.d/ directory.

prometheus-server-downsampling (long term metrics)¶

Your company may want to keep some metrics for the long term, for example to prove the fulfillment of certain Service Level Agreements or similar.

Please note that it is most likely not necessary to keep all data for the long term, and that doing so may waste storage space. You will most likely not care how many apt upgrades were available on machine xyz 2.5 years ago. Hence, the first step is to compile a list of metrics you wish to keep for the long term. HTTP uptime and response times are a good example.

For this use case there is a second instance of prometheus-server running on the monitoring server, called prometheus-downsampling, which has a default data retention period of three years instead of two weeks. It uses only a single data source, which is the actual prometheus-server, and has a scrape interval of five minutes.

prometheus-server uses so-called recording rules to save existing metrics under a new name with a prefix, in this case downsampling. For example, a metric called http_request_duration_microseconds_count would additionally be saved as downsampling_http_request_duration_microseconds_count. The prometheus-downsampling process has only one rule defined: to pull all metrics matching downsampling.* from prometheus-server.
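One way such a pull can be implemented is Prometheus federation, where one Prometheus instance scrapes the /federate endpoint of another and selects series with match[] parameters. Whether the Blunix role configures prometheus-downsampling exactly like this is an assumption. The following Python sketch only shows what a matching request URL would look like, which can be handy for checking by hand which series would be selected. The base URL with port 9090 is a placeholder.

```python
import urllib.parse


def federate_url(base_url, name_regex="downsampling.*"):
    """Build a /federate URL selecting all series whose name matches name_regex."""
    # __name__ is the internal label that holds the metric name.
    selector = '{__name__=~"%s"}' % name_regex
    query = urllib.parse.urlencode({"match[]": selector})
    return base_url.rstrip("/") + "/federate?" + query


if __name__ == "__main__":
    # Hypothetical address of the primary prometheus-server.
    print(federate_url("http://127.0.0.1:9090"))
```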

Adding new metrics for long term storage¶

In order to configure further metrics for aggregation with prometheus-downsampling, rules like the following have to be created for the prometheus-server.

Example: templates/etc/prometheus-server/rules.d/my-downsampling-export.yml:

groups:
  - name: DownsamplingHttpRequestDuration
    rules:
      - record: downsampling:http_request_duration_microseconds_count
        expr: avg(rate(http_request_duration_microseconds_count{handler="federate"}[2m])) by (job, federate)

The documentation on how to write prometheus rules can be found here.

To apply your new rules, add them to the Ansible dictionary prometheus_server_rules_custom and apply the Ansible prometheus-server role.

Incident priorities¶

Priority	Description	Action
P1	Critical	Text-to-speech in monitoring-alerts
P2	High	List in monitoring-alerts but do not text-to-speech
P3	Moderate	List in monitoring-alerts `overview.py`
P4	Low	Show in webui only

The variable prometheus_exporter_node_textfile_system_priority from role-prometheus-exporters overrides that value downwards: a check is never treated as more urgent than the priority of the system it runs on. Example:
Check priority: 1, system priority: 2, actual priority in monitoring-alerts: 2
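Read this way, the rule amounts to taking the less urgent of the two priorities. The following Python sketch expresses that reading; the function name effective_priority is made up for this example, and the behaviour for a check that is already less urgent than its system is an interpretation of "downwards", not something the role documentation states explicitly.

```python
def effective_priority(check_priority, system_priority):
    """Return the priority used in monitoring-alerts.

    P1 is the most urgent priority, so the numerically larger
    (less urgent) of the two values wins.
    """
    return max(check_priority, system_priority)


if __name__ == "__main__":
    # The example from the text: check priority 1 on a priority 2 system.
    print(effective_priority(1, 2))
```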

How to implement new collector with alerts¶

(1) Write a collector.
(2) Add it to the role-prometheus-exporters.
(3) Add it to the prometheus-exporters play.
(4) Apply the role.
(5) Add the exporter to the exporters loop in templates/etc/prometheus-server/prometheus.yml.j2 in this role.
(6) Check in the prometheus WebUI that the metrics are coming in.
(7) Create rules for the metrics.

prometheus-alertmanager (alert humans)¶

The prometheus-alertmanager provides a WebUI which displays currently active alerts. It is also responsible for sending email alerts or triggering other actions to notify humans. In addition, it provides an API that can be polled for currently active alerts.

Alert priorities¶

Alerts are categorized into priorities:

P5 Informational - No notification needed - Show only on dashboards - Examples: Actions which are regular during self healing such as failover actions

P4 Low - Notifications during office hours / loud hours - Incidents that require manual intervention, but may wait up to one week - Priority may be increased after 1 week - Examples: Missed cronjobs, expiring ssl certificates, nonworking backups etc

P3 Moderate - Notifications during office hours immediately - Notifications delayed by 3 hours between 10pm and 7am - Incidents that may reduce functionality if unhandled, but should recover by self healing mechanism - Example: High CPU, high load, disk space predictions, high exception count

P2 High - Immediate notifications 24/7 - Escalates to CTO after 1 hour - Incidents that may have major impact to infrastructure or product - Examples: TargetDown, NoTargetFound, high error count, high response times

P1 Critical - Immediate notifications 24/7 - Notifies CTO / CEO immediately - Confirmed outages of product related services - Examples: BlackboxProbe (HTTP check) failed, High amount of customer facing errors

Alert notifications (emails, API)¶

The Blunix role-prometheus-server is currently only configured to send emails, not to send alerts by chat systems.

Blunix GmbH employees generally use the API to be notified of alerts. Emails are mostly for customers.
If you are serious about being alerted by monitoring systems (24/7), you will sooner rather than later notice that a short sound from your mobile phone has little effect when you are more than a few meters away from it or distracted. In the author's opinion, critical monitoring alerts should keep notifying a human until they are either resolved or acknowledged for processing. Hence, the author prefers to have a dedicated program running on his mobile device which polls the prometheus-alertmanager API every 30 seconds and plays an alert sound (by simply reading the alert with text-to-speech) until the alert is acknowledged or no longer active. This is the case for priority one and two alerts. All other alerts are sent to Blunix GmbH employees by email as well.
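The polling approach described above can be sketched in a few lines of Python against the Alertmanager v2 API endpoint /api/v2/alerts. This is not the program used by Blunix GmbH, only an illustration of the idea. The base URL, the label name priority and its values P1 and P2 are assumptions, authentication is left out, and print stands in for text-to-speech. Acknowledging is modelled here by silencing the alert in Alertmanager, which removes it from the result because of the silenced=false filter.

```python
import json
import time
import urllib.request

# Hypothetical address; replace with your Alertmanager URL.
ALERTMANAGER_URL = "https://monitoring.example.invalid/alertmanager"
POLL_INTERVAL_SECONDS = 30


def fetch_alerts(base_url):
    """Fetch active, unsilenced, uninhibited alerts from the Alertmanager v2 API."""
    url = base_url.rstrip("/") + "/api/v2/alerts?active=true&silenced=false&inhibited=false"
    with urllib.request.urlopen(url, timeout=10) as reply:
        return json.load(reply)


def urgent_alerts(alerts, priorities=("P1", "P2"), label="priority"):
    """Keep alerts whose (assumed) priority label is one of `priorities`."""
    return [a for a in alerts if a.get("labels", {}).get(label) in priorities]


def announce(alert):
    """Placeholder for text-to-speech: print the alert name instead."""
    print("ALERT:", alert["labels"].get("alertname", "unknown"))


def poll_forever(base_url):
    """Announce urgent alerts on every poll until they disappear from the API."""
    while True:
        for alert in urgent_alerts(fetch_alerts(base_url)):
            announce(alert)
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    poll_forever(ALERTMANAGER_URL)
```

Because the loop re-announces every matching alert on each poll, an alert keeps making noise until it is resolved or silenced, which is the behaviour argued for above.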

In order to add new email recipients, modify the following Ansible variables and apply the prometheus-server Ansible play.

# Default for all following variables
prometheus_alertmanager_email_recipient: "monitoring@example.com, foo.bar@example.com"

# If undefined, all following dictionaries will default to "prometheus_alertmanager_email_recipient"
# List of recipients for Priority 1 alerts
prometheus_alertmanager_email_recipient_p1: "monitoring@example.com"
# List of recipients for Priority 2 alerts
prometheus_alertmanager_email_recipient_p2: "monitoring@example.com, bar.qux@example.com"
# All following variables will default to "prometheus_alertmanager_email_recipient"
#prometheus_alertmanager_email_recipient_p3
#prometheus_alertmanager_email_recipient_p4
#prometheus_alertmanager_email_recipient_p5
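The fallback behaviour described in the comments above can be summarized in a short Python sketch. It only mirrors the documented logic (a per-priority variable wins, otherwise the default applies) and is not the code of the Ansible role. The function name recipients_for is hypothetical.

```python
def recipients_for(priority, variables):
    """Return the list of recipients for a priority (1 to 5).

    Falls back to prometheus_alertmanager_email_recipient when no
    per-priority variable is defined.
    """
    default = variables["prometheus_alertmanager_email_recipient"]
    key = f"prometheus_alertmanager_email_recipient_p{priority}"
    raw = variables.get(key, default)
    # The variables hold comma-separated address strings.
    return [address.strip() for address in raw.split(",") if address.strip()]


example_vars = {
    "prometheus_alertmanager_email_recipient": "monitoring@example.com, foo.bar@example.com",
    "prometheus_alertmanager_email_recipient_p1": "monitoring@example.com",
    "prometheus_alertmanager_email_recipient_p2": "monitoring@example.com, bar.qux@example.com",
}

if __name__ == "__main__":
    for priority in range(1, 6):
        print(priority, recipients_for(priority, example_vars))
```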

As the default WebUI of prometheus-alertmanager is not very pretty to look at and does not provide a good overview of currently active alerts, the karma WebUI is installed in addition to it. It can be reached from the main page of the monitoring server.

grafana (fancy agent data visualization)¶

TODO

Doc: https://grafana.com/docs/