
Observability

Full-stack monitoring covering metrics, logs, alerting, and network devices.

Prometheus

Prometheus collects time-series metrics from all cluster services and nodes. It runs as a StatefulSet with persistent storage for metric retention.

| Setting | Value |
| --- | --- |
| Scrape interval | 60 seconds |
| Evaluation interval | 60 seconds |
| Storage | Longhorn PVC |
| Deployment | StatefulSet |

Why 60-second intervals?

The evaluation interval is set to 60 seconds (rather than the commonly used 30 seconds) because aggregation queries run slowly on the ARM-based nodes. With shorter intervals, complex recording rules can time out before they finish evaluating.
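One way to check whether the 60-second interval leaves enough headroom is to compare how long each rule group takes to evaluate against its configured interval. The sketch below is illustrative only: it uses two metrics that Prometheus exports about itself (`prometheus_rule_group_last_duration_seconds` and `prometheus_rule_group_interval_seconds`) and parses the JSON shape returned by an instant query to `/api/v1/query`. The 0.8 threshold, the rule group names, and the sample response are assumptions made for the example, not values from this cluster.

```python
from typing import List, Tuple

# PromQL: fraction of the interval that each rule group's last evaluation used.
# Both metrics are exported by Prometheus about its own rule manager.
RATIO_QUERY = (
    "prometheus_rule_group_last_duration_seconds"
    " / prometheus_rule_group_interval_seconds"
)


def groups_near_limit(api_response: dict, threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return (rule_group, ratio) pairs whose ratio is at or above threshold.

    api_response is the decoded JSON body of an instant query, which has
    the form {"status": ..., "data": {"resultType": "vector", "result": [...]}}.
    """
    if api_response.get("status") != "success":
        raise ValueError("query did not succeed")
    flagged = []
    for sample in api_response["data"]["result"]:
        group = sample["metric"].get("rule_group", "<unknown>")
        ratio = float(sample["value"][1])  # value is [timestamp, "number as string"]
        if ratio >= threshold:
            flagged.append((group, ratio))
    # Worst offenders first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hand-written sample response with hypothetical group names (not real data).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"rule_group": "node-aggregations"}, "value": [1700000000, "0.9"]},
            {"metric": {"rule_group": "cluster-health"}, "value": [1700000000, "0.1"]},
        ],
    },
}

if __name__ == "__main__":
    for group, ratio in groups_near_limit(SAMPLE_RESPONSE):
        print(f"{group}: uses {ratio:.0%} of its evaluation interval")
```

A group whose ratio approaches 1.0 is close to missing evaluations, which would suggest either simplifying the rule or keeping the longer interval.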

Grafana

Grafana provides visualization dashboards connected to both Prometheus (metrics) and Loki (logs) as data sources. Pre-built dashboards cover cluster health, node resources, and application-specific metrics.
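As a rough illustration of the two data sources, the sketch below builds request bodies in the shape accepted by Grafana's `POST /api/datasources` HTTP API. The page does not say how the data sources are actually provisioned here, and the service URLs are hypothetical placeholders, so treat this as one possible setup rather than the cluster's configuration.

```python
import json


def datasource_payload(name: str, ds_type: str, url: str, is_default: bool = False) -> dict:
    """Build a request body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": name,
        "type": ds_type,      # Grafana plugin id, e.g. "prometheus" or "loki"
        "url": url,
        "access": "proxy",    # Grafana's backend makes the requests
        "isDefault": is_default,
    }


# Hypothetical in-cluster service URLs; substitute the real ones.
DATASOURCES = [
    datasource_payload("Prometheus", "prometheus", "http://prometheus.example.svc:9090", True),
    datasource_payload("Loki", "loki", "http://loki.example.svc:3100"),
]

if __name__ == "__main__":
    for body in DATASOURCES:
        print(json.dumps(body, indent=2))
```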

Loki

Loki aggregates logs from all pods using a label-based indexing approach similar to Prometheus. It pairs with Grafana for log exploration and correlation with metrics.
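To make the label-based approach concrete, here is a minimal sketch that builds a LogQL stream selector from a set of labels and a URL for Loki's `/loki/api/v1/query_range` endpoint. The label names (`namespace`, `app`), the base URL, and the timestamps are assumptions chosen for illustration; the labels actually available depend on how the log collector is configured.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def logql_selector(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL query: a label selector plus an optional line filter."""

    def escape(value: str) -> str:
        # Escape backslashes and double quotes inside a quoted LogQL string.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    matchers = ", ".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    query = "{" + matchers + "}"
    if contains is not None:
        query += f' |= "{escape(contains)}"'  # keep only lines containing the text
    return query


def query_range_url(base_url: str, query: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a URL for Loki's range query endpoint (times in Unix nanoseconds)."""
    params = urlencode({"query": query, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    q = logql_selector({"namespace": "monitoring", "app": "grafana"}, contains="error")
    print(q)
    # Hypothetical Loki address and time window.
    print(query_range_url("http://loki.example:3100", q, 1700000000000000000, 1700000600000000000))
```

Only the labels inside the braces are indexed; the `|=` filter scans the log lines of the matching streams, which is why narrow label selectors keep queries fast.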

Alertmanager

Alertmanager handles alert routing, deduplication, and notification delivery. Alerts from Prometheus are routed to Discord for real-time notifications.
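The page does not say how the Discord integration is wired up. As one possibility, a small relay could receive Alertmanager's webhook payload and reshape it into the JSON body that a Discord webhook accepts. The sketch below shows only that transformation, under stated assumptions: the alert names, labels, and annotations in the sample are invented, and the `summary` annotation and `severity` label are conventions that the alert rules may or may not follow.

```python
DISCORD_CONTENT_LIMIT = 2000  # Discord caps message content at 2000 characters


def to_discord_message(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Discord webhook body."""
    alerts = payload.get("alerts", [])
    status = payload.get("status", "unknown").upper()
    lines = [f"[{status}] {len(alerts)} alert(s)"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unnamed")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"- {name} ({severity}): {summary}".rstrip())
    content = "\n".join(lines)
    # Truncate rather than fail if a large group exceeds the limit.
    return {"content": content[:DISCORD_CONTENT_LIMIT]}


# Hand-written sample in the shape of Alertmanager's webhook payload (not real data).
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "severity": "warning"},
            "annotations": {"summary": "CPU above 90%"},
        }
    ],
}

if __name__ == "__main__":
    print(to_discord_message(SAMPLE_PAYLOAD)["content"])
```

Because Alertmanager groups and deduplicates alerts before sending, one payload can carry several alerts, which is why the function emits one line per alert under a single header.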

Observium

Observium provides SNMP-based network monitoring for switches, routers, and infrastructure devices.

| Setting | Value |
| --- | --- |
| Node | aimax (amd64-only image) |
| Database | MariaDB (StatefulSet) |
| SNMP polling | Uses host IP via flannel masquerade |
| LoadBalancer IP | 192.168.64.116 |

TIP

Because flannel masquerades outbound pod traffic, SNMP polls reach network devices with the IP address of the node hosting the Observium pod, not the pod's own IP. Ensure the SNMP ACLs on network devices permit that node's IP address.
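A quick way to sanity-check a device's SNMP ACL against the address the polls will actually arrive from is a small membership test with Python's standard `ipaddress` module. The node IP, pod IP, and ACL entries below are hypothetical examples; substitute the real node address and the networks configured on each device.

```python
import ipaddress
from typing import List


def acl_permits(source_ip: str, acl_networks: List[str]) -> bool:
    """Return True if source_ip falls inside any network listed in the ACL."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(network, strict=False)
        for network in acl_networks
    )


# Hypothetical values for illustration only.
NODE_IP = "192.168.64.10"          # address the device sees after masquerade
POD_IP = "10.42.1.15"              # pod address, never seen by the device
DEVICE_ACL = ["192.168.64.0/24"]   # networks the device allows to poll it

if __name__ == "__main__":
    print("node permitted:", acl_permits(NODE_IP, DEVICE_ACL))
    print("pod permitted:", acl_permits(POD_IP, DEVICE_ACL))
```

An ACL written against the pod network would not match, since the device only ever sees the node's address.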

Homelab Infrastructure

Build: a625fb4