Observability
Full-stack observability covering metrics, logs, alerting, and network monitoring.
Prometheus
Prometheus collects time-series metrics from all cluster services and nodes. It runs as a StatefulSet with persistent storage for metric retention.
| Setting | Value |
|---|---|
| Scrape interval | 60 seconds |
| Evaluation interval | 60 seconds |
| Storage | Longhorn PVC |
| Deployment | StatefulSet |
Why 60-second intervals?
The evaluation interval is set to 60 seconds (rather than the commonly used 30 seconds) because aggregation queries run slowly on the ARM-based nodes. With shorter intervals, complex recording rules can time out before they finish evaluating.
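One way to check whether the interval leaves enough headroom is to compare each rule group's last evaluation time with its interval. The sketch below works on a payload shaped like the response of Prometheus's `/api/v1/rules` endpoint and assumes each group carries `interval` and `evaluationTime` fields in seconds. The group names, the sample numbers, and the 0.5 threshold are illustrative, not values taken from this cluster.

```python
def slow_rule_groups(rules_payload, max_fraction=0.5):
    """Return (group name, fraction) pairs for groups using too much of their interval."""
    slow = []
    for group in rules_payload["data"]["groups"]:
        interval = float(group["interval"])          # seconds between evaluations
        eval_time = float(group["evaluationTime"])   # seconds the last run took
        fraction = eval_time / interval
        if fraction > max_fraction:
            slow.append((group["name"], round(fraction, 2)))
    return slow


# Hypothetical payload for illustration only.
sample = {
    "status": "success",
    "data": {
        "groups": [
            {"name": "node-aggregations", "interval": 60, "evaluationTime": 41.5},
            {"name": "app-rates", "interval": 60, "evaluationTime": 3.2},
        ]
    },
}

print(slow_rule_groups(sample))
```

A group that regularly uses most of its interval is a candidate for simpler expressions or a longer interval.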
Grafana
Grafana provides dashboards backed by two data sources: Prometheus (metrics) and Loki (logs). Pre-built dashboards cover cluster health, node resources, and application-specific metrics.
Loki
Loki aggregates logs from all pods and indexes them by label, much as Prometheus does for metrics, rather than indexing the full log text. It pairs with Grafana for log exploration and correlation with metrics.
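Because streams are selected by label, a log query is a label selector plus an optional line filter. The following is a minimal sketch of building such a LogQL query and the matching URL for Loki's `query_range` HTTP endpoint. The label names (`namespace`, `app`), the filter text, and the `loki.example` host are assumptions for illustration; the labels actually available depend on how the log collector is configured.

```python
from urllib.parse import urlencode


def build_logql(labels, contains=None):
    """Build a LogQL query: a label selector plus an optional line filter."""
    def quote(value):
        # Escape backslashes and double quotes for a LogQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    selector = ", ".join(
        f"{name}={quote(value)}" for name, value in sorted(labels.items())
    )
    query = "{" + selector + "}"
    if contains is not None:
        query += " |= " + quote(contains)
    return query


def query_range_url(base_url, query, limit=100):
    """Build a URL for Loki's query_range HTTP endpoint."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


# Hypothetical labels and host, for illustration only.
query = build_logql({"namespace": "monitoring", "app": "grafana"}, contains="error")
print(query)
print(query_range_url("http://loki.example:3100", query))
```

The same query string can be pasted into Grafana's Explore view with the Loki data source selected.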
Alertmanager
Alertmanager handles alert routing, deduplication, and notification delivery. Alerts from Prometheus are routed to Discord for real-time notifications.
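This page does not say how the Discord delivery is wired up (for example, a native receiver or a webhook bridge). As a rough illustration of what ends up in the channel, the sketch below turns a payload shaped like Alertmanager's webhook notification into a JSON body with a `content` field, which is what a Discord webhook accepts. The alert name, severity, and summary are hypothetical, and the function is not part of Alertmanager.

```python
import json

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects message content longer than this


def discord_message(alertmanager_payload):
    """Turn an Alertmanager webhook payload into a Discord webhook JSON body."""
    lines = []
    for alert in alertmanager_payload["alerts"]:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{alert['status'].upper()}] {name} ({severity}): {summary}")
    # Truncate so the message stays within Discord's content limit.
    content = "\n".join(lines)[:DISCORD_CONTENT_LIMIT]
    return {"content": content}


# Hypothetical alert for illustration only.
payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "NodeDown", "severity": "critical"},
            "annotations": {"summary": "node unreachable for 5 minutes"},
        }
    ],
}

print(json.dumps(discord_message(payload)))
```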
Observium
Observium provides SNMP-based network monitoring for switches, routers, and infrastructure devices.
| Setting | Value |
|---|---|
| Node | aimax (amd64-only image) |
| Database | MariaDB (StatefulSet) |
| SNMP polling | Uses host IP via flannel masquerade |
| LoadBalancer IP | 192.168.64.116 |
TIP
Because of flannel masquerade, SNMP polls reach network devices with the source IP of the node hosting the pod, not the pod's own IP. Ensure the SNMP ACLs on network devices permit that node's IP address.
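A quick way to sanity-check a device ACL against the address it will actually see is sketched below, using Python's standard `ipaddress` module. The node IP, pod IP, and ACL network are hypothetical placeholders, since this page does not list them; substitute the real node address and the ranges configured on the device.

```python
import ipaddress


def acl_permits(source_ip, permitted_networks):
    """Return True if source_ip falls inside any permitted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(net) for net in permitted_networks)


# Hypothetical values: the node IP and pod IP are not given in this document.
node_ip = "192.168.64.20"
pod_ip = "10.42.3.17"
device_acl = ["192.168.64.0/24"]

print(acl_permits(node_ip, device_acl))  # the masqueraded source address
print(acl_permits(pod_ip, device_acl))   # the pod address the device never sees
```

An ACL written for the pod network would not match, because the device only ever sees the node's address.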