# Monitoring & Alerting
## Monitoring stack
```mermaid
graph LR
    Targets[Pod/Node] -->|scrape| Prometheus
    Prometheus -->|query| Grafana
    Prometheus -->|rules| Alertmanager
    Alertmanager -->|notify| Telegram[Telegram Bot]
    Flux -->|errors| FluxNotif[Flux Notification]
    FluxNotif -->|notify| Telegram
    Gatus -->|HTTP checks| Services
    Gatus -->|alert| Telegram
```
## Prometheus
- Chart: kube-prometheus-stack v72.6.2
- Retention: 30 days / 18GB max
- Storage: nfs-flash (SSD), 20Gi
- ServiceMonitor: auto-detects from all namespaces
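The settings above map onto the chart's values roughly as follows. This is a sketch, not the actual HelmRelease: the field names come from the upstream kube-prometheus-stack chart, and `serviceMonitorSelectorNilUsesHelmValues: false` is the chart's standard way to pick up ServiceMonitors from all namespaces.

```yaml
# Sketch of the relevant kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 30d          # time-based retention
    retentionSize: 18GB     # size cap, whichever hits first
    # nil selectors = discover ServiceMonitors in every namespace
    serviceMonitorSelectorNilUsesHelmValues: false
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-flash
          resources:
            requests:
              storage: 20Gi
```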
### Enabled components
| Component | Status | Notes |
|---|---|---|
| Prometheus | ✅ | Core metrics collection |
| Alertmanager | ✅ | Alert routing → Telegram |
| kube-state-metrics | ✅ | K8s resource metrics |
| node-exporter | ✅ | OS/hardware metrics |
| Grafana | ❌ | Managed separately in apps/grafana |
### Disabled components (Talos)
On Talos Linux these components do not expose metrics in the standard way:
- `kubeProxy` → disabled
- `kubeControllerManager` → disabled
- `kubeScheduler` → disabled
- `kubeEtcd` → disabled
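In the chart's values, each of these (and the externally managed Grafana) is turned off with a top-level `enabled: false` toggle — a sketch using the chart's field names:

```yaml
# kube-prometheus-stack values: components disabled on this cluster
kubeProxy:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
grafana:
  enabled: false   # managed separately in apps/grafana
```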
## Alertmanager
### Configuration
The full configuration is in a SOPS Secret (alertmanager-config) containing:
- Routing rules
- Telegram receiver (bot_token + chat_id)
- Inhibition rules
- Group/repeat intervals
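A minimal sketch of what such an Alertmanager config looks like. The real values (token, chat ID, intervals) live in the SOPS secret; the numbers below are placeholders, but the field names (`telegram_configs`, `bot_token`, `chat_id`) are Alertmanager's native Telegram receiver:

```yaml
# Sketch only — actual config is SOPS-encrypted in alertmanager-config
route:
  receiver: telegram
  group_by: [alertname, namespace]
  group_wait: 30s        # placeholder intervals
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "<bot token from SOPS>"
        chat_id: 123456789   # placeholder
        parse_mode: HTML
inhibit_rules:
  # a firing critical alert silences the matching warning
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [alertname, namespace]
```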
### Active alerts (examples)
| Alert | Severity | Description |
|---|---|---|
| KubePodCrashLooping | critical | Pod in crash loop |
| KubePodNotReady | warning | Pod not ready > 15m |
| NodeFilesystemSpaceFillingUp | warning | Disk almost full |
| NodeMemoryHighUtilization | warning | RAM > 90% |
| PrometheusTargetDown | critical | Target scrape failed |
| NfsServerUnreachable | critical | All nodes without NFS traffic > 30m |
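Most of these rules ship with kube-prometheus-stack; `NfsServerUnreachable` is custom. An illustrative PrometheusRule for it might look like the following — the metric and expression are assumptions (node-exporter's NFS client collector exposes `node_nfs_requests_total`), not the repo's actual rule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfs-server
  namespace: monitoring
spec:
  groups:
    - name: nfs.rules
      rules:
        - alert: NfsServerUnreachable
          # no NFS client requests from any node — expression is illustrative
          expr: sum(rate(node_nfs_requests_total[5m])) == 0
          for: 30m
          labels:
            severity: critical
          annotations:
            summary: All nodes without NFS traffic for more than 30 minutes
```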
## Flux Notifications
Alerting separate from Prometheus monitoring, specific to GitOps errors:
- Trigger: any Flux resource in `error` state
- Channel: same Telegram bot
- Filters: ignores routine messages (`waiting.*retrying`, `Health check passed`)
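This wiring is done with the notification-controller's `Provider` and `Alert` resources. A sketch using Flux's API (the chat ID is a placeholder; the exclusion patterns are the ones listed above):

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: telegram
  namespace: flux-system
spec:
  type: telegram
  channel: "123456789"        # Telegram chat_id, placeholder
  secretRef:
    name: telegram-bot-token  # bot token, from SOPS
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: flux-errors
  namespace: flux-system
spec:
  providerRef:
    name: telegram
  eventSeverity: error        # only error events, not info
  eventSources:
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
  exclusionList:              # drop routine noise
    - "waiting.*retrying"
    - "Health check passed"
```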
## Gatus
Uptime monitoring with periodic HTTP checks:
- Dashboard: `status.${DOMAIN}`
- Checks: health endpoint of each service
- Alert: Telegram notification if a service is down
- History: persisted on nfs-spacex
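A sketch of one such check in Gatus's config format — endpoint name, URL path, and chat ID are assumptions, but `endpoints`, `conditions`, and the `telegram` alert type are Gatus's own schema:

```yaml
# Gatus config sketch — one endpoint, Telegram alerting
alerting:
  telegram:
    token: "<bot token>"   # same bot, from SOPS
    id: "123456789"        # chat id, placeholder
endpoints:
  - name: grafana          # hypothetical example service
    url: https://grafana.${DOMAIN}/api/health
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: telegram
        send-on-resolved: true
```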
## Grafana
- URL: `grafana.${DOMAIN}`
- Auth: OIDC via Authentik
- Datasource: Prometheus (auto-configured)
- Storage: nfs-flash for persistent dashboards
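In the Grafana Helm chart these settings roughly correspond to the values below. The Authentik hostname and client ID are placeholders; the endpoint paths follow Authentik's OIDC layout, and the datasource URL assumes the chart's default Prometheus service name:

```yaml
# Grafana chart values sketch — not the actual release
grafana.ini:
  auth.generic_oauth:
    enabled: true
    name: Authentik
    client_id: grafana    # placeholder
    scopes: openid profile email
    auth_url: https://auth.${DOMAIN}/application/o/authorize/
    token_url: https://auth.${DOMAIN}/application/o/token/
    api_url: https://auth.${DOMAIN}/application/o/userinfo/
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://kube-prometheus-stack-prometheus.monitoring:9090
        isDefault: true
persistence:
  enabled: true
  storageClassName: nfs-flash
```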
## Where notifications go
```
📱 Telegram
├── 🔴 Prometheus Alertmanager (cluster/app metrics)
├── 🟠 Flux Notifications (GitOps deploy errors)
└── 🟡 Gatus (HTTP uptime checks)
```
All notifications arrive in the same Telegram chat, with different prefixes to distinguish the source.