Most Raspberry Pi projects run fine until they do not. Without metrics and logs, troubleshooting becomes guesswork. This post describes a minimal but production-like observability stack that fits on Pi hardware.
1. Monitoring objectives
Define what you need to detect:
- service down events
- thermal throttling
- disk pressure
- memory leaks
- network instability
Monitoring without clear objectives leads to dashboards that look impressive but do not prevent incidents.
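Of the objectives above, thermal throttling is the most Pi-specific: Raspberry Pi OS reports it through `vcgencmd get_throttled` as a hex bit field. A minimal decoder sketch for that output (the bit meanings follow the official firmware documentation):

```python
# Decode the bit flags reported by `vcgencmd get_throttled` on Raspberry Pi OS.
# Low bits describe the current state; bits 16-19 record past events.
FLAGS = {
    0: "under-voltage now",
    1: "ARM frequency capped now",
    2: "throttled now",
    3: "soft temperature limit now",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}

def decode_throttled(raw: str) -> list[str]:
    """Parse output like 'throttled=0x50005' into human-readable flags."""
    value = int(raw.strip().split("=")[1], 16)
    return [desc for bit, desc in FLAGS.items() if value & (1 << bit)]

print(decode_throttled("throttled=0x50005"))
```

Anything other than `throttled=0x0` is worth an alert; the "has occurred" bits catch throttling that happened while you were not looking.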
2. Suggested stack
A practical setup for one or several Pis:
- Node Exporter for host metrics
- Prometheus for scraping and retention
- Grafana for dashboards
- Loki + Promtail for logs (optional but valuable)
- Alertmanager for notifications
For small setups, run everything in Docker Compose on one Pi 4 with SSD storage.
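A minimal Compose sketch for that single-Pi setup (image tags, ports, and volume names here are illustrative; pin exact versions for your own deployment):

```yaml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"   # loopback only; see the security section
  node-exporter:
    image: prom/node-exporter
    pid: host
    volumes:
      - /:/host:ro,rslave
    command: ["--path.rootfs=/host"]
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
volumes:
  prom-data:
```

Loki, Promtail, and Alertmanager slot into the same file as additional services once the core three are stable.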
3. Storage and retention planning
The weakest point is usually storage, not CPU. Prefer SSD over SD card for long-running telemetry.
Retention guidance:
- high-resolution metrics: 7 to 14 days
- downsampled trends: 30 to 90 days
- logs: based on incident analysis needs
Always set explicit retention and maximum disk usage policies.
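For Prometheus, both limits are startup flags; a sketch matching the guidance above (the size cap is an assumption sized for a small SSD):

```yaml
# Prometheus startup flags (Compose `command:` list or a systemd unit)
command:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.retention.time=14d"  # drop high-resolution samples after two weeks
  - "--storage.tsdb.retention.size=8GB"  # hard cap on disk usage regardless of age
```

Whichever limit is hit first wins, so the size cap protects the disk even if scrape volume grows unexpectedly.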
4. Baseline dashboard design
Your first dashboard should answer these questions in seconds:
- Is the host healthy right now?
- Which service changed recently?
- Is this a compute, memory, disk, or network issue?
Core panels:
- CPU load and temperature
- memory used and swap activity
- disk free and I/O latency
- network throughput and drops
- service restart counters
Avoid overloading a dashboard with low-value charts.
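The core panels map onto a handful of PromQL expressions. A sketch using common node_exporter metric names (exact names and labels vary by exporter version and hardware, so verify them against your own /metrics endpoint):

```promql
# CPU temperature (thermal_zone collector; label value differs per board)
node_thermal_zone_temp{type="cpu-thermal"}

# Fraction of memory actually available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Fraction of root filesystem free
node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"}

# Network receive errors per second over 5 minutes
rate(node_network_receive_errs_total[5m])
```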
5. Alerting model
Alerts should be actionable and low-noise. Start with:
- host unreachable for 2 minutes
- disk free below 15 percent
- sustained high temperature
- service restart loop detected
Include runbook hints in alert descriptions, such as command snippets to inspect logs.
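A Prometheus rule-file sketch covering the first three alerts, with runbook hints in the annotations (the `job` label, thresholds, and metric names are assumptions to adapt to your setup):

```yaml
groups:
  - name: pi-baseline
    rules:
      - alert: HostUnreachable
        expr: up{job="node"} == 0
        for: 2m
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
          runbook: "Try ssh; if up, check: journalctl -u node_exporter -n 50"
      - alert: DiskFreeLow
        expr: >
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.15
        for: 10m
        annotations:
          summary: "Root filesystem below 15% free on {{ $labels.instance }}"
          runbook: "Inspect growth: du -xh / 2>/dev/null | sort -h | tail -20"
      - alert: HighTemperature
        expr: node_thermal_zone_temp > 75
        for: 15m
        annotations:
          summary: "Sustained high temperature on {{ $labels.instance }}"
          runbook: "Check airflow and load; vcgencmd get_throttled"
```

The `for:` durations are what keep these alerts low-noise: a brief spike never pages you, a sustained condition always does.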
6. Log strategy
Metrics tell you that something is wrong. Logs explain why. Standardize the log format emitted by your own services:
- timestamp
- severity
- component
- request or correlation ID
- clear error message
Unstructured logs slow incident response significantly.
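One way to get all five fields consistently is to emit one JSON object per line, which Promtail and Loki can then parse. A minimal sketch using the standard library (field names and the `sensor-reader` component are illustrative):

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the fields listed above."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "severity": record.levelname,
            "component": record.name,
            # Caller supplies this via `extra=`; absent means no correlation
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("sensor-reader")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("read timeout on /dev/ttyUSB0",
             extra={"correlation_id": str(uuid.uuid4())})
```

Grepping a correlation ID across services then reconstructs a whole request path in one command.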
7. Security hygiene
Monitoring components often expose sensitive internal data. Protect them:
- bind admin UI to private interfaces
- put dashboards behind auth
- rotate credentials and API keys
- keep base images patched
Do not leave Grafana default credentials in place, even in a private network.
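In Compose, the first three points reduce to a few lines on the Grafana service; a sketch using Grafana's standard environment overrides (the secrets path is an assumption):

```yaml
grafana:
  image: grafana/grafana
  ports:
    - "127.0.0.1:3000:3000"  # loopback only; reach via SSH tunnel or an authenticating reverse proxy
  environment:
    GF_SECURITY_ADMIN_USER: admin
    # __FILE suffix reads the value from a file instead of the environment
    GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_admin_password
    GF_USERS_ALLOW_SIGN_UP: "false"
```

Binding to 127.0.0.1 also covers Prometheus and Alertmanager, whose UIs have no authentication of their own by default.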
8. Maintenance routine
Every month:
- verify backup of monitoring config
- review alert fatigue and disable noisy rules
- test one simulated failure end to end
- review retention vs available storage
Observability is an ongoing system, not a one-time install.
Final note
A small observability stack on Raspberry Pi pays for itself quickly. The first time an alert catches a failing disk before total outage, the setup effort is already justified.