Observability

Monitoring Tools: Prometheus & Grafana

Metrics, alerting, and dashboards used by SRE teams to keep production reliable.

Yash Gupta
Aug 2025
8 min read

Prometheus Deep Dive

Architecture & Key Components:

  • Pull Model: Prometheus scrapes targets over HTTP at defined intervals (configured via scrape_configs). Combined with service discovery, this avoids hardcoding ephemeral IPs in dynamic environments
  • Service Discovery: Integrates with Kubernetes, Consul, AWS EC2, etc., to auto-detect monitoring targets
  • Storage: Time-series database (TSDB) with local on-disk storage, configurable retention (default 15d), and block compaction
  • Pushgateway: For short-lived jobs (e.g., cron jobs) that can't be scraped directly
  • Exporters: 900+ community exporters (e.g., Node Exporter for OS metrics, JMX Exporter for Java apps)
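
To make the pull model concrete, here is a minimal sketch of the text an exporter might serve on its /metrics endpoint for Prometheus to scrape. The function names (escape_label_value, render_counter) are hypothetical helpers written for this illustration, not part of any Prometheus client library, and the sketch only covers counters and does not escape HELP text.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs (an assumption of this sketch).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "GET", "status": "500"}, 3)],
    ))
```

In practice an official client library would normally generate this output; the sketch is only meant to show what a scrape returns.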

Advanced PromQL:

# Per-second rate of HTTP 500 responses, averaged over the last 5m
rate(http_requests_total{status="500"}[5m])

# Prediction: Disk space exhaustion in 4h
predict_linear(node_filesystem_free_bytes[6h], 4*3600) < 0

# Availability: target up more than 99.9% of the time over 28d (up tracks scrape success, not request success rate)
avg_over_time(up{job="api"}[28d]) > 0.999
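
The idea behind predict_linear can be sketched in a few lines of Python: fit a least-squares line through the samples in the range and extrapolate it forward. This is an illustration, not Prometheus's implementation. The function name predict_linear_sketch is hypothetical, and it assumes the evaluation time equals the timestamp of the last sample, whereas Prometheus extrapolates from the query evaluation time.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Extrapolate a least-squares line fitted to (timestamp_seconds, value) pairs.

    Returns the fitted value horizon_seconds after the last sample's timestamp.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    eval_time = samples[-1][0]  # assumption: evaluate at the last sample
    return slope * (eval_time + horizon_seconds) + intercept


if __name__ == "__main__":
    # Made-up data: free space drops 1 GB per hour, from 9 GB, over 6 hours.
    history = [(hour * 3600, 9e9 - 1e9 * hour) for hour in range(7)]
    projected = predict_linear_sketch(history, 4 * 3600)
    print("alert" if projected < 0 else "ok", projected)
```

A negative projection is what makes the `< 0` comparison in the query above fire.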

Limitations & Solutions:

  • Scalability: Federated Prometheus for hierarchical aggregation (e.g., regional → global)
  • Long-term Storage: Integrate with Thanos, Cortex, or Mimir for long-term retention backed by object storage

Grafana Advanced Implementation

Dynamic Dashboards:

  • Variables: Create dropdowns for environment (env=prod|staging), service, or datacenter
  • Annotations: Overlay deployment events or incidents from CI/CD pipelines (e.g., via webhooks)
  • Mixed Data Sources: Correlate Prometheus metrics with Loki logs or Tempo traces
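
As one way to feed deployment events into annotations, a CI/CD step could call Grafana's HTTP annotations endpoint (POST /api/annotations). The sketch below only builds the request. The URL, token, and tag values are placeholders, the helper name build_annotation_request is hypothetical, and the payload fields should be checked against the API documentation for the Grafana version in use.

```python
import json
import time
import urllib.request


def build_annotation_request(grafana_url, api_token, text, tags, when_ms=None):
    """Build (but do not send) a request that creates a Grafana annotation."""
    payload = {
        # Grafana expects epoch milliseconds for annotation times.
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": list(tags),
        "text": text,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_annotation_request(
        "https://grafana.example.com",  # placeholder URL
        "REPLACE_WITH_TOKEN",           # placeholder service account token
        "Deployed api v1.2.3",
        ["deploy", "api"],
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```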

SLO Visualization:

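Define the SLI (e.g., HTTP success rate) and its SLO target (e.g., 99.95%). Configure error budget burn rate in Grafana:

# Error budget remaining: measured success ratio minus the SLO target (negative means the budget is spent)
(1 - sum(increase(http_requests_failed_total[7d])) / sum(increase(http_requests_total[7d]))) - 0.9995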

Alert when budget consumption exceeds 2% per hour.
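
The arithmetic behind these numbers can be sketched as follows, assuming the 7d window and the 99.95% target used above. The function names are hypothetical. A burn rate of 1 means the budget would be spent exactly at the end of the window.

```python
def error_budget(slo_target):
    """Allowed failure ratio, e.g. 0.0005 for a 99.95% target."""
    return 1.0 - slo_target


def budget_remaining_fraction(failed, total, slo_target):
    """Fraction of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if total == 0:
        return 1.0
    return 1.0 - (failed / total) / error_budget(slo_target)


def burn_rate_for_consumption(consumed_fraction, alert_window_hours, slo_window_hours):
    """Burn rate that consumes consumed_fraction of the budget in alert_window_hours."""
    return consumed_fraction * slo_window_hours / alert_window_hours


if __name__ == "__main__":
    slo = 0.9995
    # 2% of the budget per hour over a 7d (168h) window.
    rate = burn_rate_for_consumption(0.02, 1, 7 * 24)
    print("burn rate threshold:", rate)
    print("error ratio threshold:", rate * error_budget(slo))
    print("budget left:", budget_remaining_fraction(25, 100_000, slo))
```

With these inputs the 2%-per-hour rule corresponds to a burn rate of about 3.36, which is an hourly error ratio of roughly 0.17%.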

Alerting Pipeline:

  • Grafana Alert Rules: Multi-dimensional (e.g., cluster=eu-west, app=payment)
  • Notification Policies: Route critical alerts to PagerDuty/Slack, low-priority to email
  • Silencing: Mute alerts during maintenance windows

Advanced Use Cases

1. Kubernetes & Cloud-Native Monitoring

  • Control Plane: Monitor etcd latency, API server errors, scheduler queue depth
  • Workloads: Track HPA scaling events, OOMKills, PVC capacity
  • Custom Metrics: Expose app metrics (e.g., orders_processed_sec) for autoscaling

Example Dashboard: Per-namespace resource usage, ingress error rates, CRD health

2. Distributed Tracing Integration

  • OpenTelemetry Collector: Ingest traces → Tempo/Jaeger
  • Grafana Explore: Jump from Prometheus metric spike (e.g., high latency) to correlated traces

3. Business KPI Monitoring

Metric Examples:

  • cart_abandonment_rate
  • user_signups_per_minute
  • payment_failure_rate

ETL Pipeline: Ingest business metrics via Pushgateway or custom exporter
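
A batch ETL job could push a business metric to the Pushgateway with a plain HTTP request. The sketch below builds the request without sending it. The gateway URL, job name, and instance label are placeholders, and build_push_request is a hypothetical helper. It assumes label values contain no "/", since the Pushgateway needs a special encoding for those.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway_url, job, metric_name, value, grouping=None):
    """Build (but do not send) a PUT that replaces one metric group on a Pushgateway."""
    grouping = grouping or {}
    for part in [job, *grouping.values()]:
        if "/" in part:
            raise ValueError("this sketch does not handle '/' in job or label values")
    # Grouping key path: /metrics/job/<job>/<label>/<value>...
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for label, label_value in grouping.items():
        path += f"/{label}/" + urllib.parse.quote(label_value, safe="")
    # Body uses the text exposition format and must end with a newline.
    body = f"# TYPE {metric_name} gauge\n{metric_name} {value}\n"
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


if __name__ == "__main__":
    request = build_push_request(
        "http://pushgateway.example.com:9091",  # placeholder address
        "nightly_etl",
        "cart_abandonment_rate",
        0.42,
        {"instance": "etl-1"},
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

PUT replaces every metric in the group, while POST only replaces metrics with the same name, so the choice depends on whether the job owns the whole group.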

4. Multi-Cluster / Hybrid Cloud

  • Thanos Setup: Global query layer across 10+ Prometheus instances
  • Grafana Data Sources: Unified view of on-prem + AWS/GCP/Azure metrics

SRE Best Practices

Define Critical Dashboards:

  • Golden Signals: Latency, Traffic, Errors, Saturation (complemented by the USE and RED methods)
  • Dependency Map: Service topology with upstream/downstream health

Alert Design:

  • Avoid "alert storms" – aggregate using sum() or max()
  • Use multi-window burn-rate alerts for SLOs
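
The multi-window idea is that an alert fires only when both a long window and a short window show a high burn rate, so the page stops soon after the problem does. The sketch below illustrates that condition. The function names are hypothetical, and the 14.4 threshold and the 1h/5m window pairing are assumptions taken from commonly cited SRE guidance for a 30-day SLO, not values from this article.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold):
    """Fire only if both windows exceed the burn-rate threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    slo = 0.999
    # Ongoing incident: both the 1h and 5m windows are burning fast.
    print(should_page(0.02, 0.03, slo, threshold=14.4))
    # Recovered incident: the 1h window is still high, the 5m window is healthy.
    print(should_page(0.02, 0.001, slo, threshold=14.4))
```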

Runbook Integration:

  • Link Grafana alerts to Confluence/Squadcast runbooks
  • Example: "High CPU → Check runbook #23 for node diagnostics"

Chaos Engineering:

Monitor Gremlin/Chaos Mesh experiments in real-time dashboards

Emerging Trends

  • eBPF Integration: Monitor network/security via Pixie or Kindling exporters
  • Continuous Profiling: Grafana Pyroscope (which absorbed Phlare) integration with Grafana
  • AIOps: Anomaly detection and forecasting using Grafana ML, alongside simpler PromQL checks (e.g., deviations from predict_linear)

Security Hardening:

Prometheus:

  • TLS/mTLS for scrape endpoints
  • Least-privilege Kubernetes RBAC for the Prometheus ServiceAccount

Grafana:

  • SAML/OAuth
  • Dashboard permissions
  • Encrypted credentials

Optimization Tips:

  • Reduce metric cardinality (avoid high-cardinality labels like user_id)
  • Use recording rules for expensive PromQL queries
  • Restrict who can query each data source using Grafana data source permissions
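
To see why a label like user_id is a problem, note that the worst-case number of time series for a metric is the product of the distinct values of each label. A rough sketch, with made-up label sets and a hypothetical helper name:

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


if __name__ == "__main__":
    labels = {
        "method": {"GET", "POST"},
        "status": {"200", "404", "500"},
    }
    print(worst_case_series(labels))

    # Adding a per-user label multiplies the series count by the number of users.
    labels["user_id"] = {f"user-{i}" for i in range(10_000)}
    print(worst_case_series(labels))
```

Real cardinality is usually lower than this bound because not every label combination occurs, but the multiplication shows how quickly one unbounded label dominates.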

Conclusion

This guide covered Prometheus architecture and PromQL, Grafana dashboards and alerting, and the SRE practices that turn them into production-grade observability.