Overview
- Namespace: monitoring
- Purpose: Prometheus/Grafana Stack - PRODUCTION
- Age: ~382 days (since October 2023)
- Status: Active - Complete monitoring and observability stack
- Workloads: 10+ Deployments, StatefulSets, and DaemonSets (all active)
- Environment: PRODUCTION - Metrics collection and visualization
Architecture
Comprehensive monitoring stack with Prometheus, Grafana, Loki, and Alloy:
- Prometheus: Metrics collection (StatefulSet, 1 replica)
- AlertManager: Alert routing and management (StatefulSet, 1 replica)
- Grafana: Metrics visualization and dashboards (Deployment, 1 replica)
- Loki: Log aggregation (StatefulSet, 2 replicas)
- Promtail: Log collection (DaemonSet, 1 per node)
- Alloy: Observability agent (StatefulSet, 1 replica)
- Node Exporter: Node metrics (DaemonSet, 1 per node)
- Kube State Metrics: Kubernetes state metrics (1 Deployment)
- Loki Canary: Log availability monitoring (DaemonSet)
Auto-Scaling Configuration
❌ Not Auto-Scaled:
- Monitoring stack uses fixed replicas
- DaemonSets run on all nodes (1 per node)
- StatefulSets maintain persistent state
Workload Categories
Core Monitoring (StatefulSets)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| alertmanager-kube-prometheus-stack-alertmanager | 1/1 | Running | Alert routing and grouping |
| prometheus (exact name not listed) | 1/1 | Running | Metrics collection |
Metrics Collection
| Name | Type | Status | Purpose |
|---|---|---|---|
| kube-prometheus-stack-kube-state-metrics | Deployment | Running | Kubernetes object metrics |
| kube-prometheus-stack-operator | Deployment | Running | Prometheus Operator |
Log Aggregation
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| loki-stack | 2/2 | Running | Log aggregation (StatefulSet) |
| loki-canary (DaemonSet) | N/N | Running | Log availability monitoring |
| loki-stack-promtail (DaemonSet) | N/N | Running | Log collection from all nodes |
Visualization & Collection
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| grafana | 1/1 | Running | Metrics dashboards |
| alloy | 1/1 | Running | Observability agent |
| node-exporter (DaemonSet) | N/N | Running | Node system metrics |
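The Replicas and Status columns in the tables above can be re-checked against the live cluster. The following is a minimal sketch, assuming `kubectl` is on the PATH with read access to the `monitoring` namespace. The function names are hypothetical; the status fields (`readyReplicas`, `numberReady`, `desiredNumberScheduled`) are the standard Kubernetes ones.

```python
import json
import subprocess


def summarize_readiness(workload_list):
    """Return {"Kind/name": (ready, desired)} from a kubectl JSON list."""
    summary = {}
    for item in workload_list.get("items", []):
        kind = item["kind"]
        name = item["metadata"]["name"]
        status = item.get("status", {})
        if kind == "DaemonSet":
            # DaemonSets report per-node counts rather than replicas.
            ready = status.get("numberReady", 0)
            desired = status.get("desiredNumberScheduled", 0)
        else:
            # Deployments and StatefulSets omit readyReplicas when it is zero.
            ready = status.get("readyReplicas", 0)
            desired = item.get("spec", {}).get("replicas", 1)
        summary[f"{kind}/{name}"] = (ready, desired)
    return summary


def not_ready(summary):
    """Sorted names of workloads whose ready count is below the desired count."""
    return sorted(
        name for name, (ready, desired) in summary.items() if ready < desired
    )


def fetch_workloads(namespace="monitoring"):
    """Fetch Deployments, StatefulSets, and DaemonSets as parsed JSON.

    Assumption: kubectl is installed and configured for the target cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "deployments,statefulsets,daemonsets",
         "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `not_ready(summarize_readiness(fetch_workloads()))` should return an empty list when every workload matches its N/N entry in the tables.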
Recommendations
- Prometheus HA: Consider running 2+ replicas with external storage; the current single replica is a single point of failure
- AlertManager Clustering: Configure a 3-member cluster for HA (currently 1 replica)
- Loki: The current 2 replicas are sufficient; monitor storage growth
- Storage Monitoring: Check disk usage on the StatefulSet volumes regularly
- Monitoring Priorities: Scrape success, alert delivery, and log ingestion
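Of the monitoring priorities listed above, scrape success is the simplest to spot-check, because Prometheus records an `up` sample (1 or 0) for every scrape target. The sketch below parses the response of an instant query against the Prometheus HTTP API (`/api/v1/query`). The base URL is an assumption (for example, a port-forwarded Prometheus service) and the function names are hypothetical. Alert delivery and log ingestion are not covered here.

```python
import json
import urllib.parse
import urllib.request


def failing_targets(query_response):
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in query_response["data"]["result"]:
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "1"]}.
        _timestamp, value = sample["value"]
        if float(value) == 0.0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)


def scrape_success_ratio(query_response):
    """Fraction of targets reporting up == 1, or None if there are no targets."""
    results = query_response["data"]["result"]
    if not results:
        return None
    healthy = sum(1 for sample in results if float(sample["value"][1]) == 1.0)
    return healthy / len(results)


def query_up(base_url):
    """Run the instant query `up`.

    Assumption: base_url points at a reachable Prometheus server.
    """
    params = urllib.parse.urlencode({"query": "up"})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

A ratio below 1.0, or a non-empty `failing_targets` result, indicates targets that Prometheus could not scrape at query time.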
Current Scale
- Prometheus: 1 StatefulSet, 1 replica
- AlertManager: 1 StatefulSet, 1 replica
- Loki: 1 StatefulSet, 2 replicas
- Grafana: 1 Deployment, 1 replica
- Node Exporters: DaemonSet (1 per node)
- Total Active Pods: 15+ pods
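The pod total depends on node count, because three components run as DaemonSets. This rough sketch shows the arithmetic, assuming each DaemonSet schedules exactly one pod on every node and that kube-state-metrics and the Prometheus Operator each run a single replica (the operator's replica count is not stated in this document). The constant and function names are hypothetical.

```python
# Fixed-replica workloads, taken from the Architecture and Current Scale lists.
# Assumption: kube-state-metrics and the operator run one replica each.
FIXED_REPLICAS = {
    "prometheus": 1,
    "alertmanager": 1,
    "loki": 2,
    "grafana": 1,
    "alloy": 1,
    "kube-state-metrics": 1,
    "prometheus-operator": 1,
}

# DaemonSets contribute one pod per node (assuming no taints exclude nodes).
DAEMONSETS = ("node-exporter", "promtail", "loki-canary")


def expected_pods(node_count):
    """Estimate total pods in the namespace for a given node count."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return sum(FIXED_REPLICAS.values()) + len(DAEMONSETS) * node_count
```

Under these assumptions the fixed workloads account for 8 pods, and a three-node cluster would run 17 pods in total.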
Stability
- Stack Age: ~382 days (very mature)
- Status: All components healthy
- Critical Role: Observability for entire cluster