monitoring

Overview

  • Namespace: monitoring
  • Purpose: Prometheus/Grafana Stack - PRODUCTION
  • Age: ~382 days (since October 2023)
  • Status: Active - Complete monitoring and observability stack
  • Workloads: 10+ Deployments, StatefulSets, and DaemonSets (all active)
  • Environment: PRODUCTION - Metrics collection and visualization

Architecture

Comprehensive monitoring stack with Prometheus, Grafana, Loki, and Alloy:

  • Prometheus: Metrics collection (StatefulSet, 1 replica)
  • AlertManager: Alert routing and management (StatefulSet, 1 replica)
  • Grafana: Metrics visualization and dashboards (Deployment, 1 replica)
  • Loki: Log aggregation (StatefulSet, 2 replicas)
  • Promtail: Log collection (DaemonSet, 1 per node)
  • Alloy: Observability agent (StatefulSet, 1 replica)
  • Node Exporter: Node metrics (DaemonSet, 1 per node)
  • Kube State Metrics: Kubernetes state metrics (1 deployment)
  • Loki Canary: Log availability monitoring (DaemonSet)

Auto-Scaling Configuration

Not Auto-Scaled:

  • Monitoring stack uses fixed replicas
  • DaemonSets run on all nodes (1 per node)
  • StatefulSets maintain persistent state

Workload Categories

Core Monitoring (StatefulSets)

Name | Replicas | Status | Purpose
alertmanager-kube-prometheus-stack-alertmanager | 1/1 | Running | Alert routing and grouping
prometheus (implied) | 1/1 | Running | Metrics collection (StatefulSet)

Metrics Collection

Name | Type | Status | Purpose
kube-prometheus-stack-kube-state-metrics | Deployment | Running | Kubernetes object metrics
kube-prometheus-stack-operator | Deployment | Running | Prometheus Operator

Log Aggregation

Name | Replicas | Status | Purpose
loki-stack | 2/2 | Running | Log aggregation (StatefulSet)
loki-canary (DaemonSet) | N/N | Running | Log availability monitoring
loki-stack-promtail (DaemonSet) | N/N | Running | Log collection from all nodes

Visualization & Collection

Name | Replicas | Status | Purpose
grafana | 1/1 | Running | Metrics dashboards
alloy | 1/1 | Running | Observability agent
node-exporter (DaemonSet) | N/N | Running | Node system metrics
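
The Replicas columns above use the ready/desired form (for example 2/2). As a minimal sketch of how such cells could be checked automatically, the Python below parses them and flags any workload whose ready count falls short. The `Ready` type and `parse_ready` helper are hypothetical names introduced for illustration; the sample values are the fixed-replica rows from the tables, with DaemonSets left out because their N/N counts depend on the node count.

```python
from typing import NamedTuple


class Ready(NamedTuple):
    ready: int
    desired: int

    @property
    def healthy(self) -> bool:
        # Healthy only when at least one pod is desired and all are ready.
        return self.desired > 0 and self.ready == self.desired


def parse_ready(cell: str) -> Ready:
    """Parse a 'ready/desired' cell such as '2/2' into integers."""
    ready, desired = cell.strip().split("/")
    return Ready(int(ready), int(desired))


# Fixed-replica rows copied from the workload tables.
observed = {
    "alertmanager-kube-prometheus-stack-alertmanager": "1/1",
    "loki-stack": "2/2",
    "grafana": "1/1",
    "alloy": "1/1",
}

unhealthy = [name for name, cell in observed.items()
             if not parse_ready(cell).healthy]
print(unhealthy or "all fixed-replica workloads ready")
```

A cell such as 1/2 would parse to `Ready(1, 2)` and be reported as unhealthy.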

Recommendations

  1. Prometheus HA: Consider 2+ replicas with external storage
  2. AlertManager Clustering: Configure 3-member cluster for HA
  3. Loki: The current 2 replicas are adequate; monitor storage growth
  4. Storage Monitoring: Check disk usage regularly
  5. Monitoring Priorities: Scrape success, alert delivery, log ingestion
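
The three monitoring priorities in item 5 could each be backed by a PromQL expression. The Python sketch below builds instant-query URLs for the Prometheus HTTP API (`/api/v1/query`). The service address is a hypothetical placeholder, and the expressions are illustrative choices rather than queries taken from this stack: they assume Alertmanager and Loki metrics are scraped by Prometheus under their default metric names, which should be verified against the running deployment.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster address; substitute the real Prometheus service.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# One illustrative PromQL expression per priority (assumed metric names).
PRIORITY_QUERIES = {
    # Number of scrape targets currently reporting as down.
    "scrape_success": "count(up == 0)",
    # Rate of failed Alertmanager notifications over 5 minutes.
    "alert_delivery": "sum(rate(alertmanager_notifications_failed_total[5m]))",
    # Rate of log lines received by Loki over 5 minutes.
    "log_ingestion": "sum(rate(loki_distributor_lines_received_total[5m]))",
}


def instant_query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


for name, expr in PRIORITY_QUERIES.items():
    print(name, instant_query_url(expr))
```

The same expressions could also serve as starting points for alert rules, with thresholds chosen for this cluster.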

Performance Metrics

Current Scale

  • Prometheus: 1 StatefulSet, 1 replica
  • AlertManager: 1 StatefulSet, 1 replica
  • Loki: 1 StatefulSet, 2 replicas
  • Grafana: 1 Deployment, 1 replica
  • Node Exporters: DaemonSet (1 per node)
  • Total Active Pods: 15+ pods
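
Because replicas are fixed and each DaemonSet runs one pod per node, the total pod count can be estimated from the node count alone. The sketch below works through that arithmetic. It assumes one replica each for the kube-state-metrics and Prometheus Operator Deployments (the inventory lists them without a count) and that every DaemonSet schedules on every node; `expected_pods` is a hypothetical helper name.

```python
# Fixed-replica workloads, taken from the inventory above.
FIXED_REPLICAS = {
    "prometheus": 1,
    "alertmanager": 1,
    "loki-stack": 2,
    "alloy": 1,
    "grafana": 1,
    "kube-state-metrics": 1,   # assumption: replica count not listed
    "prometheus-operator": 1,  # assumption: replica count not listed
}

# DaemonSets contribute one pod per node.
DAEMONSETS = ["loki-stack-promtail", "node-exporter", "loki-canary"]


def expected_pods(node_count: int) -> int:
    """Estimate total pods: fixed replicas plus one pod per DaemonSet per node."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return sum(FIXED_REPLICAS.values()) + len(DAEMONSETS) * node_count


for nodes in range(1, 5):
    print(f"{nodes} node(s): {expected_pods(nodes)} pods")
# 1 node(s): 11 pods
# 2 node(s): 14 pods
# 3 node(s): 17 pods
# 4 node(s): 20 pods
```

Under these assumptions, the reported 15+ pods would correspond to a cluster of at least three nodes.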

Stability

  • Stack Age: ~382 days (very mature)
  • Status: All components healthy
  • Critical Role: Observability for entire cluster