spc--noti-centre--be

Overview

  • Namespace: spc--noti-centre--be
  • Purpose: Sapoche Notification Centre Backend - PRODUCTION
  • Age: 51 days (~1.7 months, since September 2025)
  • Status: 🔴 CRITICAL - BROKEN - All pods failing or scaled to 0
  • Workloads: 2 deployments (0 active, 2 broken/scaled to 0)
  • Environment: PRODUCTION - Notification center (COMPLETELY DOWN!)

Architecture

Notification centre system - CURRENTLY COMPLETELY BROKEN:

  • Main Application: REST API backend - Scaled to 0/0
  • Event Consumer: Default queue processing - 🔴 CrashLoopBackOff (6003 restarts!)

Auto-Scaling Configuration

No Auto-Scaling Configured:

  • No HorizontalPodAutoscalers (HPAs)
  • No KEDA scaled objects
  • Fixed replica counts

Workload Categories

Main Application (1 deployment - Scaled to 0)

Name | Replicas | Status | Purpose
spc--noti-centre--be--app--prod | 0/0 | Scaled to 0 | Main notification API (INACTIVE)

Event Consumer (1 deployment - BROKEN)

Name | Replicas | Status | Purpose
consumer-default | 0/1 | 🔴 CrashLoopBackOff | Default queue (6003 restarts in 21 days!)

Critical Issues

🔴 CRITICAL - NAMESPACE COMPLETELY BROKEN

Consumer Pod Status:

  • CrashLoopBackOff: 1 pod with 6003 restarts in 21 days (~286 restarts/day!)
  • ContainerStatusUnknown: 2 pods (244 and 5765 restarts)
  • Evicted: 22 pods evicted (resource pressure or node issues)
  • Desired 1, Available 0: Consumer cannot start successfully
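The restart rate quoted above is plain division of the restart count by the pod's age. A minimal sketch that reproduces the arithmetic (the function name `restart_rate` is illustrative and not part of any tool):

```shell
# restart_rate RESTARTS DAYS
# Prints restarts per day and per hour, rounded to the nearest whole number.
restart_rate() {
  awk -v r="$1" -v d="$2" 'BEGIN {
    per_day = r / d
    printf "%.0f per day, %.0f per hour\n", per_day, per_day / 24
  }'
}

# Figures from this page: 6003 restarts over 21 days.
restart_rate 6003 21
```

At roughly 12 restarts per hour the pod spends most of its time in back-off, since CrashLoopBackOff delays grow up to a cap of five minutes between attempts.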

Main API:

  • Scaled to 0/0 (completely inactive)
  • No running pods
  • Service exists but no backend
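One way to confirm that the Service has nothing behind it is to read its Endpoints object, which lists the ready pod IPs. The sketch below assumes the Endpoints object carries the same name as the Service (the Kubernetes default); the function names are hypothetical helpers, not kubectl features.

```shell
NS=spc--noti-centre--be
SVC=spc--noti-centre--be--app--prod

# backend_state ADDRESSES
# Classifies the space-separated list of ready endpoint IPs for a Service.
backend_state() {
  if [ -z "$1" ]; then
    echo "no ready backends"
  else
    echo "ready backends: $1"
  fi
}

# check_service_backends
# Reads the ready endpoint IPs of the Service and classifies them.
# An Endpoints object with no subsets yields empty output.
check_service_backends() {
  ips=$(kubectl get endpoints "$SVC" -n "$NS" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  backend_state "$ips"
}
```

Run `check_service_backends` before and after scaling the deployment; while the deployment is at 0 replicas it should report no ready backends.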

Duration: Broken for ~42 days (the consumer's age; the currently crashing pod accounts for the last 21 days)

Services

Name | Type | Cluster IP | Ports | NodePort | Purpose
spc--noti-centre--be--app--prod | NodePort | 10.8.24.113 | 80 | 30968 | Main notification API (NO BACKEND!)

Access & Management

Check broken consumer:

# Check consumer pods (will show CrashLoopBackOff)
kubectl get pods -n spc--noti-centre--be | grep consumer

# Check logs from crashing pod
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100

# Check previous logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100 --previous

# Describe pod for crash reason
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod

Check scaled-to-0 main app:

# Check deployment
kubectl get deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be

# Scale up if needed (after fixing consumer issues)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1

View all resources:

kubectl get all -n spc--noti-centre--be

Monitoring

Resource usage:

# Will likely show nothing or failing pods
kubectl top pods -n spc--noti-centre--be

Events (CRITICAL):

# Check the 50 most recent events (sorted oldest first, so take the tail)
kubectl get events -n spc--noti-centre--be --sort-by='.lastTimestamp' | tail -50

# Check for eviction events
kubectl get events -n spc--noti-centre--be | grep -i "evicted\|oom\|failed"

Data Flow (CURRENTLY BROKEN)

Notification Centre Request
    ↓
spc--noti-centre--be--app--prod (NodePort 30968)
    ↓
🔴 NO BACKEND PODS (scaled to 0)

Message Queue (Kafka/Redpanda)
    ↓
🔴 consumer-default (CrashLoopBackOff - 6003 restarts)
    ↓
NOTIFICATIONS NOT BEING PROCESSED

Production Considerations

🔴 CRITICAL ISSUES - IMMEDIATE ACTION REQUIRED

  1. Consumer CrashLoopBackOff (HIGHEST PRIORITY):

    • Pod crashing with 6003 restarts in 21 days (~286 restarts/day)
    • ~12 crashes per hour - completely broken
    • Investigate logs immediately
    • Likely causes:
      • Configuration error
      • Missing dependencies
      • Database connection failure
      • Message queue connection failure
      • Code bug causing immediate crash
      • Resource limits too low
  2. Main API Scaled to 0:

    • No notification API available
    • Service routing to nothing
    • Complete service outage
  3. Multiple Pod Evictions (22 pods):

    • Node resource pressure
    • Out of Memory (OOM)
    • Disk pressure
    • Check node health
  4. ContainerStatusUnknown:

    • 2 pods in unknown state
    • Possible node communication issues
    • Kubelet issues
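The likely causes listed for the consumer can often be narrowed down from the container's last termination state before reading any logs. The sketch below is a rough heuristic, not an authoritative mapping: `explain_exit` and `last_termination` are hypothetical helper names, and the hints are assumptions based on the usual meaning of exit codes (137 = 128 + SIGKILL, 143 = 128 + SIGTERM).

```shell
# explain_exit REASON EXIT_CODE
# Maps a container's last termination state to a first place to look.
explain_exit() {
  case "$1:$2" in
    OOMKilled:*) echo "memory limit exceeded: review resources.limits.memory" ;;
    *:137)       echo "killed by SIGKILL: check node pressure and liveness probes" ;;
    *:143)       echo "terminated by SIGTERM: check probes and evictions" ;;
    *:0)         echo "process exited cleanly: check that the command is long-running" ;;
    *)           echo "application error: read the --previous logs" ;;
  esac
}

# last_termination POD
# Prints "<reason> <exitCode>" for the first container of the given pod.
last_termination() {
  kubectl get pod "$1" -n spc--noti-centre--be \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

For example, `explain_exit $(last_termination <pod-name>)` prints one hint for the named pod; an `Error` reason with exit code 1 points back at the application logs.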
Recommended Actions:

  1. IMMEDIATE - Fix Consumer Crash:

    # Get detailed crash logs
    kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500

    # Check previous crash (with a label selector, kubectl logs shows only 10 lines unless --tail is set)
    kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous --tail=500

    # Describe pod for resource/error details
    kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod

    # Check events
    kubectl get events -n spc--noti-centre--be | grep consumer
  2. Investigate Root Cause:

    • Check application logs for errors
    • Verify configuration (ConfigMaps, Secrets)
    • Verify database connectivity
    • Verify message queue connectivity
    • Check resource limits (CPU, memory)
    • Review deployment YAML for issues
  3. Cleanup Evicted/Failed Pods:

    # Delete evicted pods
    kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed

    # Delete pods in unknown state (-r skips the delete when nothing matches)
    kubectl get pods -n spc--noti-centre--be | grep Unknown | awk '{print $1}' | xargs -r kubectl delete pod -n spc--noti-centre--be
  4. Scale Up Main API (after fixing consumer):

    kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
  5. Add Monitoring:

    • Set up alerts for CrashLoopBackOff
    • Monitor restart counts
    • Set up pod eviction alerts
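Until proper alerting is in place, restart counts can be watched with a small script run on a schedule. This is only a sketch: `pod_restarts` and `high_restarts` are hypothetical helper names, the threshold is an arbitrary choice, and only the first container of each pod is inspected.

```shell
# high_restarts THRESHOLD
# Reads "<pod> <restarts>" lines on stdin and prints those above THRESHOLD.
# Non-numeric counts such as <none> are treated as 0.
high_restarts() {
  awk -v t="$1" '$2 + 0 > t + 0 { print $1 " " $2 }'
}

# pod_restarts
# Lists each pod in the namespace with its first container's restart count.
pod_restarts() {
  kubectl get pods -n spc--noti-centre--be --no-headers \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'
}
```

`pod_restarts | high_restarts 50` prints every pod that has restarted more than 50 times; empty output means nothing crossed the threshold.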

Troubleshooting

CRITICAL - Consumer crash investigation:

# Step 1: Get current crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500

# Step 2: Get previous crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous --tail=500

# Step 3: Describe pod for detailed error
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod | grep -A 50 "Events:"

# Step 4: Check resource usage if pod started
kubectl top pods -n spc--noti-centre--be | grep consumer

# Step 5: Get deployment details
kubectl get deployment spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be -o yaml

# Step 6: Check ConfigMaps and Secrets
kubectl get configmaps -n spc--noti-centre--be
kubectl get secrets -n spc--noti-centre--be

# Step 7: Check events for OOMKilled or other issues
kubectl get events -n spc--noti-centre--be | grep -i "oom\|kill\|fail\|error" | head -20

Pod eviction investigation:

# Check node resources
kubectl top nodes

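# Check node status and conditions (MemoryPressure, DiskPressure, PIDPressure)
kubectl get nodes -o wide
kubectl describe nodes | grep -A 8 "Conditions:"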

# Check pod resource requests
kubectl get pods -n spc--noti-centre--be -o yaml | grep -A 5 "resources:"

# Check eviction events
kubectl get events -n spc--noti-centre--be | grep -i evict

Cleanup and restart:

# Cleanup failed pods
kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed

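# Delete stuck pods (force deletion removes the API object without waiting for the kubelet; -r skips the delete when nothing matches)
kubectl get pods -n spc--noti-centre--be | grep -E "Unknown|Evicted|Error" | awk '{print $1}' | xargs -r kubectl delete pod -n spc--noti-centre--be --force --grace-period=0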

# Restart consumer deployment (after fixing root cause)
kubectl rollout restart deployment/spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be

# Scale up main app (after consumer is healthy)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
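The ordering above (consumer first, API second) can be enforced rather than left to memory. The sketch below gates the scale-up on `kubectl rollout status`; the function name `restore_service` and the 300-second timeout are assumptions chosen for illustration.

```shell
NS=spc--noti-centre--be

# restore_service
# Scales the API up only if the consumer rollout completes within the timeout.
restore_service() {
  if kubectl rollout status deployment/spc--noti-centre--be--consumer-default--prod \
       -n "$NS" --timeout=300s; then
    kubectl scale deployment spc--noti-centre--be--app--prod -n "$NS" --replicas=1
  else
    echo "consumer is not healthy; leaving the main API at 0 replicas" >&2
    return 1
  fi
}
```

A completed rollout only shows that the new pod became available once, so it is worth re-checking the restart count a few minutes later before treating the consumer as stable.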

Performance Metrics

Current Scale

  • Main API: 0 replicas (scaled to 0 - NO SERVICE)
  • Consumer: 0/1 (desired 1, actual 0 - CRASHLOOPBACKOFF)
  • Total Active Pods: 0 pods (complete service outage)
  • Failed Pods: 25 pods (22 evicted, 2 unknown, 1 crashing)
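A per-status breakdown like the one above can be recomputed at any time from the pod list. A minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); `status_tally` is a hypothetical helper name:

```shell
# status_tally
# Reads `kubectl get pods --no-headers` output on stdin and counts pods per
# STATUS (third column), largest group first.
status_tally() {
  awk '{ count[$3]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Usage: `kubectl get pods -n spc--noti-centre--be --no-headers | status_tally`. Re-running it after the cleanup step shows whether the evicted and unknown pods are actually gone.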

Stability

  • Namespace Age: 51 days (very new)
  • Consumer Age: 42 days (crashing since creation)
  • Restart Count: 6003 restarts in 21 days (~286 restarts/day)
  • Status: 🔴 COMPLETELY BROKEN SINCE CREATION

Critical Statistics

  • Uptime: 0% (service completely down)
  • Crash Rate: ~12 crashes per hour
  • Service Availability: NONE
  • Impact: High - notification centre completely unavailable

IMMEDIATE ACTION REQUIRED

This namespace requires URGENT attention:

  • Step 1: Investigate consumer crash logs (highest priority)
  • Step 2: Fix configuration or code bug causing crashes
  • Step 3: Clean up evicted/failed pods
  • Step 4: Scale up main API after consumer is healthy
  • Step 5: Add monitoring and alerts
  • Step 6: Consider rolling back to previous working version if available
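For Step 6, a rollback is only possible if the deployment has more than one recorded revision, which may not be the case for a consumer that has been crashing since creation. The sketch below checks that first; `previous_revision` and `rollback_consumer` are hypothetical helper names, and the parsing assumes the usual two-column `kubectl rollout history` output.

```shell
NS=spc--noti-centre--be
DEPLOY=spc--noti-centre--be--consumer-default--prod

# previous_revision
# Reads `kubectl rollout history` output on stdin and prints the second-newest
# revision number, or nothing if fewer than two revisions exist.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { revs[++n] = $1 } END { if (n >= 2) print revs[n-1] }'
}

# rollback_consumer
# Rolls the consumer back to the previous revision when one is recorded.
rollback_consumer() {
  rev=$(kubectl rollout history deployment/"$DEPLOY" -n "$NS" | previous_revision)
  if [ -z "$rev" ]; then
    echo "no earlier revision recorded; nothing to roll back to" >&2
    return 1
  fi
  kubectl rollout undo deployment/"$DEPLOY" -n "$NS" --to-revision="$rev"
}
```

An earlier revision is not necessarily a working one, so inspect it with `kubectl rollout history deployment/<name> --revision=<n>` before rolling back.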

This service has been broken for ~42 days. If notifications are critical, escalate immediately!