spc--noti-centre--be

Overview

Namespace: spc--noti-centre--be
Purpose: Sapoche Notification Centre Backend - PRODUCTION
Age: 51 days (~1.7 months, since September 2025)
Status: 🔴 CRITICAL - BROKEN - All pods failing or scaled to 0
Workloads: 2 deployments (0 active, 2 broken/scaled to 0)
Environment: PRODUCTION - Notification center (COMPLETELY DOWN!)

Architecture

Notification centre system - CURRENTLY COMPLETELY BROKEN:

Main Application: REST API backend - Scaled to 0/0
Event Consumer: Default queue processing - 🔴 CrashLoopBackOff (6003 restarts!)

Auto-Scaling Configuration

No Auto-Scaling Configured:

No HorizontalPodAutoscalers (HPAs)
No KEDA scaled objects
Fixed replica counts

Workload Categories

Main Application (1 deployment - Scaled to 0)

Name	Replicas	Status	Purpose
spc--noti-centre--be--app--prod	0/0	Scaled to 0	Main notification API (INACTIVE)

Event Consumer (1 deployment - BROKEN)

Name	Replicas	Status	Purpose
consumer-default	0/1	🔴 CrashLoopBackOff	Default queue (6003 restarts in 21 days!)

Critical Issues

🔴 CRITICAL - NAMESPACE COMPLETELY BROKEN

Consumer Pod Status:

CrashLoopBackOff: 1 pod with 6003 restarts in 21 days (~286 restarts/day!)
ContainerStatusUnknown: 2 pods (244 and 5765 restarts)
Evicted: 22 pods evicted (resource pressure or node issues)
Desired 1, Available 0: Consumer cannot start successfully
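
To re-check the counts above against the live cluster, the STATUS column of the pod listing can be tallied. The following is a minimal sketch, not an established tool: the function name tally_pod_status is made up for illustration, and it assumes the default kubectl column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# tally_pod_status: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods --no-headers` output on stdin and prints
# one "STATUS COUNT" line per distinct status, sorted by status name.
# Assumes the default column layout, where STATUS is the third column.
tally_pod_status() {
  awk 'NF >= 3 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (requires cluster access):
#   kubectl get pods -n spc--noti-centre--be --no-headers | tally_pod_status
```

Counting by status rather than reading the list by eye makes it easier to see whether the number of Evicted or ContainerStatusUnknown pods is still growing between checks.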

Main API:

Scaled to 0/0 (completely inactive)
No running pods
Service exists but no backend

Duration: Broken for ~42 days (the age of the consumer deployment; the pod currently crashing is 21 days old)

Services

Name	Type	Cluster IP	Ports	NodePort	Purpose
spc--noti-centre--be--app--prod	NodePort	10.8.24.113	80	30968	Main notification API (NO BACKEND!)

Access & Management

Check broken consumer:

# Check consumer pods (will show CrashLoopBackOff)
kubectl get pods -n spc--noti-centre--be | grep consumer

# Check logs from crashing pod
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100

# Check previous logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100 --previous

# Describe pod for crash reason
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod

Check scaled-to-0 main app:

# Check deployment
kubectl get deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be

# Scale up if needed (after fixing consumer issues)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1

View all resources:

kubectl get all -n spc--noti-centre--be

Monitoring

Resource usage:

# Will likely show nothing or failing pods
kubectl top pods -n spc--noti-centre--be

Events (CRITICAL):

# Check the most recent events (output is sorted oldest first, so take the tail)
kubectl get events -n spc--noti-centre--be --sort-by='.lastTimestamp' | tail -50

# Check for eviction events
kubectl get events -n spc--noti-centre--be | grep -i "evicted\|oom\|failed"
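
When the event list is long, grouping events by reason can show at a glance whether back-offs, evictions, or something else dominates. This is a sketch under assumptions: count_event_reasons is a hypothetical helper name, and the input is expected to be one reason per line, as produced by the custom-columns query shown in the usage comment.

```shell
# count_event_reasons: hypothetical helper, not part of kubectl.
# Reads one event reason per line on stdin and prints "COUNT REASON",
# most frequent reason first.
count_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage (requires cluster access):
#   kubectl get events -n spc--noti-centre--be --no-headers \
#     -o custom-columns=REASON:.reason | count_event_reasons
```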

Data Flow (CURRENTLY BROKEN)

Notification Centre Request
    ↓
spc--noti-centre--be--app--prod (NodePort 30968)
    ↓
🔴 NO BACKEND PODS (scaled to 0)
    ↓
Message Queue (Kafka/Redpanda)
    ↓
🔴 consumer-default (CrashLoopBackOff - 6003 restarts)
    ↓
NOTIFICATIONS NOT BEING PROCESSED

Production Considerations

🔴 CRITICAL ISSUES - IMMEDIATE ACTION REQUIRED

Consumer CrashLoopBackOff (HIGHEST PRIORITY):
- Pod has restarted 6003 times in 21 days (~286 restarts/day, ~12 per hour)
- Investigate logs immediately
- Likely causes:
  - Configuration error
  - Missing dependencies
  - Database connection failure
  - Message queue connection failure
  - Code bug causing immediate crash
  - Resource limits too low

Main API Scaled to 0:
- No notification API available
- Service has no backend pods to route to
- Complete service outage

Multiple Pod Evictions (22 pods):
- Possible causes: node resource pressure, out of memory (OOM), disk pressure
- Check node health

ContainerStatusUnknown:
- 2 pods in unknown state
- Possible causes: node communication issues, kubelet issues
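
One way to narrow down the likely causes listed above is to look at the exit code of the last terminated container. The mapping below is a heuristic sketch only: describe_exit_code is a hypothetical helper, and the suggested interpretations are starting points, not diagnoses. The pod name in the usage comment is a placeholder.

```shell
# describe_exit_code: hypothetical helper that maps a container exit code
# to a first guess at the failure class. Exit codes above 128 conventionally
# mean "terminated by signal (code - 128)".
describe_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL, often OOMKilled" ;;
    139) echo "SIGSEGV" ;;
    143) echo "SIGTERM" ;;
    *)   if [ "$code" -gt 128 ] 2>/dev/null; then
           echo "killed by signal $((code - 128))"
         else
           echo "application error, check previous logs"
         fi ;;
  esac
}

# Usage (requires cluster access; <pod-name> is a placeholder):
#   code=$(kubectl get pod <pod-name> -n spc--noti-centre--be \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   describe_exit_code "$code"
```

An exit code of 137 would point towards the "resource limits too low" cause, while a small non-zero code such as 1 more often indicates a configuration or connection error raised by the application itself.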

Recommended Actions

IMMEDIATE - Fix Consumer Crash:

# Get detailed crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500

# Check previous crash (with a label selector, kubectl logs shows only 10 lines unless --tail is set)
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous --tail=500

# Describe pod for resource/error details
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod

# Check events
kubectl get events -n spc--noti-centre--be | grep consumer

Investigate Root Cause:
- Check application logs for errors
- Verify configuration (ConfigMaps, Secrets)
- Verify database connectivity
- Verify message queue connectivity
- Check resource limits (CPU, memory)
- Review deployment YAML for issues

Cleanup Evicted/Failed Pods:

# Delete evicted pods (evicted pods are in phase Failed; this removes every Failed pod in the namespace)
kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed

# Delete pods in unknown state (-r: skip the delete when nothing matches)
kubectl get pods -n spc--noti-centre--be | grep Unknown | awk '{print $1}' | xargs -r kubectl delete pod -n spc--noti-centre--be

Scale Up Main API (after fixing consumer):

kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1

Add Monitoring:
- Set up alerts for CrashLoopBackOff
- Monitor restart counts
- Set up pod eviction alerts
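
Until proper alerting is in place, restart counts can be checked with a small filter over the pod listing. This is a minimal sketch, assuming the default kubectl column layout with RESTARTS in the fourth column; pods_over_restart_limit is a hypothetical helper name and the limit of 10 in the usage comment is an arbitrary example value.

```shell
# pods_over_restart_limit: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "NAME RESTARTS" for every pod whose restart count exceeds the limit
# passed as the first argument. Assumes RESTARTS is the fourth column.
pods_over_restart_limit() {
  awk -v limit="$1" '$4 + 0 > limit + 0 { print $1, $4 }'
}

# Usage (requires cluster access); non-empty output could trigger a notification:
#   kubectl get pods -n spc--noti-centre--be --no-headers | pods_over_restart_limit 10
```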

Troubleshooting

CRITICAL - Consumer crash investigation:

# Step 1: Get current crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500

# Step 2: Get previous crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous --tail=500

# Step 3: Describe pod for detailed error
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod | grep -A 50 "Events:"

# Step 4: Check resource usage if pod started
kubectl top pods -n spc--noti-centre--be | grep consumer

# Step 5: Get deployment details
kubectl get deployment spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be -o yaml

# Step 6: Check ConfigMaps and Secrets
kubectl get configmaps -n spc--noti-centre--be
kubectl get secrets -n spc--noti-centre--be

# Step 7: Check events for OOMKilled or other issues
kubectl get events -n spc--noti-centre--be | grep -i "oom\|kill\|fail\|error" | head -20

Pod eviction investigation:

# Check node resources
kubectl top nodes

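# Check node status
kubectl get nodes -o wide

# Check node conditions (MemoryPressure, DiskPressure, PIDPressure)
kubectl describe nodes | grep -A 8 "Conditions:"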

# Check pod resource requests
kubectl get pods -n spc--noti-centre--be -o yaml | grep -A 5 "resources:"

# Check eviction events
kubectl get events -n spc--noti-centre--be | grep -i evict

Cleanup and restart:

# Cleanup failed pods
kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed

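# Delete stuck pods. --force --grace-period=0 removes the pod object without waiting
# for the kubelet to confirm termination; use it only on pods a normal delete leaves behind
kubectl get pods -n spc--noti-centre--be | grep -E "Unknown|Evicted|Error" | awk '{print $1}' | xargs -r kubectl delete pod -n spc--noti-centre--be --force --grace-period=0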

# Restart consumer deployment (after fixing root cause)
kubectl rollout restart deployment/spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be

# Scale up main app (after consumer is healthy)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
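
The ordering in the last two commands (scale the API only once the consumer is healthy) can be enforced rather than left to memory. The sketch below is one possible way to do that, not an established procedure: scale_api_if_consumer_healthy is a hypothetical helper, the KUBECTL override variable is an assumption added so the function can be exercised with a stub, and the 120s timeout is an arbitrary example value.

```shell
# Assumption: KUBECTL may be overridden (for example with a stub for testing);
# it defaults to the real kubectl binary.
KUBECTL=${KUBECTL:-kubectl}
NS=spc--noti-centre--be

# scale_api_if_consumer_healthy: hypothetical helper.
# Waits for the consumer rollout to complete and scales the API to one
# replica only if the rollout succeeded within the timeout.
scale_api_if_consumer_healthy() {
  if $KUBECTL rollout status deployment/spc--noti-centre--be--consumer-default--prod \
       -n "$NS" --timeout=120s; then
    $KUBECTL scale deployment spc--noti-centre--be--app--prod -n "$NS" --replicas=1
  else
    echo "consumer rollout not healthy; leaving API at 0 replicas" >&2
    return 1
  fi
}

# Usage (requires cluster access):
#   scale_api_if_consumer_healthy
```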

Performance Metrics

Current Scale

Main API: 0 replicas (scaled to 0 - NO SERVICE)
Consumer: 0/1 (desired 1, actual 0 - CRASHLOOPBACKOFF)
Total Active Pods: 0 pods (complete service outage)
Failed Pods: 25 pods (22 evicted, 2 unknown, 1 crashing)

Stability

Namespace Age: 51 days (very new)
Consumer Age: 42 days (crashing since creation)
Restart Count: 6003 restarts in 21 days (~286 restarts/day)
Status: 🔴 COMPLETELY BROKEN SINCE CREATION

Critical Statistics

Uptime: 0% (service completely down)
Crash Rate: ~12 crashes per hour
Service Availability: NONE
Impact: High - notification centre completely unavailable
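
The rates quoted above follow directly from the restart count and the pod age. The short calculation below reproduces them from the two reported figures (6003 restarts, 21 days) and also derives the average gap between restarts, which comes out at roughly five minutes. That gap is close to the kubelet's default five-minute cap on crash-loop back-off delay, which would be consistent with the container failing almost immediately on every start; this is an inference, not something the logs have confirmed.

```shell
restarts=6003   # restart count reported for the crashing consumer pod
days=21         # age of that pod in days

# awk is used because plain shell arithmetic is integer-only.
per_day=$(awk -v r="$restarts" -v d="$days" 'BEGIN { printf "%.0f", r / d }')
per_hour=$(awk -v r="$restarts" -v d="$days" 'BEGIN { printf "%.1f", r / (d * 24) }')
minutes_between=$(awk -v r="$restarts" -v d="$days" 'BEGIN { printf "%.1f", d * 24 * 60 / r }')

echo "restarts per day:         $per_day"
echo "restarts per hour:        $per_hour"
echo "minutes between restarts: $minutes_between"
```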

IMMEDIATE ACTION REQUIRED

This namespace requires URGENT attention:

Step 1: Investigate consumer crash logs (highest priority)
Step 2: Fix configuration or code bug causing crashes
Step 3: Clean up evicted/failed pods
Step 4: Scale up main API after consumer is healthy
Step 5: Add monitoring and alerts
Step 6: Consider rolling back to previous working version if available

This service has been broken for ~42 days. If notifications are critical, escalate immediately!
