spc--noti-centre--be
Overview
- Namespace: spc--noti-centre--be
- Purpose: Sapoche Notification Centre Backend - PRODUCTION
- Age: 51 days (~1.7 months, since September 2025)
- Status: 🔴 CRITICAL - BROKEN - All pods failing or scaled to 0
- Workloads: 2 deployments (0 active: 1 in CrashLoopBackOff, 1 scaled to 0)
- Environment: PRODUCTION - Notification centre (COMPLETELY DOWN!)
Architecture
Notification centre system - CURRENTLY COMPLETELY BROKEN:
- Main Application: REST API backend - scaled to 0/0
- Event Consumer: Default queue processing - 🔴 CrashLoopBackOff (6003 restarts!)
Auto-Scaling Configuration
No Auto-Scaling Configured:
- No HorizontalPodAutoscalers (HPAs)
- No KEDA scaled objects
- Fixed replica counts
Workload Categories
Main Application (1 deployment - Scaled to 0)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| spc--noti-centre--be--app--prod | 0/0 | Scaled to 0 | Main notification API (INACTIVE) |
Event Consumer (1 deployment - BROKEN)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| consumer-default | 0/1 | 🔴 CrashLoopBackOff | Default queue (6003 restarts in 21 days!) |
Critical Issues
🔴 CRITICAL - NAMESPACE COMPLETELY BROKEN
Consumer Pod Status:
- CrashLoopBackOff: 1 pod with 6003 restarts in 21 days (~286 restarts/day!)
- ContainerStatusUnknown: 2 pods (244 and 5765 restarts)
- Evicted: 22 pods evicted (resource pressure or node issues)
- Desired 1, Available 0: Consumer cannot start successfully
Main API:
- Scaled to 0/0 (completely inactive)
- No running pods
- Service exists but no backend
Duration: Broken for ~42 days (consumer age)
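The status breakdown above can be reproduced from a live pod listing. The following is a minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE); the function name `count_by_status` is hypothetical and not part of any existing tooling for this namespace.

```shell
# count_by_status: reads `kubectl get pods --no-headers` output on stdin
# and prints "<count> <status>" lines, most frequent status first.
# Assumes STATUS is the third whitespace-separated column.
count_by_status() {
  awk '{ count[$3]++ } END { for (s in count) print count[s], s }' |
    sort -k1,1nr -k2,2
}
```

A possible invocation is `kubectl get pods -n spc--noti-centre--be --no-headers | count_by_status`. Reading from stdin keeps the function usable against a saved listing as well as a live cluster.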
Services
| Name | Type | Cluster IP | Ports | NodePort | Purpose |
|---|---|---|---|---|---|
| spc--noti-centre--be--app--prod | NodePort | 10.8.24.113 | 80 | 30968 | Main notification API (NO BACKEND!) |
Access & Management
Check broken consumer:
# Check consumer pods (will show CrashLoopBackOff)
kubectl get pods -n spc--noti-centre--be | grep consumer
# Check logs from crashing pod
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100
# Check previous logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=100 --previous
# Describe pod for crash reason
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod
Check scaled-to-0 main app:
# Check deployment
kubectl get deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be
# Scale up if needed (after fixing consumer issues)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
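Because the main app should only be scaled up once the consumer is healthy, the scale command can be gated on the consumer's READY column. This is a sketch under assumptions: it expects a single line of `kubectl get deployment --no-headers` output (NAME, READY, UP-TO-DATE, AVAILABLE, AGE) on stdin, and `deployment_ready` is a hypothetical helper name.

```shell
# deployment_ready: reads one `kubectl get deployment --no-headers` line on stdin.
# Exits 0 only when READY has the form N/N with N > 0 (e.g. 1/1);
# exits 1 for 0/1, 0/0, or empty input.
deployment_ready() {
  awk '{ split($2, r, "/"); ok = (r[1] + 0 > 0 && r[1] == r[2]) }
       END { exit ok ? 0 : 1 }'
}
```

For example, `kubectl get deployment spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be --no-headers | deployment_ready && kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1` would scale the API only when the consumer reports all replicas ready.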
View all resources:
kubectl get all -n spc--noti-centre--be
Monitoring
Resource usage:
# Will likely show nothing or failing pods
kubectl top pods -n spc--noti-centre--be
Events (CRITICAL):
# Check the 50 most recent events (output is sorted oldest first, so take the tail)
kubectl get events -n spc--noti-centre--be --sort-by='.lastTimestamp' | tail -50
# Check for eviction events
kubectl get events -n spc--noti-centre--be | grep -iE "evicted|oom|failed"
Data Flow (CURRENTLY BROKEN)
Notification Centre Request
↓
spc--noti-centre--be--app--prod (NodePort 30968)
↓
🔴 NO BACKEND PODS (scaled to 0)
↓
Message Queue (Kafka/Redpanda)
↓
🔴 consumer-default (CrashLoopBackOff - 6003 restarts)
↓
NOTIFICATIONS NOT BEING PROCESSED
Production Considerations
🔴 CRITICAL ISSUES - IMMEDIATE ACTION REQUIRED
- Consumer CrashLoopBackOff (HIGHEST PRIORITY):
  - Pod crashing with 6003 restarts in 21 days (~286 restarts/day)
  - ~12 crashes per hour - completely broken
  - Investigate logs immediately
  - Likely causes:
    - Configuration error
    - Missing dependencies
    - Database connection failure
    - Message queue connection failure
    - Code bug causing immediate crash
    - Resource limits too low
- Main API Scaled to 0:
  - No notification API available
  - Service routing to nothing
  - Complete service outage
- Multiple Pod Evictions (22 pods):
  - Node resource pressure
  - Out of Memory (OOM)
  - Disk pressure
  - Check node health
- ContainerStatusUnknown:
  - 2 pods in unknown state
  - Possible node communication issues
  - Kubelet issues
Recommended Actions
- IMMEDIATE - Fix Consumer Crash:
  # Get detailed crash logs
  kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500
  # Check previous crash
  kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous
  # Describe pod for resource/error details
  kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod
  # Check events
  kubectl get events -n spc--noti-centre--be | grep consumer
- Investigate Root Cause:
  - Check application logs for errors
  - Verify configuration (ConfigMaps, Secrets)
  - Verify database connectivity
  - Verify message queue connectivity
  - Check resource limits (CPU, memory)
  - Review deployment YAML for issues
- Cleanup Evicted/Failed Pods:
  # Delete evicted pods
  kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed
  # Delete pods in unknown state
  kubectl get pods -n spc--noti-centre--be | grep Unknown | awk '{print $1}' | xargs kubectl delete pod -n spc--noti-centre--be
- Scale Up Main API (after fixing consumer):
  kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
- Add Monitoring:
  - Set up alerts for CrashLoopBackOff
  - Monitor restart counts
  - Set up pod eviction alerts
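Until proper alerting exists, restart counts can be checked with a small filter. This is only a sketch, not a replacement for a real alerting stack: it assumes the default `kubectl get pods --no-headers` column layout with RESTARTS as the fourth field, and both the function name `flag_restarts` and any threshold passed to it are illustrative.

```shell
# flag_restarts THRESHOLD
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<pod-name> <restart-count>" for every pod at or above THRESHOLD.
# Assumes RESTARTS is the fourth whitespace-separated column; a trailing
# "(5m ago)" annotation from newer kubectl versions does not affect it.
flag_restarts() {
  awk -v t="$1" '$4 + 0 >= t { print $1, $4 }'
}
```

For instance, `kubectl get pods -n spc--noti-centre--be --no-headers | flag_restarts 100` would list pods that have restarted 100 times or more; the output could be fed to whatever notification channel the team already uses.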
Troubleshooting
CRITICAL - Consumer crash investigation:
# Step 1: Get current crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --tail=500
# Step 2: Get previous crash logs
kubectl logs -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod --previous --tail=500
# Step 3: Describe pod for detailed error
kubectl describe pods -n spc--noti-centre--be -l app=spc--noti-centre--be--consumer-default--prod | grep -A 50 "Events:"
# Step 4: Check resource usage if pod started
kubectl top pods -n spc--noti-centre--be | grep consumer
# Step 5: Get deployment details
kubectl get deployment spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be -o yaml
# Step 6: Check ConfigMaps and Secrets
kubectl get configmaps -n spc--noti-centre--be
kubectl get secrets -n spc--noti-centre--be
# Step 7: Check events for OOMKilled or other issues
kubectl get events -n spc--noti-centre--be | grep -iE "oom|kill|fail|error" | head -20
Pod eviction investigation:
# Check node resources
kubectl top nodes
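# Check node status, then node conditions (MemoryPressure, DiskPressure, PIDPressure)
kubectl get nodes -o wide
kubectl describe nodes | grep -A 8 "Conditions:"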
# Check pod resource requests
kubectl get pods -n spc--noti-centre--be -o yaml | grep -A 5 "resources:"
# Check eviction events
kubectl get events -n spc--noti-centre--be | grep -i evict
Cleanup and restart:
# Cleanup failed pods
kubectl delete pods -n spc--noti-centre--be --field-selector=status.phase=Failed
# Delete stuck pods
kubectl get pods -n spc--noti-centre--be | grep -E "Unknown|Evicted|Error" | awk '{print $1}' | xargs kubectl delete pod -n spc--noti-centre--be --force --grace-period=0
# Restart consumer deployment (after fixing root cause)
kubectl rollout restart deployment/spc--noti-centre--be--consumer-default--prod -n spc--noti-centre--be
# Scale up main app (after consumer is healthy)
kubectl scale deployment spc--noti-centre--be--app--prod -n spc--noti-centre--be --replicas=1
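The `grep -E "Unknown|Evicted|Error"` filter above matches anywhere on the line, so a pod whose name happens to contain one of those words would also be selected. A more conservative variant, sketched here, matches on the STATUS column only and just prints the names so they can be reviewed before deletion. It assumes the default `kubectl get pods --no-headers` layout; `stuck_pods` is a hypothetical helper name.

```shell
# stuck_pods: reads `kubectl get pods --no-headers` output on stdin and
# prints the names of pods whose STATUS (third column) is Evicted, Error,
# or contains "Unknown" (e.g. ContainerStatusUnknown).
# It only lists names; nothing is deleted by this function.
stuck_pods() {
  awk '$3 == "Evicted" || $3 == "Error" || $3 ~ /Unknown/ { print $1 }'
}
```

Running `kubectl get pods -n spc--noti-centre--be --no-headers | stuck_pods` gives a dry-run list; once it looks right, the same output can be piped to `xargs kubectl delete pod -n spc--noti-centre--be` as in the commands above.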
Performance Metrics
Current Scale
- Main API: 0 replicas (scaled to 0 - NO SERVICE)
- Consumer: 0/1 (desired 1, actual 0 - CrashLoopBackOff)
- Total Active Pods: 0 pods (complete service outage)
- Failed Pods: 25 pods (22 evicted, 2 unknown, 1 crashing)
Stability
- Namespace Age: 51 days (very new)
- Consumer Age: 42 days (crashing since creation)
- Restart Count: 6003 restarts in 21 days (~286 restarts/day)
- Status: 🔴 COMPLETELY BROKEN SINCE CREATION
Critical Statistics
- Uptime: 0% (service completely down)
- Crash Rate: ~12 crashes per hour
- Service Availability: NONE
- Impact: High - notification centre completely unavailable
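The per-day and per-hour crash rates quoted above follow directly from the restart count and the pod age. As a worked check, here is a small sketch using `awk` for the arithmetic; the function name `restart_rate` is illustrative.

```shell
# restart_rate RESTARTS DAYS
# Prints the average restarts per day and per hour, each rounded to the
# nearest integer, e.g. "286/day 12/hour".
restart_rate() {
  awk -v r="$1" -v d="$2" 'BEGIN {
    if (d <= 0) { print "days must be positive" > "/dev/stderr"; exit 1 }
    per_day = r / d
    printf "%.0f/day %.0f/hour\n", per_day, per_day / 24
  }'
}

# Usage: restart_rate 6003 21
```

With the figures from this namespace, 6003 / 21 is about 285.9 restarts per day, and dividing that by 24 gives about 11.9 per hour, which matches the ~286/day and ~12/hour stated above.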
IMMEDIATE ACTION REQUIRED
This namespace requires URGENT attention:
- Step 1: Investigate consumer crash logs (highest priority)
- Step 2: Fix configuration or code bug causing crashes
- Step 3: Clean up evicted/failed pods
- Step 4: Scale up main API after consumer is healthy
- Step 5: Add monitoring and alerts
- Step 6: Consider rolling back to previous working version if available
This service has been broken for ~42 days. If notifications are critical, escalate immediately!