pat--dlq--be
Overview
- Namespace: pat--dlq--be
- Purpose: Patient Dead Letter Queue Backend - PRODUCTION
- Age: ~2 years 145 days (since June 2023)
- Status: Active - Failed message handling and retry system
- Workloads: 5 deployments (all active)
- Environment: PRODUCTION - Message failure recovery
Architecture
Dead Letter Queue (DLQ) system handling failed messages and retry logic:
- Main Application: REST API backend (3 replicas) - High Availability
- Event Consumer: Retries failed messages (1 deployment)
- Workers: Background job processing (2 deployments)
- Scheduler: Cron jobs for scheduled tasks (1 deployment)
Auto-Scaling Configuration
No Auto-Scaling Configured:
- No HorizontalPodAutoscalers (HPAs)
- No KEDA scaled objects
- Fixed replica counts (Main app: 3, others: 1)
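The absence of autoscalers can be re-checked at any time. The sketch below wraps the relevant queries in small functions; it assumes `kubectl` is pointed at the right cluster, and the KEDA query only works if the KEDA CRDs are installed (otherwise kubectl reports an unknown resource type). The function names are illustrative.

```shell
# Namespace taken from this page.
NS="pat--dlq--be"

# List HorizontalPodAutoscalers in the namespace (expected: none).
check_hpa() {
  kubectl get hpa -n "$NS"
}

# List KEDA ScaledObjects in the namespace (expected: none).
# Assumes the KEDA CRDs exist in the cluster.
check_keda() {
  kubectl get scaledobjects.keda.sh -n "$NS"
}

# Print each deployment with its configured (fixed) replica count.
list_replicas() {
  kubectl get deployments -n "$NS" \
    -o custom-columns='NAME:.metadata.name,REPLICAS:.spec.replicas'
}
```

Run `check_hpa`, `check_keda`, and `list_replicas` in turn; if either of the first two returns resources, this section is out of date.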
Workload Categories
Main Application (1 deployment)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| pat--dlq--be--app--prod | 3/3 | Running | Main DLQ API (HA configured) |
Event Consumer (1 deployment)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| pat--dlq--be--consumer-retry-message--prod | 1/1 | Running | Retry failed messages from DLQ |
Workers (2 deployments)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| pat--dlq--be--wrk--default--prod | 1/1 | Running | Default worker queue |
| pat--dlq--be--wrk--notifications--prod | 1/1 | Running | Notification processing |
Scheduler (1 deployment)
| Name | Replicas | Status | Purpose |
|---|---|---|---|
| pat--dlq--be--cron--prod | 1/1 | Running | Scheduled cron jobs |
Services
| Name | Type | Cluster IP | Ports | NodePort | Purpose |
|---|---|---|---|---|---|
| pat--dlq--be--app--prod | NodePort | 10.8.25.61 | 80 | 31871 | Main DLQ API |
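Because the service is of type NodePort, it should be reachable on port 31871 of any cluster node. The following is a minimal sketch of such a probe. It assumes the nodes expose an `InternalIP` that is routable from where the command runs, and it requests the root path `/`, which is an assumption since the API's routes are not documented on this page.

```shell
NODE_PORT=31871   # NodePort from the Services table

# Return the InternalIP of the first node in the cluster.
first_node_ip() {
  kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
}

# Build the URL for the NodePort service on a given node IP.
nodeport_url() {
  echo "http://$1:${NODE_PORT}/"
}

# Send one request and print only the HTTP status code.
# The "/" path is a placeholder; substitute a real health route if one exists.
probe_api() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 \
    "$(nodeport_url "$(first_node_ip)")"
}
```

Calling `probe_api` prints the status code returned through the NodePort, which helps separate service-level problems from pod-level ones.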
Access & Management
View all resources:
kubectl get all -n pat--dlq--be
Check main application:
# View app pods (3 replicas)
kubectl get pods -n pat--dlq--be | grep "app--prod"
# Follow logs (given a deployment, kubectl streams from only one of its pods)
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be
# Follow logs from every container in that pod
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true
Check retry consumer:
kubectl get pods -n pat--dlq--be | grep retry
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be
Check workers:
kubectl get pods -n pat--dlq--be | grep wrk
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be
Restart services:
# Restart main app (all 3 replicas)
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be
# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be
# Restart all workers
kubectl get deployments -n pat--dlq--be | grep wrk | awk '{print $1}' | xargs -I {} kubectl rollout restart deployment/{} -n pat--dlq--be
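A restart only triggers a rolling update; it does not confirm that the new pods became ready. One way to close that gap is to follow each restart with `kubectl rollout status`. The helper below is a sketch: the function name and the 180-second timeout are arbitrary choices, not values taken from this environment.

```shell
NS="pat--dlq--be"

# Restart a deployment, then block until the rollout completes or times out.
# Returns non-zero if either step fails.
restart_and_wait() {
  local deploy="$1"
  kubectl rollout restart "deployment/${deploy}" -n "$NS" &&
    kubectl rollout status "deployment/${deploy}" -n "$NS" --timeout=180s
}
```

For example, `restart_and_wait pat--dlq--be--app--prod` restarts the main API and waits for all replicas to be replaced.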
Monitoring
Resource usage:
kubectl top pods -n pat--dlq--be --sort-by=memory
kubectl top pods -n pat--dlq--be --sort-by=cpu
Events:
# Sorted oldest to newest, so the most recent events are at the end
kubectl get events -n pat--dlq--be --sort-by='.lastTimestamp' | tail -20
Data Flow
Failed Message Event
↓
pat--dlq--be--app--prod (NodePort 31871)
↓
Main DLQ API (3 replicas - HA)
↓
Dead Letter Queue (Kafka/Redpanda DLQ Topic)
↓
consumer-retry-message → Retry Logic
├─ Success → Republish to original queue
└─ Fail → Keep in DLQ, alert/log
↓
Workers Process Background Jobs
↓
Cron Jobs → Scheduled DLQ cleanup/monitoring
↓
Failed message recovery, alerts
DLQ Workflow
1. DLQ API (High Availability)
- 3 replicas for redundancy
- Receive failed messages from various services
- Store failed messages for analysis
- Provide UI/API for DLQ management
- Manual retry triggers
- DLQ monitoring and metrics
2. Message Retry Consumer
- Automatically retry failed messages
- consumer-retry-message processes retry logic
- Exponential backoff strategy
- Maximum retry attempts
- Success: republish to original queue
- Permanent failure: keep in DLQ with error details
3. Background Workers
- wrk--notifications: Process notification failures
- wrk--default: General DLQ processing
- Alert on persistent failures
- Generate failure reports
4. Scheduled Tasks
- Cron jobs for DLQ maintenance
- Cleanup old DLQ messages
- Generate DLQ reports
- Alert on high DLQ volumes
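The retry behaviour described in step 2 (exponential backoff, a maximum number of attempts, then either republish or keep in the DLQ) can be illustrated with a small shell sketch. This is not the consumer's actual implementation: the base delay, cap, and attempt limit below are hypothetical values, and the command being retried stands in for whatever republish operation the consumer performs.

```shell
# Hypothetical settings; the real values are not documented on this page.
BASE_DELAY=1      # seconds to wait after the first failed attempt
MAX_DELAY=60      # upper bound on any single wait
MAX_ATTEMPTS=5    # give up after this many tries

# Delay after failed attempt N (1-based): BASE_DELAY * 2^(N-1), capped at MAX_DELAY.
backoff_delay() {
  local attempt="$1"
  local delay=$(( BASE_DELAY * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$MAX_DELAY" ]; then
    delay="$MAX_DELAY"
  fi
  echo "$delay"
}

# Run the given command until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success (message would be republished),
# 1 on permanent failure (message would stay in the DLQ).
retry_with_backoff() {
  local attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@"; then
      return 0
    fi
    # Wait before the next try, but not after the final one.
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$(backoff_delay "$attempt")"
    fi
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

With these settings the waits between attempts are 1, 2, 4, and 8 seconds. The cap matters only for larger attempt limits, where doubling would otherwise grow without bound.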
Production Considerations
High Availability
Well Configured:
- Main API: 3 replicas for redundancy
- Mature namespace (~2 years)
Single Points of Failure:
- consumer-retry-message: 1 replica (critical for auto-retry)
- All workers: 1 replica each
- Cron job: 1 replica
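The single-replica deployments listed above can be found mechanically rather than by reading the tables. The sketch below assumes `kubectl` and `awk` are available; the function names are illustrative.

```shell
NS="pat--dlq--be"

# Emit "name<TAB>replicas" for every deployment in the namespace.
deployment_replicas() {
  kubectl get deployments -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}'
}

# Read "name<TAB>replicas" lines on stdin; print names with fewer than 2 replicas.
filter_single_replica() {
  awk -F'\t' 'NF == 2 && $2 < 2 { print $1 }'
}

# Names of deployments that currently run as a single replica.
single_points_of_failure() {
  deployment_replicas | filter_single_replica
}
```

Running `single_points_of_failure` should list the retry consumer, both workers, and the cron deployment if the replica counts on this page are still current.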
Recommendations
Retry Consumer Resilience:
- Currently 1 replica (single point of failure)
- Consider 2+ replicas for redundancy
- Critical for automatic failure recovery
Add Auto-Scaling:
- Consider HPA for main API (currently fixed 3)
- Add KEDA for retry consumer based on DLQ depth
- Scale during high failure periods
Worker Resilience:
- wrk--notifications: 1 replica (consider 2)
- wrk--default: 1 replica (consider 2)
- Important for DLQ processing
Monitoring Priorities:
- DLQ message volume
- Retry success rates
- Message age in DLQ
- Consumer lag (retry consumer)
- Failed message patterns
Alerting:
- High DLQ volume
- Old messages in DLQ
- Retry consumer failures
- Permanent failure patterns
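The replica and HPA recommendations map onto two standard kubectl commands, sketched below. The numbers passed in are illustrative rather than tuned values. A CPU-based HPA also assumes that metrics-server is running and that the pods declare CPU requests. Before running a second retry consumer, confirm that it can safely process messages in parallel (for example via a shared consumer group); this page does not state that. The KEDA recommendation needs a ScaledObject manifest and is not covered here.

```shell
NS="pat--dlq--be"

# Set a fixed replica count on a deployment.
scale_to() {
  local deploy="$1" replicas="$2"
  kubectl scale "deployment/${deploy}" -n "$NS" --replicas="$replicas"
}

# Create a CPU-based HorizontalPodAutoscaler for a deployment.
# min, max, and cpu (target utilisation in percent) are caller-supplied examples.
add_cpu_hpa() {
  local deploy="$1" min="$2" max="$3" cpu="$4"
  kubectl autoscale "deployment/${deploy}" -n "$NS" \
    --min="$min" --max="$max" --cpu-percent="$cpu"
}
```

For example, `scale_to pat--dlq--be--consumer-retry-message--prod 2` adds a second retry consumer, and `add_cpu_hpa pat--dlq--be--app--prod 3 6 70` would let the main API grow from 3 to 6 replicas. Changes made this way are overwritten by the next deploy unless the manifests are updated as well.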
Troubleshooting
Main API issues:
# Check all 3 API pods
kubectl get pods -n pat--dlq--be | grep "app--prod"
# Check recent logs from all containers (given a deployment, kubectl reads from only one of its pods)
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true --tail=100
# Check specific pod
POD_NAME=$(kubectl get pods -n pat--dlq--be | grep "app--prod" | head -1 | awk '{print $1}')
kubectl logs $POD_NAME -n pat--dlq--be --tail=100
# Test API endpoint
kubectl port-forward -n pat--dlq--be service/pat--dlq--be--app--prod 8080:80
# port-forward runs in the foreground; from another terminal, access http://localhost:8080
Retry consumer issues:
# Check retry consumer
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be
# Check for retry errors
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=100 | grep -i "error\|retry\|fail"
# Check consumer resource usage
kubectl top pods -n pat--dlq--be | grep retry
# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be
High DLQ volume:
# Check retry consumer logs for patterns
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=500 | grep -i "permanent\|max.*retry"
# Check API logs for DLQ write patterns
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=200 | grep -i "dlq\|failed"
# Check which services are generating failures
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=500 | grep -o "source:.*" | sort | uniq -c | sort -rn
Worker issues:
# Check notification worker
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be
# Check default worker
kubectl logs -f deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be
# Check for worker errors
kubectl logs deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be --tail=100 | grep -i "error\|fail"
# Restart workers
kubectl rollout restart deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be
kubectl rollout restart deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be
Cron job failures:
# Check cron pod
kubectl get pods -n pat--dlq--be | grep cron
# Check cron logs
kubectl logs -f deployment/pat--dlq--be--cron--prod -n pat--dlq--be
# Restart cron
kubectl rollout restart deployment/pat--dlq--be--cron--prod -n pat--dlq--be
Load distribution issues:
# Check resource usage across API replicas
kubectl top pods -n pat--dlq--be | grep app--prod
# Check logs from each replica
for pod in $(kubectl get pods -n pat--dlq--be | grep "app--prod" | awk '{print $1}'); do
  echo "=== $pod ==="
  kubectl logs "$pod" -n pat--dlq--be --tail=20
done
# Restart all to redistribute load
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be
Performance Metrics
Current Scale
- Main API: 3 replicas (good HA)
- Retry Consumer: 1 replica
- Workers: 2 workers at 1 replica each
- Total Active Pods: 7 pods (3 API + 1 retry consumer + 2 workers + 1 cron)
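The figures above can be compared against the live cluster. This sketch counts pods in the Running phase and sums the configured replicas across deployments; the two numbers should match when everything is healthy. Function names are illustrative.

```shell
NS="pat--dlq--be"

# Count pods currently in the Running phase.
count_running_pods() {
  kubectl get pods -n "$NS" --field-selector=status.phase=Running --no-headers |
    wc -l | tr -d ' '
}

# Sum the configured replica counts of all deployments in the namespace.
sum_replicas() {
  kubectl get deployments -n "$NS" \
    -o jsonpath='{range .items[*]}{.spec.replicas}{"\n"}{end}' |
    awk '{ total += $1 } END { print total + 0 }'
}
```

If `count_running_pods` is lower than `sum_replicas`, at least one replica is pending, crashing, or terminating.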
Stability
- Namespace Age: ~2 years (mature, stable)
- Most Recent Update: 205 days ago (stable)
- HA Configuration: 3 replicas for main API