pat--dlq--be

Overview

Namespace: pat--dlq--be
Purpose: Patient Dead Letter Queue Backend - PRODUCTION
Age: ~2 years 145 days (since June 2023)
Status: Active - Failed message handling and retry system
Workloads: 5 deployments (all active)
Environment: PRODUCTION - Message failure recovery

Architecture

Dead Letter Queue (DLQ) system handling failed messages and retry logic:

Main Application: REST API backend (1 deployment, 3 replicas for high availability)
Event Consumer: Retries failed messages (1 deployment)
Workers: Background job processing (2 deployments)
Scheduler: Cron jobs for scheduled tasks (1 deployment)

Auto-Scaling Configuration

No Auto-Scaling Configured:

No HorizontalPodAutoscalers (HPAs)
No KEDA scaled objects
Fixed replica counts (Main app: 3, others: 1)

Workload Categories

Main Application (1 deployment)

Name	Replicas	Status	Purpose
pat--dlq--be--app--prod	3/3	Running	Main DLQ API (HA configured)

Event Consumer (1 deployment)

Name	Replicas	Status	Purpose
consumer-retry-message	1/1	Running	Retry failed messages from DLQ

Workers (2 deployments)

Name	Replicas	Status	Purpose
wrk--default	1/1	Running	Default worker queue
wrk--notifications	1/1	Running	Notification processing

Scheduler (1 deployment)

Name	Replicas	Status	Purpose
cron--prod	1/1	Running	Scheduled cron jobs

Services

Name	Type	Cluster IP	Ports	NodePort	Purpose
pat--dlq--be--app--prod	NodePort	10.8.25.61	80	31871	Main DLQ API

Access & Management

View all resources:

kubectl get all -n pat--dlq--be

Check main application:

# View app pods (3 replicas)
kubectl get pods -n pat--dlq--be | grep "app--prod"

# View logs from one replica (kubectl picks a single pod of the deployment)
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be

# Same pod, including every container in it
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true

Check retry consumer:

kubectl get pods -n pat--dlq--be | grep retry
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

Check workers:

kubectl get pods -n pat--dlq--be | grep wrk
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be

Restart services:

# Restart main app (all 3 replicas)
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be

# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

# Restart all workers
kubectl get deployments -n pat--dlq--be | grep wrk | awk '{print $1}' | xargs -I {} kubectl rollout restart deployment/{} -n pat--dlq--be

Monitoring

Resource usage:

kubectl top pods -n pat--dlq--be --sort-by=memory
kubectl top pods -n pat--dlq--be --sort-by=cpu

Events:

kubectl get events -n pat--dlq--be --sort-by='.lastTimestamp' | tail -20

Data Flow

Failed Message Event
    ↓
pat--dlq--be--app--prod (NodePort 31871)
    ↓
Main DLQ API (3 replicas - HA)
    ↓
Dead Letter Queue (Kafka/Redpanda DLQ Topic)
    ↓
consumer-retry-message → Retry Logic
    ├─ Success → Republish to original queue
    └─ Fail → Keep in DLQ, alert/log
    ↓
Workers Process Background Jobs
    ↓
Cron Jobs → Scheduled DLQ cleanup/monitoring
    ↓
Failed message recovery, alerts

DLQ Workflow

1. DLQ API (High Availability)

3 replicas for redundancy
Receive failed messages from various services
Store failed messages for analysis
Provide UI/API for DLQ management
Manual retry triggers
DLQ monitoring and metrics

2. Message Retry Consumer

Automatically retry failed messages
consumer-retry-message processes retry logic
Exponential backoff strategy
Maximum retry attempts
Success: republish to original queue
Permanent failure: keep in DLQ with error details
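
The retry behaviour listed above (exponential backoff with a maximum number of attempts) can be sketched as follows. This is a minimal illustration, not the consumer's actual implementation: BASE_DELAY, MAX_DELAY, MAX_ATTEMPTS and both function names are assumptions chosen for the example.

```shell
#!/bin/sh
# Illustrative only: these values are assumptions, not the settings
# used by consumer-retry-message.
BASE_DELAY=2      # seconds before the first retry (assumed)
MAX_DELAY=300     # upper bound on any single delay (assumed)
MAX_ATTEMPTS=5    # retries allowed before a message stays in the DLQ (assumed)

# backoff_delay ATTEMPT: print the delay in seconds for a 0-based attempt.
# The delay doubles on every attempt and is capped at MAX_DELAY.
backoff_delay() {
  _delay=$(( $BASE_DELAY * (1 << $1) ))
  if [ "$_delay" -gt "$MAX_DELAY" ]; then
    _delay=$MAX_DELAY
  fi
  echo "$_delay"
}

# should_retry ATTEMPT: exit status 0 while retries remain, 1 otherwise.
should_retry() {
  [ "$1" -lt "$MAX_ATTEMPTS" ]
}

for attempt in 0 1 2 3 4 5; do
  if should_retry "$attempt"; then
    echo "attempt $attempt: retry after $(backoff_delay "$attempt")s"
  else
    echo "attempt $attempt: permanent failure, keep in DLQ"
  fi
done
```

With these assumed values the loop reports delays of 2, 4, 8, 16 and 32 seconds, then treats the sixth attempt as a permanent failure. A cap such as MAX_DELAY keeps long-lived messages from waiting indefinitely between attempts.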

3. Background Workers

wrk--notifications: Process notification failures
wrk--default: General DLQ processing
Alert on persistent failures
Generate failure reports

4. Scheduled Tasks

Cron jobs for DLQ maintenance
Cleanup old DLQ messages
Generate DLQ reports
Alert on high DLQ volumes

Production Considerations

High Availability

Well Configured:

Main API: 3 replicas for redundancy
Mature namespace (~2 years)

Single Points of Failure:

consumer-retry-message: 1 replica (critical for auto-retry)
All workers: 1 replica each
Cron job: 1 replica
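
The single-replica workloads listed above can be found mechanically from the deployment listing. The sketch below is a hedged example: the helper name single_replica_deployments is hypothetical, and the sample rows are illustrative data that mirror the replica counts in this document rather than captured cluster output.

```shell
#!/bin/sh
# Reads `kubectl get deployments --no-headers` output on stdin and prints the
# name of every deployment whose desired replica count is below 2.
# Column 2 (READY) has the form "ready/desired", for example "1/1".
single_replica_deployments() {
  awk '{ split($2, r, "/"); if (r[2] + 0 < 2) print $1 }'
}

# Usage against the live namespace (read-only):
#   kubectl get deployments -n pat--dlq--be --no-headers | single_replica_deployments

# Sample input (illustrative, not real cluster output).
sample='pat--dlq--be--app--prod 3/3 3 3 2y145d
pat--dlq--be--consumer-retry-message--prod 1/1 1 1 2y145d
pat--dlq--be--wrk--default--prod 1/1 1 1 2y145d
pat--dlq--be--wrk--notifications--prod 1/1 1 1 2y145d
pat--dlq--be--cron--prod 1/1 1 1 2y145d'

printf '%s\n' "$sample" | single_replica_deployments
```

On the sample input this lists the retry consumer, both workers and the cron deployment, and omits the 3-replica API.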

Recommendations

Retry Consumer Resilience:
- Currently 1 replica (single point of failure)
- Consider 2+ replicas for redundancy
- Critical for automatic failure recovery

Add Auto-Scaling:
- Consider HPA for main API (currently fixed 3)
- Add KEDA for retry consumer based on DLQ depth
- Scale during high failure periods

Worker Resilience:
- wrk--notifications: 1 replica (consider 2)
- wrk--default: 1 replica (consider 2)
- Important for DLQ processing

Monitoring Priorities:
- DLQ message volume
- Retry success rates
- Message age in DLQ
- Consumer lag (retry consumer)
- Failed message patterns

Alerting:
- High DLQ volume
- Old messages in DLQ
- Retry consumer failures
- Permanent failure patterns
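
The replica and HPA recommendations above could be applied with commands along the following lines. This is a sketch under stated assumptions, not a tested change plan: the run helper, the DRY_RUN switch and the HPA bounds (min 3, max 6, 70% CPU) are all hypothetical choices for illustration. By default the script only prints the commands so they can be reviewed before anything touches production.

```shell
#!/bin/sh
NS=pat--dlq--be
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only (default), 0 = execute them

# run CMD...: echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Give the retry consumer and both workers a second replica.
scale_single_replica_workloads() {
  for d in consumer-retry-message wrk--default wrk--notifications; do
    run kubectl scale "deployment/pat--dlq--be--$d--prod" -n "$NS" --replicas=2
  done
}

# CPU-based HPA for the main API; min, max and target are assumed values.
autoscale_main_api() {
  run kubectl autoscale deployment/pat--dlq--be--app--prod -n "$NS" \
    --min=3 --max=6 --cpu-percent=70
}

scale_single_replica_workloads
autoscale_main_api
```

The cron deployment is left out on purpose, since a second scheduler replica may run each job twice. Whether a second retry consumer actually shares the work depends on how the DLQ topic is partitioned, and a manual `kubectl scale` is typically overwritten by the next deploy unless the replica count is also changed in the deployment manifest.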

Troubleshooting

Main API issues:

# Check all 3 API pods
kubectl get pods -n pat--dlq--be | grep "app--prod"

# Check logs from one replica, all containers (kubectl picks a single pod of the deployment)
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true --tail=100

# Check specific pod
POD_NAME=$(kubectl get pods -n pat--dlq--be | grep "app--prod" | head -1 | awk '{print $1}')
kubectl logs "$POD_NAME" -n pat--dlq--be --tail=100

# Test API endpoint
kubectl port-forward -n pat--dlq--be service/pat--dlq--be--app--prod 8080:80
# Access http://localhost:8080

Retry consumer issues:

# Check retry consumer
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

# Check for retry errors
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=100 | grep -i "error\|retry\|fail"

# Check consumer resource usage
kubectl top pods -n pat--dlq--be | grep retry

# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

High DLQ volume:

# Check retry consumer logs for patterns
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=500 | grep -i "permanent\|max.*retry"

# Check API logs for DLQ write patterns
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=200 | grep -i "dlq\|failed"

# Check which services are generating failures (assumes log lines carry a "source:" field)
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=500 | grep -o "source:.*" | sort | uniq -c | sort -rn

Worker issues:

# Check notification worker
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be

# Check default worker
kubectl logs -f deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be

# Check for worker errors
kubectl logs deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be --tail=100 | grep -i "error\|fail"

# Restart workers
kubectl rollout restart deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be
kubectl rollout restart deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be

Cron job failures:

# Check cron pod
kubectl get pods -n pat--dlq--be | grep cron

# Check cron logs
kubectl logs -f deployment/pat--dlq--be--cron--prod -n pat--dlq--be

# Restart cron
kubectl rollout restart deployment/pat--dlq--be--cron--prod -n pat--dlq--be

Load distribution issues:

# Check resource usage across API replicas
kubectl top pods -n pat--dlq--be | grep app--prod

# Check logs from each replica
for pod in $(kubectl get pods -n pat--dlq--be | grep "app--prod" | awk '{print $1}'); do
  echo "=== $pod ==="
  kubectl logs "$pod" -n pat--dlq--be --tail=20
done

# Restart all to redistribute load
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be
