pat--dlq--be

Overview

  • Namespace: pat--dlq--be
  • Purpose: Patient Dead Letter Queue Backend - PRODUCTION
  • Age: ~2 years 145 days (since June 2023)
  • Status: Active - Failed message handling and retry system
  • Workloads: 5 deployments (all active)
  • Environment: PRODUCTION - Message failure recovery

Architecture

Dead Letter Queue (DLQ) system handling failed messages and retry logic:

  • Main Application: REST API backend (3 replicas) - High Availability
  • Event Consumer: Retry failed messages (1 deployment)
  • Workers: Background job processing (2 deployments)
  • Scheduler: Cron jobs for scheduled tasks

Auto-Scaling Configuration

No Auto-Scaling Configured:

  • No HorizontalPodAutoscalers (HPAs)
  • No KEDA scaled objects
  • Fixed replica counts (Main app: 3, others: 1)

Workload Categories

Main Application (1 deployment)

Name | Replicas | Status | Purpose
pat--dlq--be--app--prod | 3/3 | Running | Main DLQ API (HA configured)

Event Consumer (1 deployment)

Name | Replicas | Status | Purpose
consumer-retry-message | 1/1 | Running | Retry failed messages from DLQ

Workers (2 deployments)

Name | Replicas | Status | Purpose
wrk--default | 1/1 | Running | Default worker queue
wrk--notifications | 1/1 | Running | Notification processing

Scheduler (1 deployment)

Name | Replicas | Status | Purpose
cron--prod | 1/1 | Running | Scheduled cron jobs

Services

Name | Type | Cluster IP | Ports | NodePort | Purpose
pat--dlq--be--app--prod | NodePort | 10.8.25.61 | 80 | 31871 | Main DLQ API

Access & Management

View all resources:

kubectl get all -n pat--dlq--be

Check main application:

# View app pods (3 replicas)
kubectl get pods -n pat--dlq--be | grep "app--prod"

# Follow logs (given a deployment, kubectl picks one of its pods)
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be

# Follow logs from every container in that pod
kubectl logs -f deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true

Check retry consumer:

kubectl get pods -n pat--dlq--be | grep retry
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

Check workers:

kubectl get pods -n pat--dlq--be | grep wrk
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be

Restart services:

# Restart main app (all 3 replicas)
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be

# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

# Restart all workers
kubectl get deployments -n pat--dlq--be | grep wrk | awk '{print $1}' | xargs -I {} kubectl rollout restart deployment/{} -n pat--dlq--be

Monitoring

Resource usage:

kubectl top pods -n pat--dlq--be --sort-by=memory
kubectl top pods -n pat--dlq--be --sort-by=cpu

Events:

# Sorted oldest first, so the most recent events are at the tail
kubectl get events -n pat--dlq--be --sort-by='.lastTimestamp' | tail -20

Data Flow

Failed Message Event
    ↓
pat--dlq--be--app--prod (NodePort 31871)
    ↓
Main DLQ API (3 replicas - HA)
    ↓
Dead Letter Queue (Kafka/Redpanda DLQ Topic)
    ↓
consumer-retry-message → Retry Logic
├─ Success → Republish to original queue
└─ Fail → Keep in DLQ, alert/log
    ↓
Workers Process Background Jobs
    ↓
Cron Jobs → Scheduled DLQ cleanup/monitoring
    ↓
Failed message recovery, alerts

DLQ Workflow

1. DLQ API (High Availability)

  • 3 replicas for redundancy
  • Receive failed messages from various services
  • Store failed messages for analysis
  • Provide UI/API for DLQ management
  • Manual retry triggers
  • DLQ monitoring and metrics

2. Message Retry Consumer

  • Automatically retry failed messages
  • consumer-retry-message processes retry logic
  • Exponential backoff strategy
  • Maximum retry attempts
  • Success: republish to original queue
  • Permanent failure: keep in DLQ with error details
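
The retry behaviour described in the bullets above can be sketched in shell. This is only an illustration of exponential backoff with a retry cap, not the consumer's actual implementation: the names `backoff_delay` and `retry_message`, and the values of `MAX_ATTEMPTS`, `BASE_DELAY`, and `MAX_DELAY`, are assumptions made for the example.

```shell
# Assumed values for illustration; the real limits are not documented here.
MAX_ATTEMPTS=5   # maximum retry attempts
BASE_DELAY=1     # seconds to wait after the first failed attempt
MAX_DELAY=30     # cap on any single wait

# Print the wait in seconds after failed attempt number $1:
# BASE_DELAY * 2^(attempt - 1), capped at MAX_DELAY.
backoff_delay() {
  _delay=$(( BASE_DELAY * (1 << ($1 - 1)) ))
  if [ "$_delay" -gt "$MAX_DELAY" ]; then
    _delay=$MAX_DELAY
  fi
  echo "$_delay"
}

# Run the given command up to MAX_ATTEMPTS times, waiting between attempts.
retry_message() {
  _attempt=1
  while [ "$_attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@"; then
      echo "republish"      # success: message goes back to the original queue
      return 0
    fi
    if [ "$_attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$(backoff_delay "$_attempt")"
    fi
    _attempt=$(( _attempt + 1 ))
  done
  echo "keep-in-dlq"        # permanent failure: message stays in the DLQ
  return 1
}
```

With these assumed values the waits between attempts are 1, 2, 4, and 8 seconds; the cap only takes effect from the sixth attempt onward.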

3. Background Workers

  • wrk--notifications: Process notification failures
  • wrk--default: General DLQ processing
  • Alert on persistent failures
  • Generate failure reports

4. Scheduled Tasks

  • Cron jobs for DLQ maintenance
  • Cleanup old DLQ messages
  • Generate DLQ reports
  • Alert on high DLQ volumes

Production Considerations

High Availability

Well Configured:

  • Main API: 3 replicas for redundancy
  • Mature namespace (~2 years)

Single Points of Failure:

  • consumer-retry-message: 1 replica (critical for auto-retry)
  • All workers: 1 replica each
  • Cron job: 1 replica
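
The single-replica workloads listed above can be picked out of the `kubectl get deployments` output with a small filter. This is a sketch: `flag_single_replica` is a hypothetical helper name, and it assumes the default table output, where the second column is READY in `ready/desired` form.

```shell
# Hypothetical helper: print deployments whose desired replica count is 1 or less.
# Reads `kubectl get deployments` table output on stdin (column 2 is READY, e.g. "1/1").
flag_single_replica() {
  awk 'NR > 1 {
    split($2, ready, "/")          # ready[1] = ready pods, ready[2] = desired pods
    if (ready[2] + 0 <= 1) print $1
  }'
}

# Usage against the cluster:
#   kubectl get deployments -n pat--dlq--be | flag_single_replica
```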

Recommendations

  1. Retry Consumer Resilience:

    • Currently 1 replica (single point of failure)
    • Consider 2+ replicas for redundancy
    • Critical for automatic failure recovery
  2. Add Auto-Scaling:

    • Consider HPA for main API (currently fixed 3)
    • Add KEDA for retry consumer based on DLQ depth
    • Scale during high failure periods
  3. Worker Resilience:

    • wrk--notifications: 1 replica (consider 2)
    • wrk--default: 1 replica (consider 2)
    • Important for DLQ processing
  4. Monitoring Priorities:

    • DLQ message volume
    • Retry success rates
    • Message age in DLQ
    • Consumer lag (retry consumer)
    • Failed message patterns
  5. Alerting:

    • High DLQ volume
    • Old messages in DLQ
    • Retry consumer failures
    • Permanent failure patterns
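
Recommendations 1 to 3 could be applied with commands along these lines. This is a sketch under assumptions: the replica count of 2 comes from the suggestions above, while the HPA bounds (`--min=3 --max=6`) and the 70% CPU target are illustrative values, not tuned settings. The `run` wrapper and `apply_recommendations` are hypothetical helpers that only print the commands unless `APPLY=1` is set.

```shell
NS="pat--dlq--be"

# Print the command by default; set APPLY=1 to actually execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

apply_recommendations() {
  # Recommendations 1 and 3: give the retry consumer and both workers a second replica.
  for name in consumer-retry-message wrk--notifications wrk--default; do
    run kubectl scale "deployment/${NS}--${name}--prod" --replicas=2 -n "$NS"
  done
  # Recommendation 2: CPU-based HPA for the main API (bounds and target are assumed values).
  run kubectl autoscale "deployment/${NS}--app--prod" -n "$NS" --min=3 --max=6 --cpu-percent=70
}

apply_recommendations
```

Two caveats: a CPU-based HPA only works if the pods declare CPU requests and a metrics source such as metrics-server is available, and a second retry consumer only shares the work if the DLQ topic has more than one partition (assuming it reads through a Kafka/Redpanda consumer group). The cron deployment is left out on purpose, since a second replica could run scheduled jobs twice.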

Troubleshooting

Main API issues:

# Check all 3 API pods
kubectl get pods -n pat--dlq--be | grep "app--prod"

# Check recent logs (given a deployment, kubectl picks one of its pods)
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --all-containers=true --tail=100

# Check a specific pod (here: the first app pod listed)
POD_NAME=$(kubectl get pods -n pat--dlq--be | grep "app--prod" | head -1 | awk '{print $1}')
kubectl logs "$POD_NAME" -n pat--dlq--be --tail=100

# Test API endpoint
kubectl port-forward -n pat--dlq--be service/pat--dlq--be--app--prod 8080:80
# Access http://localhost:8080

Retry consumer issues:

# Check retry consumer
kubectl logs -f deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

# Check for retry errors
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=100 | grep -i "error\|retry\|fail"

# Check consumer resource usage
kubectl top pods -n pat--dlq--be | grep retry

# Restart retry consumer
kubectl rollout restart deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be

High DLQ volume:

# Check retry consumer logs for patterns
kubectl logs deployment/pat--dlq--be--consumer-retry-message--prod -n pat--dlq--be --tail=500 | grep -i "permanent\|max.*retry"

# Check API logs for DLQ write patterns
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=200 | grep -i "dlq\|failed"

# Check which services are generating failures (assumes log lines contain a "source:" field)
kubectl logs deployment/pat--dlq--be--app--prod -n pat--dlq--be --tail=500 | grep -o "source:.*" | sort | uniq -c | sort -rn

Worker issues:

# Check notification worker
kubectl logs -f deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be

# Check default worker
kubectl logs -f deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be

# Check for worker errors
kubectl logs deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be --tail=100 | grep -i "error\|fail"

# Restart workers
kubectl rollout restart deployment/pat--dlq--be--wrk--notifications--prod -n pat--dlq--be
kubectl rollout restart deployment/pat--dlq--be--wrk--default--prod -n pat--dlq--be

Cron job failures:

# Check cron pod
kubectl get pods -n pat--dlq--be | grep cron

# Check cron logs
kubectl logs -f deployment/pat--dlq--be--cron--prod -n pat--dlq--be

# Restart cron
kubectl rollout restart deployment/pat--dlq--be--cron--prod -n pat--dlq--be

Load distribution issues:

# Check resource usage across API replicas
kubectl top pods -n pat--dlq--be | grep app--prod

# Check logs from each replica
for pod in $(kubectl get pods -n pat--dlq--be | grep "app--prod" | awk '{print $1}'); do
  echo "=== $pod ==="
  kubectl logs "$pod" -n pat--dlq--be --tail=20
done

# Restart all to redistribute load
kubectl rollout restart deployment/pat--dlq--be--app--prod -n pat--dlq--be
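
To judge whether load is uneven before restarting, the CPU figures from `kubectl top pods` can be summarised across the API replicas. This is a sketch: `cpu_spread` is a hypothetical helper name, and it assumes the CPU column is reported in millicores with an `m` suffix, as in the usual `kubectl top` table output.

```shell
# Hypothetical helper: summarise CPU usage across the app--prod pods.
# Reads `kubectl top pods` table output on stdin (column 2 is CPU, e.g. "12m").
cpu_spread() {
  awk '/app--prod/ {
    v = $2
    sub(/m$/, "", v)               # strip the millicore suffix
    v += 0
    if (n == 0 || v < min) min = v
    if (n == 0 || v > max) max = v
    n++
  }
  END { if (n) printf "pods=%d min=%dm max=%dm\n", n, min, max }'
}

# Usage against the cluster:
#   kubectl top pods -n pat--dlq--be | cpu_spread
```

A large gap between `min` and `max` suggests that one replica is taking most of the traffic.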

Performance Metrics

Current Scale

  • Main API: 3 replicas (good HA)
  • Retry Consumer: 1 replica
  • Workers: 2 workers at 1 replica each
  • Total Active Pods: ~7 pods

Stability

  • Namespace Age: ~2 years (mature, stable)
  • Most Recent Update: 205 days ago (stable)
  • HA Configuration: 3 replicas for main API