
📝 Postmortem Report - Upstream Problem in Our Services

1. Summary

  • Incident ID/Name:
    OCT‑05‑2025_UpstreamErrorsAndWorkerScaling
  • Date & Time:
    October 5, 2025, 13:54 to 15:55 (IRST)
  • Duration:
    ~2 hours 1 minute
  • Severity Level:
    SEV‑2 – Service degradation due to upstream request saturation
  • Systems Affected:
    Core API services, background workers, downstream integrations
  • Impact on Users/Business:
    Users experienced intermittent API failures and degraded response times. A portion of live traffic was affected as service instances exceeded configured worker capacity, resulting in temporary upstream timeout errors.

2. Incident Timeline

(Chronological, Iran time)

  • 13:54 – Monitoring dashboards and logs indicated rising upstream error rates across multiple services.
  • 14:10 – API latency began spiking; errors reached ~40% of requests.
  • 14:25 – Engineering began investigating the overloaded workers handling upstream calls.
  • 14:45 – Identified that worker pool limits were insufficient under increased concurrent load.
  • 15:10 – Worker count configuration updated and services restarted with higher concurrency capacity.
  • 15:30 – Error rates dropped significantly; average latency returned to baseline values.
  • 15:55 – All services confirmed healthy and fully operational. Incident closed.

3. Root Cause Analysis

  • Immediate Cause:
    Insufficient worker pool capacity, leading to request queue saturation and upstream timeouts.
  • Underlying Cause:
    Worker configuration was sized for average traffic levels, with no dynamic auto-scaling or adaptive concurrency management.
  • Why It Wasn’t Prevented/Detected Earlier:
    Existing performance monitoring focused on CPU and memory utilization, not on worker queue backlog or request concurrency limits.
  • Five Whys (see the saturation sketch after this list):
    1. Users saw upstream errors → because requests queued and timed out.
    2. Requests queued → because all available workers were busy.
    3. Workers were overloaded → because simultaneous traffic increased beyond planned capacity.
    4. The worker pool could not scale → because its size was a static configuration under variable load.
    5. The static configuration persisted → because a scaling policy had not yet been implemented at the service-orchestrator level.
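As a rough illustration of the saturation mechanism above, the sketch below applies a Little's-law style estimate: the concurrency needed is the arrival rate times the average time each request holds a worker, so a fixed pool falls behind once a spike pushes demand past its size. All numbers (pool size, latency, request rates) are hypothetical, not measurements from this incident.

```python
# Hypothetical back-of-envelope check: does a static worker pool keep up
# with a traffic spike? All numbers are illustrative, not incident data.

def required_workers(request_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's law: concurrency needed = arrival rate x time in system."""
    return request_rate_per_s * avg_latency_s

STATIC_WORKERS = 16   # assumed fixed pool size per instance
AVG_LATENCY_S = 0.8   # assumed average upstream call latency in seconds

for rate in (10, 20, 30, 40):  # hypothetical request rates (req/s) during the spike
    needed = required_workers(rate, AVG_LATENCY_S)
    status = "OK" if needed <= STATIC_WORKERS else "queueing -> timeouts"
    print(f"{rate:>3} req/s needs ~{needed:4.1f} workers "
          f"(pool = {STATIC_WORKERS}): {status}")
```

Once the required concurrency exceeds the pool size, every additional request waits behind busy workers, which is exactly the queue buildup and upstream timeout pattern observed here.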

4. Impact

  • User Impact:
    Around 20–30% of active requests failed or timed out during the incident window.
  • Internal Impact:
    Increased API latency, temporary queue buildup in background jobs; some user transactions retried automatically.
  • Customer Communication:
    Internal teams were notified; the incident was kept internal because it was resolved quickly and no data was affected.

5. Resolution & Recovery

  • What was done to restore service (see the configuration sketch at the end of this section):
    1. Verified the source of the upstream timeout patterns in the logs.
    2. Increased the worker pool count on the affected services.
    3. Restarted the relevant processes to apply the new limits.
    4. Monitored error ratio and throughput until the metrics stabilized.
  • Time to Detection (TTD): ~15 minutes
  • Time to Mitigation (TTM): ~45 minutes
  • Time to Recovery (TTR): ~2 hours 1 minute
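The report does not document the application server or the exact values changed, so the following is only a minimal sketch of the kind of change applied, assuming a Gunicorn-style Python service; the file name gunicorn_conf.py, the WEB_WORKERS/WEB_THREADS environment variables, and all numbers are hypothetical.

```python
# gunicorn_conf.py -- hypothetical example; the report does not document the
# actual server, file, or values used during the incident.
import multiprocessing
import os

# Before: a fixed pool sized for average traffic, e.g. `workers = 8`
# (illustrative only), which saturated under the spike.

# After mitigation: derive capacity from the host and allow an env override,
# so the limit can be raised without a code change.
workers = int(os.getenv("WEB_WORKERS", multiprocessing.cpu_count() * 2 + 1))
threads = int(os.getenv("WEB_THREADS", 4))

# Time out queued requests faster than callers' own deadlines so a saturated
# instance sheds load instead of building a silent backlog.
timeout = 30
graceful_timeout = 30
```

Deriving the worker count from the host and exposing an override keeps the limit adjustable per environment, which is the flexibility the static configuration lacked.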

6. What Went Well

  • Fast detection via automated error dashboards.
  • Clear communication between DevOps and Backend after issue recognition.
  • Service behavior normalized immediately after the configuration update.

7. What Went Wrong

  • Lack of auto-scaling parameters for the worker pool.
  • Missing early alerting on request queue depth or slow-response trends.
  • No recent load testing for peak traffic conditions (a minimal sketch of such a test follows this list).
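As one way to close that gap, the sketch below is a minimal concurrency smoke test, not the team's tooling; the target URL, concurrency level, and request count are placeholders.

```python
# Minimal concurrency smoke test (hypothetical target URL and numbers);
# a sketch of the missing practice, not a production load-testing tool.
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "https://api.example.internal/health"  # placeholder endpoint
CONCURRENCY = 50                                # assumed peak concurrency
REQUESTS = 500

def call_once(_: int) -> tuple[bool, float]:
    """Issue one request; return (success, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET, timeout=5) as resp:
            ok = resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(call_once, range(REQUESTS)))

failures = sum(1 for ok, _ in results if not ok)
latencies = sorted(d for _, d in results)
p95 = latencies[int(len(latencies) * 0.95)]
print(f"failure rate: {failures / REQUESTS:.1%}, p95 latency: {p95:.2f}s")
```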

8. Lessons Learned

  • Dynamic scaling mechanisms should be set at the orchestrator level to handle short-term traffic spikes.
  • Real-time alerts on worker utilization and request backlog can surface congestion earlier (a minimal alert-condition sketch follows below).
  • Periodic stress testing and worker-limit reviews help prevent recurrence during sudden traffic surges.
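A sustained-backlog alert condition could look like the sketch below; the threshold, window size, and metric source are assumptions and would need to be wired to the real worker-queue metric. Requiring the breach to persist across a full window avoids paging on momentary spikes.

```python
# Sketch of a sustained-backlog alert condition (threshold, window, and the
# metric source are assumptions; connect to the real metrics pipeline).
from collections import deque
from dataclasses import dataclass

@dataclass
class BacklogAlert:
    threshold: int = 100   # queued requests per instance (assumed)
    window: int = 5        # consecutive samples that must breach

    def __post_init__(self) -> None:
        self.samples: deque[int] = deque(maxlen=self.window)

    def observe(self, queue_depth: int) -> bool:
        """Record a sample; return True when the backlog has stayed above
        the threshold for the whole window (i.e. fire the alert)."""
        self.samples.append(queue_depth)
        return (len(self.samples) == self.window
                and all(s > self.threshold for s in self.samples))

alert = BacklogAlert()
for depth in (40, 120, 150, 180, 210, 260):   # illustrative samples
    if alert.observe(depth):
        print(f"ALERT: sustained worker backlog, latest depth={depth}")
```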

9. References

  • Incident Log: #incident‑upstream‑worker‑2025‑10‑05
  • Grafana Dashboards: API Error Rate / Worker Utilization
  • Logs: /Users/arian/incidents/2025‑10‑05/upstream‑errors/