📝 Postmortem Report - Upstream Problem in Our Services
1. Summary
- Incident ID/Name: OCT‑05‑2025_UpstreamErrorsAndWorkerScaling
- Date & Time: October 5, 2025, from 13:54 to 15:55 (IRST)
- Duration: ~2 hours 1 minute
- Severity Level: SEV‑2 – Service degradation due to upstream request saturation
- Systems Affected: Core API services, background workers, downstream integrations
- Impact on Users/Business: Users experienced intermittent API failures and degraded response times. A portion of live traffic was affected as service instances exceeded configured worker capacity, resulting in temporary upstream timeout errors.
2. Incident Timeline
(Chronological, Iran time)
- 13:54 – Monitoring dashboards and logs indicated rising upstream error rates across multiple services.
- 14:10 – API latency began spiking; errors reached ~40% of requests.
- 14:25 – Engineering began investigating overloaded workers handling upstream calls.
- 14:45 – Identified that worker pool limits were insufficient under increased concurrent load.
- 15:10 – Worker count configuration updated and services restarted with higher concurrency capacity.
- 15:30 – Error rates dropped significantly; average latency returned to baseline values.
- 15:55 – All services confirmed healthy and fully operational. Incident closed.
3. Root Cause Analysis
- Immediate Cause: Insufficient worker pool capacity, leading to request queue saturation and upstream timeouts.
- Underlying Cause: Worker configuration was sized for average traffic levels and lacked dynamic auto‑scaling or adaptive concurrency management.
- Why It Wasn’t Prevented/Detected Earlier: Existing performance monitoring focused on CPU and memory utilization, not on worker queue backlog or request concurrency limits.
- Five Whys (a minimal sketch of this failure mode follows the list):
  - Users saw upstream errors → because queued requests timed out.
  - Requests queued → because all available workers were busy.
  - Workers were overloaded → because simultaneous traffic rose beyond the pool's capacity.
  - The worker pool could not scale → because its configuration was static under variable load.
  - The static configuration persisted → because a scaling policy had not yet been implemented at the service‑orchestrator level.
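The sketch below (Python, illustrative only; worker count, timings, and deadline are assumed values, not production numbers) reproduces the mechanism described in the Five Whys: a fixed-size pool serving more simultaneous requests than it has workers, so queued requests overrun their deadline and surface as upstream timeouts.

```python
# Illustrative failure mode: a static worker pool saturating under burst load.
# All numbers below are assumptions for the sketch, not production values.
import concurrent.futures
import time

WORKER_COUNT = 4          # static pool sized for average traffic
REQUEST_DEADLINE_S = 2.5  # caller gives up after this long
UPSTREAM_CALL_S = 1.0     # simulated upstream call duration
BURST_SIZE = 20           # simultaneous requests during the spike

def handle_request(i: int) -> int:
    time.sleep(UPSTREAM_CALL_S)  # stand-in for the real upstream call
    return i

def main() -> None:
    start = time.monotonic()
    ok = timed_out = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKER_COUNT) as pool:
        futures = [pool.submit(handle_request, i) for i in range(BURST_SIZE)]
        for future in concurrent.futures.as_completed(futures):
            future.result()
            # Time is measured from submission: requests that sat in the queue
            # past the deadline already look like timeouts to the caller,
            # even though a worker eventually ran them.
            if time.monotonic() - start > REQUEST_DEADLINE_S:
                timed_out += 1
            else:
                ok += 1
    print(f"ok={ok} timed_out={timed_out}")  # expect most of the burst to time out

if __name__ == "__main__":
    main()
```

Raising the worker count (or scaling out instances) shifts the completion waves earlier, which is effectively what the mitigation in section 5 did.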
4. Impact
- User Impact: Around 20–30% of active requests failed or timed out during the incident window.
- Internal Impact: Increased API latency and temporary queue buildup in background jobs; some user transactions were retried automatically.
- Customer Communication: Internal teams were notified; the incident was kept internal because resolution was quick and there was no data impact.
5. Resolution & Recovery
- What was done to restore service:
- Verified source of upstream timeout patterns in logs.
- Increased worker pool count on affected services (an illustrative configuration sketch follows this list).
- Restarted relevant processes to apply the new limits.
- Monitored error ratio and throughput until metrics stabilized.
- Time to Detection (TTD): ~15 minutes
- Time to Mitigation (TTM): ~45 minutes
- Time to Recovery (TTR): ~2 hours 1 minute
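As an illustration of the kind of change applied, here is a minimal worker-configuration sketch. The report does not name the actual server or framework, so this assumes a Gunicorn-style Python service; the setting names exist in Gunicorn, but the values and environment variables are placeholders.

```python
# gunicorn.conf.py -- illustrative sketch only; the real stack and values are
# not specified in this report.
import multiprocessing
import os

# Previously: a fixed worker count sized for average traffic.
# Mitigation-style change: derive the count from available CPUs and allow an
# environment override so capacity can be raised without a code change.
workers = int(os.getenv("WEB_WORKERS", multiprocessing.cpu_count() * 2 + 1))
threads = int(os.getenv("WEB_THREADS", "4"))        # per-worker thread concurrency
timeout = int(os.getenv("WORKER_TIMEOUT_S", "30"))  # recycle workers stuck this long (seconds)
```

A static bump like this addresses the immediate saturation; the longer-term fix the lessons below point at is orchestrator-level autoscaling driven by load rather than a hand-tuned constant.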
6. What Went Well
- Fast detection via automated error dashboards.
- Clear communication between DevOps and Backend after issue recognition.
- Service behavior normalized immediately after the configuration update.
7. What Went Wrong
- Lack of auto‑scaling parameters for worker threads.
- Missing early alerting on request queue depth or slow response trends.
- No recent load testing had been performed for peak‑traffic conditions (a minimal concurrency test is sketched below).
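The following is a sketch of the kind of lightweight concurrency test that would have surfaced the worker ceiling earlier. The target URL, concurrency level, and request count are placeholders, and a real load test would run against a staging environment with production-like worker settings.

```python
# Minimal concurrency smoke test (sketch only; not a substitute for a proper
# load-testing tool). TARGET_URL and the numbers below are placeholders.
import concurrent.futures
import time
import urllib.request

TARGET_URL = "https://staging.example.internal/healthz"  # hypothetical endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 500
TIMEOUT_S = 5.0

def one_request(_: int) -> float:
    """Issue a single request and return its latency in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(TARGET_URL, timeout=TIMEOUT_S):
        pass
    return time.monotonic() - start

def main() -> None:
    ok, failed, latencies = 0, 0, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = [pool.submit(one_request, i) for i in range(TOTAL_REQUESTS)]
        for future in concurrent.futures.as_completed(futures):
            try:
                latencies.append(future.result())
                ok += 1
            except Exception:  # timeouts and HTTP errors both count as failures here
                failed += 1
    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)] if latencies else float("nan")
    print(f"ok={ok} failed={failed} p95={p95:.3f}s")

if __name__ == "__main__":
    main()
```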
8. Lessons Learned
- Dynamic scaling mechanisms should be set at the orchestrator level to handle short‑term traffic spikes.
- Real‑time alerts on worker utilization and request backlog can help detect congestion earlier (a sketch of such a check follows this list).
- Periodic performance stress testing and limit reviews help prevent recurrence during sudden surges.
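As a sketch of the alerting lesson, the check below flags congestion from worker utilization and queue depth. The metric source, field names, and thresholds are all assumptions; in practice this logic would live in the existing monitoring stack (for example, as rules behind the Grafana dashboards referenced below) rather than in application code.

```python
# Sketch of a congestion check driven by worker utilization and request backlog.
# Thresholds and the WorkerStats shape are assumptions for illustration.
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.85  # alert when >85% of workers are busy
BACKLOG_FACTOR = 2            # alert when the queue exceeds 2x the worker count

@dataclass
class WorkerStats:
    busy_workers: int
    total_workers: int
    queued_requests: int

def congestion_alerts(stats: WorkerStats) -> list[str]:
    """Return human-readable alert messages for any breached threshold."""
    alerts = []
    utilization = stats.busy_workers / max(stats.total_workers, 1)
    if utilization > UTILIZATION_THRESHOLD:
        alerts.append(f"worker utilization at {utilization:.0%}")
    if stats.queued_requests > BACKLOG_FACTOR * stats.total_workers:
        alerts.append(f"request backlog of {stats.queued_requests} exceeds "
                      f"{BACKLOG_FACTOR}x the worker count")
    return alerts

# Example: 30 of 32 workers busy with 80 queued requests trips both conditions.
print(congestion_alerts(WorkerStats(busy_workers=30, total_workers=32, queued_requests=80)))
```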
9. References
- Incident Log: #incident‑upstream‑worker‑2025‑10‑05
- Grafana Dashboards: API Error Rate / Worker Utilization
- Logs: /Users/arian/incidents/2025‑10‑05/upstream‑errors/