📝 Postmortem Report - Upstream Problem in Our Services
1. Summary
- Incident ID/Name: OCT‑05‑2025_UpstreamErrorsAndWorkerScaling
- Date & Time: October 5, 2025, from 13:54 to 15:55 (IRST)
- Duration: ~2 hours 1 minute
- Severity Level: SEV‑2 – Service degradation due to upstream request saturation
- Systems Affected: Core API services, background workers, downstream integrations
- Impact on Users/Business: Users experienced intermittent API failures and degraded response times. A portion of live traffic was affected as service instances exceeded configured worker capacity, resulting in temporary upstream timeout errors.
2. Incident Timeline
(Chronological, Iran time)
- 13:54 – Monitoring dashboards and logs indicated rising upstream error rates across multiple services.
- 14:10 – API latency began spiking; errors reached ~40% of requests.
- 14:25 – Engineering began investigating overloaded workers handling upstream calls.
- 14:45 – Identified that worker pool limits were insufficient under increased concurrent load.
- 15:10 – Worker count configuration updated and services restarted with higher concurrency capacity.
- 15:30 – Error rates dropped significantly; average latency returned to baseline values.
- 15:55 – All services confirmed healthy and fully operational. Incident closed.
3. Root Cause Analysis
- Immediate Cause: Insufficient worker pool capacity, leading to request queue saturation and upstream timeouts.
- Underlying Cause: Worker configuration was sized for average traffic levels and lacked dynamic auto‑scaling or adaptive concurrency management.
- Why It Wasn’t Prevented/Detected Earlier: Existing performance monitoring focused on CPU and memory utilization, not on worker queue backlog or request concurrency limits.
- Five Whys (a minimal sketch of this failure mode follows the list):
  - Users saw upstream errors → because queued requests timed out.
  - Requests queued → because all available workers were busy.
  - Workers were overloaded → because simultaneous traffic rose beyond the pool's capacity.
  - The worker pool could not scale → because its configuration was static under variable load.
  - The static configuration persisted → because a scaling policy had not yet been implemented at the service‑orchestrator level.
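The sketch below (Python, illustrative only; worker count, timings, and deadline are assumed values, not production numbers) reproduces the mechanism described in the Five Whys: a fixed-size pool serving more simultaneous requests than it has workers, so queued requests overrun their deadline and surface as upstream timeouts.

```python
# Illustrative failure mode: a static worker pool saturating under burst load.
# All numbers below are assumptions for the sketch, not production values.
import concurrent.futures
import time

WORKER_COUNT = 4          # static pool sized for average traffic
REQUEST_DEADLINE_S = 2.5  # caller gives up after this long
UPSTREAM_CALL_S = 1.0     # simulated upstream call duration
BURST_SIZE = 20           # simultaneous requests during the spike

def handle_request(i: int) -> int:
    time.sleep(UPSTREAM_CALL_S)  # stand-in for the real upstream call
    return i

def main() -> None:
    start = time.monotonic()
    ok = timed_out = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKER_COUNT) as pool:
        futures = [pool.submit(handle_request, i) for i in range(BURST_SIZE)]
        for future in concurrent.futures.as_completed(futures):
            future.result()
            # Time is measured from submission: requests that sat in the queue
            # past the deadline already look like timeouts to the caller,
            # even though a worker eventually ran them.
            if time.monotonic() - start > REQUEST_DEADLINE_S:
                timed_out += 1
            else:
                ok += 1
    print(f"ok={ok} timed_out={timed_out}")  # expect most of the burst to time out

if __name__ == "__main__":
    main()
```

Raising the worker count (or scaling out instances) shifts the completion waves earlier, which is effectively what the mitigation in section 5 did.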
4. Impact
- User Impact: Around 20–30% of active requests failed or timed out during the incident window.
- Internal Impact: Increased API latency and temporary queue buildup in background jobs; some user transactions were retried automatically.
- Customer Communication: Internal teams were notified; the incident was kept internal because resolution was quick and there was no data impact.
5. Resolution & Recovery
- What was done to restore service:
- Verified source of upstream timeout patterns in logs.
- Increased worker pool count on affected services (an illustrative configuration sketch follows this list).
- Restarted relevant processes to apply the new limits.
- Monitored error ratio and throughput until metrics stabilized.
- Time to Detection (TTD): ~15 minutes
- Time to Mitigation (TTM): ~45 minutes
- Time to Recovery (TTR): ~2 hours 1 minute
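As an illustration of the kind of change applied, here is a minimal worker-configuration sketch. The report does not name the actual server or framework, so this assumes a Gunicorn-style Python service; the setting names exist in Gunicorn, but the values and environment variables are placeholders.

```python
# gunicorn.conf.py -- illustrative sketch only; the real stack and values are
# not specified in this report.
import multiprocessing
import os

# Previously: a fixed worker count sized for average traffic.
# Mitigation-style change: derive the count from available CPUs and allow an
# environment override so capacity can be raised without a code change.
workers = int(os.getenv("WEB_WORKERS", multiprocessing.cpu_count() * 2 + 1))
threads = int(os.getenv("WEB_THREADS", "4"))        # per-worker thread concurrency
timeout = int(os.getenv("WORKER_TIMEOUT_S", "30"))  # recycle workers stuck this long (seconds)
```

A static bump like this addresses the immediate saturation; the longer-term fix the lessons below point at is orchestrator-level autoscaling driven by load rather than a hand-tuned constant.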
6. What Went Well
- Fast detection via automated error dashboards.
- Clear communication between DevOps and Backend after issue recognition.
- Service behavior normalized immediately after the configuration update.
7. What Went Wrong
- Lack of auto‑scaling parameters for worker threads.
- Missing early alerting on request queue depth or slow response trends.
- No recent load testing had been performed for peak‑traffic conditions (a minimal concurrency test is sketched below).
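The following is a sketch of the kind of lightweight concurrency test that would have surfaced the worker ceiling earlier. The target URL, concurrency level, and request count are placeholders, and a real load test would run against a staging environment with production-like worker settings.

```python
# Minimal concurrency smoke test (sketch only; not a substitute for a proper
# load-testing tool). TARGET_URL and the numbers below are placeholders.
import concurrent.futures
import time
import urllib.request

TARGET_URL = "https://staging.example.internal/healthz"  # hypothetical endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 500
TIMEOUT_S = 5.0

def one_request(_: int) -> float:
    """Issue a single request and return its latency in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(TARGET_URL, timeout=TIMEOUT_S):
        pass
    return time.monotonic() - start

def main() -> None:
    ok, failed, latencies = 0, 0, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = [pool.submit(one_request, i) for i in range(TOTAL_REQUESTS)]
        for future in concurrent.futures.as_completed(futures):
            try:
                latencies.append(future.result())
                ok += 1
            except Exception:  # timeouts and HTTP errors both count as failures here
                failed += 1
    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95)] if latencies else float("nan")
    print(f"ok={ok} failed={failed} p95={p95:.3f}s")

if __name__ == "__main__":
    main()
```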
8. Lessons Learned
- Dynamic scaling mechanisms should be set at the orchestrator level to handle short‑term traffic spikes.
- Real‑time alerts on worker utilization and request backlog can help detect congestion earlier (a sketch of such a check follows this list).
- Periodic performance stress testing and limit reviews help prevent recurrence during sudden surges.
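As a sketch of the alerting lesson, the check below flags congestion from worker utilization and queue depth. The metric source, field names, and thresholds are all assumptions; in practice this logic would live in the existing monitoring stack (for example, as rules behind the Grafana dashboards referenced below) rather than in application code.

```python
# Sketch of a congestion check driven by worker utilization and request backlog.
# Thresholds and the WorkerStats shape are assumptions for illustration.
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.85  # alert when >85% of workers are busy
BACKLOG_FACTOR = 2            # alert when the queue exceeds 2x the worker count

@dataclass
class WorkerStats:
    busy_workers: int
    total_workers: int
    queued_requests: int

def congestion_alerts(stats: WorkerStats) -> list[str]:
    """Return human-readable alert messages for any breached threshold."""
    alerts = []
    utilization = stats.busy_workers / max(stats.total_workers, 1)
    if utilization > UTILIZATION_THRESHOLD:
        alerts.append(f"worker utilization at {utilization:.0%}")
    if stats.queued_requests > BACKLOG_FACTOR * stats.total_workers:
        alerts.append(f"request backlog of {stats.queued_requests} exceeds "
                      f"{BACKLOG_FACTOR}x the worker count")
    return alerts

# Example: 30 of 32 workers busy with 80 queued requests trips both conditions.
print(congestion_alerts(WorkerStats(busy_workers=30, total_workers=32, queued_requests=80)))
```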
9. References
- Incident Log: #incident‑upstream‑worker‑2025‑10‑05
- Grafana Dashboards: API Error Rate / Worker Utilization
- Logs: /Users/arian/incidents/2025‑10‑05/upstream‑errors/