📝 Postmortem Report: Vehicle Add API 500 Error
1. Summary
- Incident ID/Name: Vehicle Add API 500 Error via Arvan Proxy
- Date & Time: Sept 21, 2025, 6:00 – 12:00 IRST
- Duration: ~6 hours
- Severity Level: SEV2
- Systems Affected: CitizenTraffic Front API, /Vehicle/add endpoint
- Impact on Users/Business: Users were unable to add vehicles via the API. Other APIs functioned normally. No significant revenue impact reported.
2. Incident Timeline
(Chronological log of events, IRST)
- 10:00 – Monitoring alerts detected 500 errors on /Vehicle/add.
- 10:02 – Alert acknowledged by SRE team.
- 10:05 – Initial investigation started; logs showed HTTP 500 responses from upstream service.
- 10:10 – Hypothesis formed: issue related to x-real-ip header.
- 10:15 – DevOps removed x-real-ip header from request via Arvan proxy.
- 10:18 – API tested successfully; 200 OK responses confirmed.
- 10:30 – Incident declared resolved; normal monitoring resumed.
3. Root Cause Analysis
- Immediate Cause: HTTP 500 errors triggered when x-real-ip header was included in requests through Arvan proxy.
- Underlying Cause: Certain backend logic in the /Vehicle/add service could not properly handle the x-real-ip header value.
- Why It Wasn’t Prevented/Detected Earlier:
  - This header combination was not part of standard integration testing.
  - Proxy (Arvan) monitoring setup did not simulate x-real-ip header in QA.
- Five Whys (optional):
  - Why did the API return 500? → It failed when x-real-ip header was present.
  - Why did it fail on x-real-ip? → Backend service did not validate or parse it correctly.
  - Why was this header sent? → Arvan proxy automatically injects x-real-ip.
  - Why wasn’t it caught in testing? → QA tests didn’t include proxy headers.
  - Why was the header not handled? → Backend lacked defensive coding for unexpected header values.
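The "defensive coding" gap named in the last answer can be illustrated with a minimal Python sketch. The actual language and parsing logic of the /Vehicle/add backend are not described in this report, so the function name `parse_real_ip` and its behavior are assumptions for illustration only. The idea is that a missing or malformed x-real-ip value is treated as "no client IP" instead of raising an unhandled exception that surfaces as a 500.

```python
import ipaddress
from typing import Mapping, Optional


def parse_real_ip(headers: Mapping[str, str]) -> Optional[str]:
    """Return a normalized client IP from x-real-ip, or None if unusable.

    Hypothetical helper: not taken from the /Vehicle/add codebase.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    raw = next((v for k, v in headers.items() if k.lower() == "x-real-ip"), None)
    if raw is None:
        return None
    # Tolerate a comma-separated list (the X-Forwarded-For style) by
    # taking the first entry, and ignore surrounding whitespace.
    candidate = raw.split(",")[0].strip()
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        # Malformed value: treat the header as absent instead of failing.
        return None
```

With this shape, callers decide what to do when the result is `None` (for example, fall back to the socket peer address), and no header value can crash the request on its own.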
4. Impact
- User Impact: Users attempting to add vehicles experienced failures (~100 requests failed).
- Internal Impact: On-call SRE and DevOps teams were involved for ~30 minutes.
- Customer Communication: No external communication required; internal teams informed via Slack.
5. Resolution & Recovery
- What was done to restore service: x-real-ip header removed from requests by DevOps; API tested and returned 200 responses.
- Time to Detection (TTD): 2 minutes
- Time to Mitigation (TTM): 8 minutes
- Time to Recovery (TTR): 18 minutes
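The mitigation itself was applied in the Arvan proxy configuration, which is not reproduced here. As a rough application-side equivalent, the following Python sketch shows a WSGI middleware that drops the header before the application sees it. The class name `DropRealIpHeader` is hypothetical, and it is an assumption that the affected service could be wrapped this way; the report does not state which framework it uses.

```python
from typing import Callable, Iterable


class DropRealIpHeader:
    """WSGI middleware that removes x-real-ip before the app sees the request.

    Hypothetical sketch: the real fix was made at the Arvan proxy.
    """

    def __init__(self, app: Callable):
        self.app = app

    def __call__(self, environ: dict, start_response: Callable) -> Iterable[bytes]:
        # WSGI exposes the x-real-ip request header as HTTP_X_REAL_IP.
        environ.pop("HTTP_X_REAL_IP", None)
        return self.app(environ, start_response)
```

A stopgap like this only hides the header; the backend still needs to handle the value safely once the header is restored.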
6. What Went Well
- Monitoring alerts triggered promptly.
- Rapid cross-team collaboration between SRE and DevOps.
- Quick fix applied at the proxy layer, with no downtime for other APIs.
7. What Went Wrong
- API failed with a 500 error due to an unhandled header.
- Proxy header behavior not included in testing/QA.
- Lack of defensive code handling unexpected headers.
8. Action Items
(Each action must have an owner + due date)
- [ ] Add automated tests to simulate x-real-ip and other proxy headers (Owner: SRE, Due: Oct 5)
- [ ] Monitor Arvan proxy header behavior for all APIs (Owner: SRE, Due: Sept 30)
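One possible shape for the automated header tests is sketched below in Python. The list of header values and the `find_server_errors` helper are illustrative assumptions, not an existing test suite. The `send` argument stands in for whatever function issues a request to the endpoint under test and returns its HTTP status code.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical x-real-ip values a proxy or client might deliver.
HEADER_VARIANTS: List[Optional[str]] = [
    None,                      # header absent
    "203.0.113.7",             # plain IPv4
    "2001:db8::1",             # IPv6
    "203.0.113.7:443",         # IP with port
    "203.0.113.7, 10.0.0.1",   # comma-separated list
    "",                        # empty value
    "unknown",                 # non-IP token
]


def find_server_errors(send: Callable[[Dict[str, str]], int]) -> List[Optional[str]]:
    """Call `send` once per variant and return the variants that yield 5xx."""
    failures: List[Optional[str]] = []
    for value in HEADER_VARIANTS:
        headers = {} if value is None else {"x-real-ip": value}
        status = send(headers)
        if 500 <= status <= 599:
            failures.append(value)
    return failures
```

In a real test, `send` would post a valid payload to /Vehicle/add in a staging environment, and the test would assert that `find_server_errors(send)` returns an empty list. A 4xx response to a malformed header is acceptable under this check; only 5xx responses count as failures.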
9. Lessons Learned
- Always include reverse proxy headers in QA testing.
- Defensive coding prevents minor header issues from causing 500 errors.
- Cross-team monitoring and alerting proved effective and remain essential.
10. References
- Telegram Group: DevOps
- Dear Ali Peyman (DevOps Lead)