📝 Postmortem Report: Vehicle Add API 500 Error

1. Summary

  • Incident ID/Name: Vehicle Add API 500 Error via Arvan Proxy
  • Date & Time: Sept 21, 2025, 6:00 – 12:00 IRST
  • Duration: ~6 hours
  • Severity Level: SEV2
  • Systems Affected: CitizenTraffic Front API – /Vehicle/add endpoint
  • Impact on Users/Business: Users were unable to add vehicles via the API. Other APIs functioned normally. No significant revenue impact reported.

2. Incident Timeline

(Chronological log of events, IRST)

  • 10:00 – Monitoring alerts detected 500 errors on /Vehicle/add.
  • 10:02 – Alert acknowledged by SRE team.
  • 10:05 – Initial investigation started; logs showed HTTP 500 responses from upstream service.
  • 10:10 – Hypothesis formed: issue related to x-real-ip header.
  • 10:15 – DevOps removed the x-real-ip header from requests at the Arvan proxy.
  • 10:18 – API tested successfully; 200 OK responses confirmed.
  • 10:30 – Incident declared resolved; normal monitoring resumed.

3. Root Cause Analysis

  • Immediate Cause: HTTP 500 errors triggered when x-real-ip header was included in requests through Arvan proxy.
  • Underlying Cause: Certain backend logic in the /Vehicle/add service could not properly handle the x-real-ip header value.
  • Why It Wasn’t Prevented/Detected Earlier:
      • This header combination was not part of standard integration testing.
      • The Arvan proxy monitoring setup did not simulate the x-real-ip header in QA.
  • Five Whys (optional):
      • Why did the API return 500? → It failed when the x-real-ip header was present.
      • Why did it fail on x-real-ip? → The backend service did not validate or parse it correctly.
      • Why was this header sent? → The Arvan proxy automatically injects x-real-ip.
      • Why wasn’t it caught in testing? → QA tests didn’t include proxy headers.
      • Why was the header not handled? → The backend lacked defensive coding for unexpected header values.
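
The defensive handling named in the last "why" could look roughly like the sketch below. It is an illustration only, not the actual /Vehicle/add code: the helper name parse_real_ip and the dict-of-headers input are assumptions, and this report does not document exactly how the backend failed. The idea is that a missing, empty, or malformed x-real-ip value yields None instead of an unhandled exception.

```python
import ipaddress
from typing import Mapping, Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def parse_real_ip(headers: Mapping[str, str]) -> Optional[IPAddress]:
    """Return the client IP from x-real-ip, or None if absent or malformed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("x-real-ip")
    if raw is None:
        return None
    # In case a comma-separated list arrives (as with X-Forwarded-For),
    # keep only the first entry.
    candidate = raw.split(",")[0].strip()
    try:
        return ipaddress.ip_address(candidate)
    except ValueError:
        # Malformed value: treat the client IP as unknown rather than
        # letting the exception turn into a 500 response.
        return None
```

A caller would then branch on None (for example, skip IP-based logic or log a warning) rather than assume the header is always a valid address.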

4. Impact

  • User Impact: Users attempting to add vehicles experienced failures (~100 requests failed).
  • Internal Impact: On-call SRE and DevOps teams were involved for ~30 minutes.
  • Customer Communication: No external communication required; internal teams informed via Slack.

5. Resolution & Recovery

  • What was done to restore service: x-real-ip header removed from requests by DevOps; API tested and returned 200 responses.
  • Time to Detection (TTD): 2 minutes
  • Time to Mitigation (TTM): 8 minutes
  • Time to Recovery (TTR): 18 minutes

6. What Went Well

  • Monitoring alerts triggered promptly.
  • Rapid cross-team collaboration between SRE and DevOps.
  • Fix applied quickly, without taking the service down.

7. What Went Wrong

  • API failed with a 500 error due to an unhandled header.
  • Proxy header behavior not included in testing/QA.
  • Lack of defensive code handling unexpected headers.

8. Action Items

(Each action must have an owner + due date)

  • [ ] Add automated tests to simulate x-real-ip and other proxy headers (Owner: SRE, Due: Oct 5)
  • [ ] Monitor Arvan proxy header behavior for all APIs (Owner: SRE, Due: Sept 30)
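
The automated-test action item could start from a table-driven check like the sketch below. Everything in it is assumed for illustration: handle_vehicle_add is a stand-in for the real endpoint handler, the "plate" field is a hypothetical request field, and the header values are examples of what a reverse proxy might send, not captured Arvan traffic. The check reports every header variant that produces a 5xx status or an unhandled exception.

```python
import ipaddress


def handle_vehicle_add(headers: dict, body: dict) -> int:
    """Stand-in for the real /Vehicle/add handler (hypothetical); returns an HTTP status."""
    raw_ip = headers.get("x-real-ip", "")
    try:
        ipaddress.ip_address(raw_ip.strip())
    except ValueError:
        pass  # An unusable client IP is ignored rather than treated as fatal.
    return 200 if body.get("plate") else 400


# Example header variants a reverse proxy might add (illustrative values only).
PROXY_HEADER_CASES = [
    {},
    {"x-real-ip": "203.0.113.7"},
    {"x-real-ip": "2001:db8::1"},
    {"x-real-ip": ""},
    {"x-real-ip": "203.0.113.7, 10.0.0.1"},
    {"x-real-ip": "unknown"},
    {"x-real-ip": "203.0.113.7", "x-forwarded-for": "203.0.113.7"},
]


def failing_header_cases(handler, cases, body):
    """Return the header sets for which the handler answers 5xx or raises."""
    failures = []
    for headers in cases:
        try:
            status = handler(headers, body)
        except Exception:
            status = 500  # An unhandled exception would surface as a 500.
        if status >= 500:
            failures.append(headers)
    return failures
```

In a CI suite, asserting that failing_header_cases(handle_vehicle_add, PROXY_HEADER_CASES, {"plate": "TEST-123"}) returns an empty list would fail the build whenever a proxy header variant breaks the handler.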

9. Lessons Learned

  • Always include reverse proxy headers in QA testing.
  • Defensive coding prevents minor header issues from causing 500 errors.
  • Cross-team monitoring and alerting proved effective and remain essential.

10. References

  • Telegram Group: DevOps
  • Ali Peyman (DevOps Lead)