📝 Postmortem Report - Fava HTTP 429 Rate Limiting on Our IP
1. Summary
- Incident ID/Name:
OCT‑05‑2025_Fava429RateLimit
- Date & Time:
Start: October 5, 2025 – 6:30 PM (IRST)
End: October 6, 2025 – 11:18 AM (IRST)
- Duration:
~16 hours 48 minutes
- Severity Level:
SEV‑2 – Partial outage due to third‑party rate‑limiting
- Systems Affected:
All services dependent on Fava APIs through IP 81.12.28.40
- Impact on Users/Business:
Outbound requests to Fava services failed with HTTP 429 (Too Many Requests). Several features relying on external validation and payment services were unavailable or degraded for nearly 17 hours.
2. Incident Timeline
(Chronological, Iran time)
- 18:30 – Oct 5: Sudden rise in 429 errors from Fava API endpoints detected in logs and monitoring dashboards.
- 19:00: Engineering verified repeated 429 status responses from Fava on IP 81.12.28.40.
- 19:20: Incident confirmed as external rate‑limit enforcement by Fava infrastructure.
- 20:00: Contact initiated with Fava support and infrastructure representatives.
- 21:10: Fava team acknowledged abnormal throttling on our IP and began a review.
- 00:15 – Oct 6: Ongoing communication with Fava; rate‑limits remained active overnight.
- 10:45: Fava engineering confirmed adjustment of rate‑limit configuration.
- 11:18: Normal traffic fully restored; services returned to expected throughput and latency.
3. Root Cause Analysis
- Immediate Cause:
Our assigned IP 81.12.28.40 was unintentionally subjected to a strict rate‑limit (HTTP 429) by Fava’s gateway.
- Underlying Cause:
Either a misconfiguration in Fava’s rate‑limit policy or an automated abuse‑prevention system that incorrectly flagged our IP as exceeding thresholds.
- Why It Wasn’t Prevented/Detected Earlier:
Alerts for 5xx errors existed, but not for non‑fatal HTTP 4xx classes like 429. Therefore, the issue was visible only once full API failures accumulated.
- Five Whys:
- Requests failed → because Fava returned 429.
- Fava returned 429 → because our IP was rate‑limited.
- IP was limited → due to incorrect external policy at Fava.
- Policy went unnoticed → external provider change not monitored via synthetic calls.
- Missing early detection → no thresholds for sustained 4xx patterns in observability stack.
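The missing detection step identified above could be covered by a check on the share of 429 responses in recent outbound traffic. The following is a minimal sketch only: the class name `RateLimitMonitor`, the window size, and the threshold are illustrative assumptions, not values from our observability stack, and a real deployment would more likely express the same rule in the monitoring system itself.

```python
from collections import deque


class RateLimitMonitor:
    """Tracks the share of HTTP 429 responses over the most recent N calls."""

    def __init__(self, window_size=200, threshold=0.05):
        # window_size and threshold are illustrative assumptions, not tuned values.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Store the status code of one outbound response."""
        self.statuses.append(status_code)

    def ratio_429(self):
        """Return the fraction of 429 responses in the current window."""
        if not self.statuses:
            return 0.0
        return sum(1 for s in self.statuses if s == 429) / len(self.statuses)

    def should_alert(self):
        """True once the window is full and the 429 share exceeds the threshold."""
        window_full = len(self.statuses) == self.statuses.maxlen
        return window_full and self.ratio_429() > self.threshold
```

Requiring a full window before alerting avoids firing on the first few calls after a restart; the trade-off is a short delay on low-traffic paths.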
4. Impact
- User Impact:
API functionality that relied on Fava’s external service failed completely; an estimated 35–40% of total traffic (the flows dependent on Fava endpoints) was affected.
- Internal Impact:
Engineering and DevOps teams remained on escalated duty overnight; feature testing and planned QA were delayed.
- Customer Communication:
Users experienced degraded transactions; internal status updates were posted in the operations channel. Coordination was handled directly with Fava support, with no public announcement.
5. Resolution & Recovery
- What was done to restore service:
- Confirmed consistent 429 returns through log analysis and direct curl checks.
- Escalated issue with Fava engineering staff.
- Verified removal of the rate‑limit rule after confirmation.
- Validated normal API throughput and latency post‑fix.
- Time to Detection (TTD): ~30 minutes
- Time to Mitigation (TTM): ~16 hours (dependent on Fava revision)
- Time to Recovery (TTR): ~16 hours 48 minutes
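As a cross-check, the three figures above can be recomputed from the timeline timestamps. This sketch assumes that the 19:00 verification counts as detection and the 10:45 configuration adjustment by Fava counts as mitigation; the variable and function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Iran Standard Time is UTC+03:30.
IRST = timezone(timedelta(hours=3, minutes=30))


def at(day, hour, minute):
    """Build an October 2025 timestamp in IRST."""
    return datetime(2025, 10, day, hour, minute, tzinfo=IRST)


def fmt(delta):
    """Format a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


started = at(5, 18, 30)    # first rise in 429 errors
detected = at(5, 19, 0)    # engineering verified the 429s (assumed detection point)
mitigated = at(6, 10, 45)  # Fava adjusted the rate-limit (assumed mitigation point)
recovered = at(6, 11, 18)  # normal traffic fully restored

ttd = detected - started
ttm = mitigated - started
ttr = recovered - started

print("TTD:", fmt(ttd))  # 0h 30m
print("TTM:", fmt(ttm))  # 16h 15m
print("TTR:", fmt(ttr))  # 16h 48m
```

Under these assumptions the mitigation time comes to 16 hours 15 minutes, consistent with the rounded "~16 hours" reported above.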
6. What Went Well
- Clear capture of Fava‑related 429 response patterns in logs.
- Quick escalation to external partner support.
- Smooth recovery once the external configuration was corrected: no data corruption, and retries resumed automatically.
7. What Went Wrong
- Missing alerting for excessive 4xx response ratios delayed detection.
- Reliance on a single IP endpoint without redundancy led to full impact when rate‑limited.
- Extended turnaround time due to third‑party dependency and overnight communication delay.
8. Lessons Learned
- Ensure visibility on non‑server error (4xx) failure patterns in external‑API monitoring.
- Reinforce incident communication SLAs with external providers to shorten resolution cycles.
- Maintain backup or alternative endpoints for critical external dependencies to reduce downtime impact.
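The last lesson could take the form of a client that falls back to an alternative route when the primary one is rate-limited. The sketch below is hypothetical: `call_with_failover`, `AllRoutesRateLimited`, and the route callables are illustrative names, and whether Fava permits access from a second IP or endpoint is an open question that would need to be agreed with them first.

```python
class AllRoutesRateLimited(Exception):
    """Raised when every configured route answered HTTP 429."""


def call_with_failover(routes, request):
    """Send `request` over the first route that is not rate-limited.

    `routes` is an ordered list of (name, send) pairs, where `send` is any
    callable returning a (status_code, body) tuple. Both are hypothetical
    stand-ins for real HTTP clients bound to different egress paths.
    """
    throttled = []
    for name, send in routes:
        status, body = send(request)
        if status == 429:
            throttled.append(name)  # remember which routes were throttled
            continue
        return name, status, body
    raise AllRoutesRateLimited(f"all routes returned 429: {throttled}")
```

Only a 429 triggers the fallback here; other errors are returned to the caller unchanged so that genuine request failures are not masked by silently switching routes.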
9. References
- Internal Grafana dashboard: “API check”
- Mr. Ali Jazayeri and Mr. Ali Peyman