📝 Postmortem Report - Fava Storage Failure
1. Summary
- Incident ID/Name: OCT-09-2025_InfraStorageFailure
- Date & Time: October 9, 2025, from 2:48 PM to 8:25 PM (IRST)
- Duration: ~5 hours 37 minutes
- Severity Level: SEV-1 – Complete outage of core APIs and dependent services
- Systems Affected: Core API layer, database-dependent services, background workers
- Impact on Users/Business: All API endpoints were unavailable during the outage. Users experienced full downtime across application services, causing transaction interruptions and delayed business operations.
2. Incident Timeline (Iran Time)
- 14:48 – System monitoring began showing API timeout spikes.
- 15:10 – All API requests reported 5xx errors. Initial triage started.
- 16:07 – CTO (Ali Jazayeri) reported that “apparently the database or an underlying infrastructure service has crashed.” Issue escalated to Fava operations team.
- 16:20 – Investigation confirmed none of the APIs were operational. Infrastructure and backend teams engaged.
- 17:30 – Root-cause isolation continued; database service reported intermittent unavailability.
- 18:45 – Fava team performed hardware/storage diagnostics.
- 20:25 – Services began to recover following storage subsystem restoration.
- 20:31 – CTO update: “Storage problem confirmed by Mr. E’rabi.” Final validation ongoing.
- 20:35 – All systems confirmed up and stable. Incident closed.
3. Root Cause Analysis
- Immediate Cause: Failure in the primary storage layer supporting the production database.
- Underlying Cause: Infrastructure storage subsystem malfunction (likely a hardware or performance fault at the Fava datacenter layer), causing database crashes and cascading API unavailability.
- Why It Wasn’t Prevented/Detected Earlier: Monitoring covered database metrics but not the underlying storage subsystem, so alerts triggered only after the database crashed rather than on early I/O degradation (see the monitoring sketch after the Five Whys list below).
- Five Whys:
- APIs failed ⇒ because database became unreachable.
- Database crashed ⇒ because underlying storage became unresponsive.
- Storage became unresponsive ⇒ because of an infrastructure-level fault.
- Fault went undetected ⇒ because there was no direct storage health monitoring.
- Early alerting was missing ⇒ because there was no integrated visibility between the storage provider and the database monitoring stack.
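The monitoring gap called out above ("Why It Wasn’t Prevented/Detected Earlier") could be closed with a direct storage health probe. The following is a minimal sketch only, assuming a hypothetical probe path on the database volume and an assumed 500 ms fsync-latency alert threshold; in practice the measurement would feed the existing Grafana/alerting stack rather than print to stdout.

```python
# Sketch of a storage-latency probe. PROBE_PATH and the threshold are
# hypothetical assumptions, not values taken from the incident.
import os
import time

PROBE_PATH = "/var/lib/db-volume/.storage_probe"  # hypothetical DB volume mount
LATENCY_THRESHOLD_S = 0.5                         # assumed early-warning threshold

def measure_fsync_latency(path: str = PROBE_PATH) -> float:
    """Write one 4 KiB block, fsync it, and return the elapsed time in seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"x" * 4096)  # small synchronous write
        os.fsync(fd)               # force the write through to the storage layer
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start

if __name__ == "__main__":
    latency = measure_fsync_latency()
    if latency > LATENCY_THRESHOLD_S:
        # In production this would raise an alert instead of printing.
        print(f"ALERT: storage fsync latency {latency:.3f}s exceeds threshold")
    else:
        print(f"OK: storage fsync latency {latency:.3f}s")
```

Run periodically (e.g., every minute), a probe like this would have surfaced I/O degradation before the database itself crashed.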
4. Impact
- User Impact: 100% of API traffic failed from 2:48 PM to 8:25 PM (~5 hours 37 minutes). All production users were affected.
- Internal Impact: On-call engineers and backend leads were engaged for the full duration; the release pipeline was paused and one deployment was delayed.
- Customer Communication: Internal teams were informed; status updates were delivered via Slack/Fava channel communications. No external status page notification was configured at the time.
5. Resolution & Recovery
- What Was Done:
- Escalated to Fava infrastructure team.
- Validated DB node stability and reattached to storage.
- Rebooted dependent application services post-restore.
- Verified full API heartbeat/responsiveness (see the verification sketch after the recovery metrics below).
- Time to Detection (TTD): 22 minutes (14:48 first timeout spikes to 15:10 confirmed outage and triage start)
- Time to Mitigation (TTM): ~3 hours
- Time to Recovery (TTR): ~5.5 hours (14:48 to 20:25, when services began to recover)
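The heartbeat verification referenced in the action list can be as simple as polling each endpoint until everything answers with HTTP 2xx. A minimal sketch follows, assuming hypothetical health-check URLs; the real internal endpoints are not listed in this report.

```python
# Sketch of an API heartbeat check. The endpoint URLs are placeholders.
import urllib.error
import urllib.request

HEALTH_ENDPOINTS = [                         # hypothetical endpoints
    "https://api.example.internal/healthz",
    "https://api.example.internal/v1/status",
]

def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # 5xx responses, connection failures, and timeouts all count as DOWN.
        return False

if __name__ == "__main__":
    results = {url: check(url) for url in HEALTH_ENDPOINTS}
    for url, ok in results.items():
        print(f"{'UP  ' if ok else 'DOWN'} {url}")
    # The incident is only considered resolved when every endpoint reports UP.
    print("ALL UP" if all(results.values()) else "DEGRADED")
```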
6. What Went Well
- Rapid cross-team coordination between internal engineering and Fava operations.
- Application recovered cleanly once storage was restored; no data loss detected.
- Postmortem process initiated promptly for documentation.
7. What Went Wrong
- Lack of proactive storage health monitoring.
- No automated failover to standby environment.
- Absence of external status communication channels.
- Initial detection relied on API errors rather than infrastructure signals.
8. Lessons Learned
- Infrastructure dependencies outside core application still represent single points of failure unless monitored end-to-end.
- Visibility into third-party managed components (e.g., storage) is crucial for early detection.
- Cross-team communication drastically improved recovery time; having clearer escalation paths can further reduce impact.
9. References
- Internal Grafana dashboards
- Mr. Ali Jazayeri & Mr. Ali Peyman