📝 Postmortem Report - Fava Storage Failure
1. Summary
- Incident ID/Name: OCT-09-2025_InfraStorageFailure
- Date & Time: October 9, 2025, from 2:48 PM to 8:25 PM (IRST)
- Duration: ~5 hours 37 minutes
- Severity Level: SEV-1 – Complete outage of core APIs and dependent services
- Systems Affected: Core API layer, database-dependent services, background workers
- Impact on Users/Business: All API endpoints were unavailable during the outage. Users experienced full downtime across application services, causing transaction interruptions and delayed business operations.
2. Incident Timeline (Iran Time)
- 14:48 – System monitoring began showing API timeout spikes.
- 15:10 – All API requests reported 5xx errors. Initial triage started.
- 16:07 – CTO (Ali Jazayeri) reported that “apparently the database or an underlying infrastructure service has crashed.” Issue escalated to Fava operations team.
- 16:20 – Investigation confirmed none of the APIs were operational. Infrastructure and backend teams engaged.
- 17:30 – Root-cause isolation continued; database service reported intermittent unavailability.
- 18:45 – Fava team performed hardware/storage diagnostics.
- 20:25 – Services began to recover following storage subsystem restoration.
- 20:31 – CTO update: “Storage problem confirmed by Mr. E’rabi.” Final validation ongoing.
- 20:35 – All systems confirmed up and stable. Incident closed.
3. Root Cause Analysis
- Immediate Cause: Failure in the primary storage layer supporting the production database.
- Underlying Cause: Infrastructure storage subsystem malfunction (likely a hardware or performance fault at the Fava datacenter layer), causing database crashes and cascading API unavailability.
- Why It Wasn’t Prevented/Detected Earlier: Monitoring covered database metrics but not the underlying storage subsystem, so alerts triggered only after the database crashed rather than on early I/O degradation (see the monitoring sketch after the Five Whys list below).
- Five Whys:
- APIs failed ⇒ because database became unreachable.
- Database crashed ⇒ because underlying storage became unresponsive.
- Storage became unresponsive ⇒ because of an infrastructure-level fault.
- Fault went undetected ⇒ because there was no direct storage health monitoring.
- Early alerting was missing ⇒ because there was no integrated visibility between the storage provider and the database monitoring stack.
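The monitoring gap called out above ("Why It Wasn’t Prevented/Detected Earlier") could be closed with a direct storage health probe. The following is a minimal sketch only, assuming a hypothetical probe path on the database volume and an assumed 500 ms fsync-latency alert threshold; in practice the measurement would feed the existing Grafana/alerting stack rather than print to stdout.

```python
# Sketch of a storage-latency probe. PROBE_PATH and the threshold are
# hypothetical assumptions, not values taken from the incident.
import os
import time

PROBE_PATH = "/var/lib/db-volume/.storage_probe"  # hypothetical DB volume mount
LATENCY_THRESHOLD_S = 0.5                         # assumed early-warning threshold

def measure_fsync_latency(path: str = PROBE_PATH) -> float:
    """Write one 4 KiB block, fsync it, and return the elapsed time in seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"x" * 4096)  # small synchronous write
        os.fsync(fd)               # force the write through to the storage layer
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start

if __name__ == "__main__":
    latency = measure_fsync_latency()
    if latency > LATENCY_THRESHOLD_S:
        # In production this would raise an alert instead of printing.
        print(f"ALERT: storage fsync latency {latency:.3f}s exceeds threshold")
    else:
        print(f"OK: storage fsync latency {latency:.3f}s")
```

Run periodically (e.g., every minute), a probe like this would have surfaced I/O degradation before the database itself crashed.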
4. Impact
- User Impact: 100% of API traffic failed from 2:48 PM to 8:25 PM (~5 hours 37 minutes). All production users were affected.
- Internal Impact: On-call engineers and backend leads were engaged for the full duration; the release pipeline was paused and one deployment was delayed.
- Customer Communication: Internal teams were informed; status updates were delivered via Slack/Fava channel communications. No external status page notification was configured at the time.
5. Resolution & Recovery
- What Was Done:
- Escalated to Fava infrastructure team.
- Validated DB node stability and reattached to storage.
- Rebooted dependent application services post-restore.
- Verified full API heartbeat/responsiveness (see the verification sketch after the recovery metrics below).
- Time to Detection (TTD): 22 minutes (14:48 first timeout spikes to 15:10 confirmed outage and triage start)
- Time to Mitigation (TTM): ~3 hours
- Time to Recovery (TTR): ~5.5 hours (14:48 to 20:25, when services began to recover)
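The heartbeat verification referenced in the action list can be as simple as polling each endpoint until everything answers with HTTP 2xx. A minimal sketch follows, assuming hypothetical health-check URLs; the real internal endpoints are not listed in this report.

```python
# Sketch of an API heartbeat check. The endpoint URLs are placeholders.
import urllib.error
import urllib.request

HEALTH_ENDPOINTS = [                         # hypothetical endpoints
    "https://api.example.internal/healthz",
    "https://api.example.internal/v1/status",
]

def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # 5xx responses, connection failures, and timeouts all count as DOWN.
        return False

if __name__ == "__main__":
    results = {url: check(url) for url in HEALTH_ENDPOINTS}
    for url, ok in results.items():
        print(f"{'UP  ' if ok else 'DOWN'} {url}")
    # The incident is only considered resolved when every endpoint reports UP.
    print("ALL UP" if all(results.values()) else "DEGRADED")
```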
6. What Went Well
- Rapid cross-team coordination between internal engineering and Fava operations.
- Application recovered cleanly once storage was restored; no data loss detected.
- Postmortem process initiated promptly for documentation.
7. What Went Wrong
- Lack of proactive storage health monitoring.
- No automated failover to standby environment.
- Absence of external status communication channels.
- Initial detection relied on API errors rather than infrastructure signals.
8. Lessons Learned
- Infrastructure dependencies outside core application still represent single points of failure unless monitored end-to-end.
- Visibility into third-party managed components (e.g., storage) is crucial for early detection.
- Cross-team communication drastically improved recovery time; having clearer escalation paths can further reduce impact.
9. References
- Internal Grafana dashboards
- Mr. Ali Jazayeri & Mr. Ali Peyman