📝 Postmortem Report – Fava Storage Failure

1. Summary

  • Incident ID/Name:
    OCT-09-2025_InfraStorageFailure
  • Date & Time:
    October 9, 2025 — from 2:48 PM to 8:25 PM (IRST)
  • Duration:
    ~5 hours 37 minutes
  • Severity Level:
    SEV‑1 – Complete outage of core APIs and dependent services
  • Systems Affected:
    Core API layer, database-dependent services, background workers
  • Impact on Users/Business:
    All API endpoints were unavailable during the outage. Users experienced full downtime across application services, causing transaction interruptions and delayed business operations.

2. Incident Timeline

(Iran Time)

  • 14:48 – System monitoring began showing API timeout spikes.
  • 15:10 – All API requests reported 5xx errors. Initial triage started.
  • 16:07 – CTO (Ali Jazayeri) reported that “apparently the database or an underlying infrastructure service has crashed.” Issue escalated to Fava operations team.
  • 16:20 – Investigation confirmed none of the APIs were operational. Infrastructure and backend teams engaged.
  • 17:30 – Root-cause isolation continued; database service reported intermittent unavailability.
  • 18:45 – Fava team performing hardware/storage diagnostics.
  • 20:25 – Services began to recover following storage subsystem restoration.
  • 20:31 – CTO update: “Storage problem confirmed by Mr. E’rabi.” Final validation ongoing.
  • 20:35 – All systems confirmed up and stable. Incident closed.

3. Root Cause Analysis

  • Immediate Cause:
    Failure in the primary storage layer supporting the production database.
  • Underlying Cause:
    Infrastructure storage subsystem malfunction (likely hardware or performance fault at Fava datacenter layer) causing database crashes and cascading API unavailability.
  • Why It Wasn’t Prevented/Detected Earlier:
    Monitoring covered database metrics but not the underlying storage subsystem. Alerts triggered post-database crash rather than on early I/O degradation.
  • Five Whys:
    1. Why did the APIs fail? The database became unreachable.
    2. Why did the database crash? The underlying storage became unresponsive.
    3. Why did storage become unresponsive? An infrastructure-level fault occurred.
    4. Why was the fault undetected? There was no direct storage health monitoring.
    5. Why was there no early alerting? There was no integrated visibility between the storage provider and the database monitoring stack.
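One way to close the monitoring gap identified above is a lightweight I/O probe on the database volume that flags latency degradation before the database itself crashes. The sketch below is hypothetical — the threshold, payload size, and probe path are assumptions, not part of the existing Fava stack:

```python
import os
import tempfile
import time

# Assumed latency budget: alert well before I/O stalls can crash the database.
WARN_THRESHOLD_MS = 50.0

def probe_storage_latency(path: str, payload_size: int = 4096) -> float:
    """Write and fsync a small file on the target volume; return latency in ms."""
    data = os.urandom(payload_size)
    start = time.perf_counter()
    fd, tmp = tempfile.mkstemp(dir=path)
    try:
        os.write(fd, data)
        os.fsync(fd)  # force the write through to the underlying storage layer
    finally:
        os.close(fd)
        os.unlink(tmp)
    return (time.perf_counter() - start) * 1000.0

def check(path: str) -> str:
    """Classify the volume as OK or DEGRADED based on one fsync round-trip."""
    latency_ms = probe_storage_latency(path)
    status = "DEGRADED" if latency_ms > WARN_THRESHOLD_MS else "OK"
    return f"{status}: fsync latency {latency_ms:.1f} ms"
```

Run periodically (e.g. from the existing monitoring agent) against the database data directory, this would surface early I/O degradation as an infrastructure signal rather than waiting for API 5xx errors.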

4. Impact

  • User Impact:
    100% of API traffic failed from 2:48 PM to 8:25 PM (~5 hours 37 minutes). All production users were affected.
  • Internal Impact:
    On‑call engineers and backend leads engaged for full duration; release pipeline paused; one deployment delayed.
  • Customer Communication:
    Internal teams informed; status updates delivered via Slack/Fava channel communications. No external status page notification configured at the time.

5. Resolution & Recovery

  • What Was Done:
    1. Escalated to the Fava infrastructure team.
    2. Validated database-node stability and reattached it to storage.
    3. Rebooted dependent application services after the restore.
    4. Verified API heartbeat and responsiveness across all endpoints.
  • Time to Detection (TTD): 22 minutes
  • Time to Mitigation (TTM): ~3 hours
  • Time to Recovery (TTR): ~5.6 hours (14:48 – 20:25)
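The TTD and TTR figures above can be reproduced directly from the Section 2 timeline; a minimal worked example:

```python
from datetime import datetime

FMT = "%H:%M"
first_symptom = datetime.strptime("14:48", FMT)  # API timeout spikes begin
detection = datetime.strptime("15:10", FMT)      # all APIs return 5xx, triage starts
recovery = datetime.strptime("20:25", FMT)       # services recover after storage restore

ttd = detection - first_symptom   # Time to Detection
ttr = recovery - first_symptom    # Time to Recovery

print(f"TTD: {ttd.seconds // 60} minutes")     # 22 minutes
print(f"TTR: {ttr.seconds / 3600:.1f} hours")  # 5.6 hours
```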

6. What Went Well

  • Rapid cross-team coordination between internal engineering and Fava operations.
  • Application recovered cleanly once storage was restored; no data loss detected.
  • Postmortem process initiated promptly for documentation.

7. What Went Wrong

  • Lack of proactive storage health monitoring.
  • No automated failover to standby environment.
  • Absence of external status communication channels.
  • Initial detection relied on API errors rather than infrastructure signals.

8. Lessons Learned

  • Infrastructure dependencies outside core application still represent single points of failure unless monitored end-to-end.
  • Visibility into third-party managed components (e.g., storage) is crucial for early detection.
  • Cross-team communication drastically improved recovery time; having clearer escalation paths can further reduce impact.

9. References

  • Internal Grafana dashboards
  • Mr. Ali Jazayeri & Mr. Ali Peyman