
📝 Postmortem Report: Data Center Outage – September 2–3, 2025

1. Executive Summary

  • Incident ID/Name: 01 – Data Center Outage
  • Date & Time: September 2, 2025, 15:20 – September 3, 12:40
  • Duration: ~21 hours
  • Severity Level: SEV1
  • Systems Affected: All core services hosted in the primary data center, including website, storage systems, databases, and APIs
  • Impact on Users/Business: Complete service downtime followed by degraded performance for several hours; critical business operations were disrupted.

2. Incident Timeline

  • 15:20 (Sep 2) – Initial detection of service unavailability. Users unable to access website and applications.
  • 15:30 – Incident escalated to DevOps team; investigation begins.
  • 16:00 – Primary storage systems became partially inaccessible.
  • 16:30 – External vendors (Iran FAWA) contacted; initial response received, but full investigation delayed.
  • Evening (Sep 2) – Coordination with city FAWA authorities and internal teams (Mr. Arabi, Mr. Ali Peyman, Mr. Ali Jazayeri) for ongoing support.
  • 04:30 (Sep 3) – Vendor confirms power restoration; services start recovering.
  • 04:30 – 12:40 (Sep 3) – Users experienced slowness and degraded performance due to residual infrastructure instability.
  • 12:40 (Sep 3) – All systems fully operational and stable; performance returned to normal.

3. Root Cause Analysis

  • Immediate Cause: Complete loss of power at the primary data center, affecting all hosted systems.
  • Underlying Cause: Lack of redundancy and failover mechanisms; no disaster recovery plan for prolonged outages.
  • Detection Gap: Monitoring focused on application-level health; critical infrastructure alerts were not configured, delaying early detection.
  • Residual Performance Issue: After power restoration, services experienced degraded performance due to delayed infrastructure stabilization and partial storage/DB recovery.

Five Whys Analysis:

  1. Why did services go down? → Power outage at the primary data center.
  2. Why was there no automatic failover? → No active disaster recovery/failover plan in place.
  3. Why wasn’t this planned? → Limited infrastructure redundancy and lack of prior risk assessment.
  4. Why weren’t warnings detected? → Monitoring focused solely on applications, not underlying infrastructure (an illustrative probe sketch follows this list).
  5. Why wasn’t infrastructure monitored? → Assumption that data center guarantees uninterrupted power supply.
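
The detection gap identified above could be narrowed by probing infrastructure reachability directly rather than relying only on application health checks. The sketch below is illustrative only: the host names, ports, and webhook URL are placeholders rather than this environment, and a production setup would more likely build on an existing monitoring stack (e.g., SNMP/IPMI power alarms or node exporters).

```python
#!/usr/bin/env python3
"""Minimal infrastructure reachability probe (illustrative sketch only).

The targets and webhook URL below are placeholders, not the actual
environment from this incident. The idea: alert on loss of basic TCP
reachability to storage/database/hypervisor endpoints, independently of
application-level health checks.
"""
import json
import socket
import urllib.request

# Hypothetical infrastructure endpoints (host, port) to probe.
TARGETS = {
    "storage-array-mgmt": ("10.0.0.10", 443),
    "primary-db-host": ("10.0.0.20", 5432),
    "hypervisor-01": ("10.0.0.30", 22),
}

ALERT_WEBHOOK = "https://alerts.example.internal/hooks/infra"  # placeholder


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def send_alert(failed: dict) -> None:
    """POST the list of unreachable targets to an alerting webhook (placeholder)."""
    payload = json.dumps({"unreachable": failed}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    failed = {
        name: f"{host}:{port}"
        for name, (host, port) in TARGETS.items()
        if not is_reachable(host, port)
    }
    if failed:
        send_alert(failed)
```

In practice such a probe would run on a schedule from outside the affected data center, so it can still report when the primary site loses power entirely.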

4. Impact

  • User Impact: All users experienced full service downtime (~100% traffic failure) for ~13 hours, followed by ~8 hours of degraded performance, for ~21 hours of total impact. Critical operations and transactions were delayed or interrupted.
  • Internal Impact: DevOps and operations teams fully engaged in triage and vendor coordination; development releases were delayed.
  • Customer Communication: Updates were shared internally; no external customer communication was issued during the unexpected outage.

5. Resolution & Recovery

  • Actions Taken:
    • Coordinated with Iran FAWA and city FAWA authorities to restore power.
    • Sequentially validated storage systems, databases, and application services (an illustrative validation sketch appears at the end of this section).
    • Monitored continuously to ensure full recovery and stabilization.

  • Time to Detection (TTD): ~10 minutes after users reported the outage
  • Time to Mitigation (TTM): ~13 hours until power was restored and services began recovering (04:30, Sep 3)
  • Time to Recovery (TTR): ~21 hours until full restoration and stabilization (12:40, Sep 3), including ~8 hours of post-restoration performance degradation
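
As referenced under Actions Taken, post-recovery validation should confirm not only that each dependency answers but also that response times are back within normal budgets, since users saw ~8 hours of residual slowness after power restoration. The sketch below is a minimal illustration: the endpoint URLs and latency budgets are placeholders, not the actual services involved in this incident.

```python
#!/usr/bin/env python3
"""Minimal post-recovery validation sketch (illustrative only).

The endpoints and latency budgets below are placeholders, not the real
services from this incident. The idea: after power restoration, declare
full recovery only once each check responds within its normal budget.
"""
import time
import urllib.error
import urllib.request

# Hypothetical service endpoints and acceptable latency budgets (seconds).
CHECKS = {
    "website": ("https://www.example.internal/healthz", 1.0),
    "api": ("https://api.example.internal/healthz", 0.5),
    "storage-gateway": ("https://storage.example.internal/healthz", 2.0),
}


def measure_latency(url: str, timeout: float = 10.0) -> float | None:
    """Return the response time in seconds, or None if the request fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return time.monotonic() - start
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    degraded = []
    for name, (url, budget) in CHECKS.items():
        latency = measure_latency(url)
        if latency is None:
            degraded.append(f"{name}: unreachable")
        elif latency > budget:
            degraded.append(f"{name}: {latency:.2f}s exceeds {budget:.2f}s budget")

    if degraded:
        print("Recovery incomplete:")
        for line in degraded:
            print("  -", line)
    else:
        print("All checks within normal latency budgets; recovery validated.")
```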


6. What Went Well

  • Prompt escalation and coordination by DevOps team.
  • Effective communication with internal and external stakeholders (Mr. Ali Peyman, Mr. Arabi, Mr. Ali Jazayeri).
  • Persistence and follow-up ensured full service restoration.

7. What Went Wrong

  • Absence of disaster recovery and failover mechanisms.
  • Monitoring gaps for critical infrastructure components.
  • Initial vendor response was slow, and incident tracking was not proactive.
  • Single data center dependency created a single point of failure.
  • Residual performance issues after power restoration not anticipated.

8. Action Items

  • [ ] Implement a disaster recovery plan with failover across multiple data centers (see the illustrative sketch after this list) (Owner: SRE Team, Due: Oct 15, 2025)
  • [ ] Deploy real-time monitoring for critical infrastructure, including power systems and storage (Owner: Infra Team, Due: --)
  • [ ] Conduct regular disaster recovery drills to ensure failover readiness (Owner: DevOps Lead, Due: --)
  • [ ] Review service SLAs and map dependencies to eliminate single points of failure (Owner: SRE Team, Due: Oct 10, 2025)
  • [ ] Improve post-recovery validation to detect residual slowness (Owner: Performance Team, Due: --)
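
As referenced in the first action item, one building block of a multi-data-center disaster recovery plan is an automated check that detects a failed primary site and redirects traffic. The sketch below is illustrative only: the health-check URL is a placeholder and switch_traffic_to_secondary() stands in for whatever DNS or load-balancer API the real plan would use; an actual failover plan and DR drill would also cover data replication, runbooks, and post-failover validation.

```python
#!/usr/bin/env python3
"""Illustrative failover check for disaster recovery drills (sketch only).

The URL and the switch_traffic_to_secondary() body are placeholders, not
the real environment from this incident; in practice the switch would call
the DNS or load-balancer provider's API.
"""
import time
import urllib.error
import urllib.request

PRIMARY_HEALTH_URL = "https://primary-dc.example.internal/healthz"  # placeholder
FAILURES_BEFORE_FAILOVER = 3   # consecutive failed probes before acting
PROBE_INTERVAL_SECONDS = 30


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the primary data center health endpoint answers 200."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def switch_traffic_to_secondary() -> None:
    """Placeholder: point DNS / load balancing at the secondary data center."""
    print("FAILOVER: routing traffic to the secondary data center")


if __name__ == "__main__":
    consecutive_failures = 0
    while True:
        if primary_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                switch_traffic_to_secondary()
                break
        time.sleep(PROBE_INTERVAL_SECONDS)
```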

9. Lessons Learned

  • Multi-data center redundancy is essential for high-availability services.
  • Infrastructure monitoring must cover underlying systems, not just applications.
  • Clear escalation and communication paths with vendors and city authorities improve response time.
  • Routine disaster recovery planning and testing are critical to operational resilience.
  • Post-recovery monitoring and validation are necessary to prevent residual performance degradation.

10. References

  • Internal communications with Mr. Ali Jazayeri
  • Sharzad Backend Team (Mr. Farhad Baghan, Mr. Mohammad Azimi)