📝 Postmortem Report: Data Center Outage – September 2–3, 2025
1. Executive Summary
- Incident ID/Name: 01 – Data Center Outage
- Date & Time: September 2, 2025, 15:20 – September 3, 12:40
- Duration: ~21 hours
- Severity Level: SEV1
- Systems Affected: All core services hosted in the primary data center, including website, storage systems, databases, and APIs
- Impact on Users/Business: Complete service downtime followed by degraded performance for several hours; critical business operations were disrupted.
2. Incident Timeline
- 15:20 (Sep 2) – Initial detection of service unavailability. Users unable to access website and applications.
- 15:30 – Incident escalated to DevOps team; investigation begins.
- 16:00 – Primary storage systems became partially inaccessible.
- 16:30 – External vendors (Iran FAWA) contacted; initial response received, but full investigation delayed.
- Evening (Sep 2) – Coordination with city FAWA authorities and internal teams (Mr. Arabi, Mr. Ali Peyman, Mr. Ali Jazayeri) for ongoing support.
- 04:30 (Sep 3) – Vendor confirms power restoration; services start recovering.
- 04:30 – 12:40 (Sep 3) – Users experienced slowness and degraded performance due to residual infrastructure instability.
- 12:40 (Sep 3) – All systems fully operational and stable; performance returned to normal.
3. Root Cause Analysis
- Immediate Cause: Complete loss of power at the primary data center, affecting all hosted systems.
- Underlying Cause: Lack of redundancy and failover mechanisms; no disaster recovery plan for prolonged outages.
- Detection Gap: Monitoring focused on application-level health; critical infrastructure alerts were not configured, delaying early detection (an illustrative infrastructure probe is sketched after the Five Whys below).
- Residual Performance Issue: After power restoration, services experienced degraded performance due to delayed infrastructure stabilization and partial storage/DB recovery.
Five Whys Analysis:
- Why did services go down? → Power outage at the primary data center.
- Why was there no automatic failover? → No active disaster recovery/failover plan in place.
- Why wasn’t this planned? → Limited infrastructure redundancy and lack of prior risk assessment.
- Why weren’t warnings detected? → Monitoring focused solely on applications, not underlying infrastructure.
- Why wasn’t infrastructure monitored? → Assumption that data center guarantees uninterrupted power supply.
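To make the detection gap concrete, the following is a minimal sketch of the kind of infrastructure-level reachability probe that was missing. The hostnames, ports, and check interval are illustrative assumptions rather than values from this incident, and a real deployment would route failures into the existing alerting/paging system instead of printing.

```python
"""Minimal sketch of an infrastructure-level reachability probe.

The endpoints below (UPS/PDU management, storage controller, database
host) are hypothetical placeholders, not the actual data center inventory.
"""
import socket
import time

# Hypothetical infrastructure endpoints inside the primary data center.
INFRA_ENDPOINTS = {
    "ups-mgmt.dc1.example.local": 443,      # power management interface
    "storage-ctrl.dc1.example.local": 22,   # storage controller
    "db-primary.dc1.example.local": 5432,   # primary database host
}
CHECK_INTERVAL_SECONDS = 60


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_probe() -> None:
    """Poll each infrastructure endpoint and emit an alert line on failure."""
    while True:
        for host, port in INFRA_ENDPOINTS.items():
            if not is_reachable(host, port):
                # In production this would page on-call rather than print;
                # printing keeps the sketch self-contained.
                print(f"ALERT: {host}:{port} unreachable - possible power/infrastructure failure")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    run_probe()
```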
4. Impact
- User Impact: All users experienced full service downtime (~100% traffic failure) for ~13 hours, followed by ~8 hours of degraded performance, for roughly 21 hours of total impact. Critical operations and transactions were delayed or interrupted.
- Internal Impact: DevOps and operations teams fully engaged in triage and vendor coordination; development releases were delayed.
- Customer Communication: Updates were shared internally; no external communication was issued, given the unexpected nature of the outage.
5. Resolution & Recovery
- Actions Taken:
  - Coordinated with Iran FAWA and city FAWA authorities to restore power.
  - Sequential validation of storage systems, databases, and application services (see the sketch after this section).
  - Continuous monitoring to ensure full recovery and stabilization.
- Time to Detection (TTD): ~10 minutes after users reported the outage
- Time to Mitigation (TTM): ~13 hours until power was restored and services began recovering
- Time to Recovery (TTR): ~21 hours until all systems were fully operational, including ~8 hours of performance stabilization after power restoration
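As a companion to the recovery actions above, here is a minimal sketch of the sequential validation order (storage first, then databases, then application endpoints). The mount point, hostnames, and URLs are hypothetical placeholders; an actual recovery runbook would substitute the real inventory.

```python
"""Minimal sketch of sequential recovery validation: storage -> database -> application.

All paths, hosts, and URLs below are illustrative assumptions.
"""
import os
import socket
import urllib.request


def storage_ok(mount_point: str = "/mnt/primary-storage") -> bool:
    """Storage is considered healthy when the volume is mounted and readable."""
    return os.path.ismount(mount_point) and os.access(mount_point, os.R_OK)


def database_ok(host: str = "db-primary.dc1.example.local", port: int = 5432) -> bool:
    """Database is considered reachable when its TCP port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False


def application_ok(url: str = "https://www.example.com/health") -> bool:
    """Application is considered healthy when its health endpoint returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def validate_recovery() -> bool:
    """Run the checks in dependency order and stop at the first failing layer."""
    checks = (("storage", storage_ok), ("database", database_ok), ("application", application_ok))
    for name, check in checks:
        if not check():
            print(f"Recovery validation halted: {name} layer is not healthy yet")
            return False
        print(f"{name} layer validated")
    return True


if __name__ == "__main__":
    validate_recovery()
```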
6. What Went Well
- Prompt escalation and coordination by DevOps team.
- Effective communication with internal and external stakeholders (Mr. Ali Peyman, Mr. Arabi, Mr. Ali Jazayeri).
- Persistence and follow-up ensured full service restoration.
7. What Went Wrong
- Absence of disaster recovery and failover mechanisms.
- Monitoring gaps for critical infrastructure components.
- Initial vendor response was slow, and vendor-side incident tracking was not proactive.
- Single data center dependency created a single point of failure.
- Residual performance issues after power restoration not anticipated.
8. Action Items
- [ ] Implement disaster recovery plan across multiple data centers (Owner: SRE Team, Due: Oct 15, 2025)
- [ ] Deploy real-time monitoring for critical infrastructure, including power systems and storage (Owner: Infra Team, Due: --)
- [ ] Conduct regular disaster recovery drills to ensure failover readiness (Owner: DevOps Lead, Due: --)
- [ ] Review service SLAs and map dependencies to eliminate single points of failure (Owner: SRE Team, Due: Oct 10, 2025)
- [ ] Improve post-recovery validation to detect residual slowness, as sketched below (Owner: Performance Team, Due: --)
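For the post-recovery validation action item, a minimal sketch of a latency check against a fixed budget follows. The endpoints, sample count, and latency budget are assumptions for illustration and would need to be tuned against real pre-incident baselines.

```python
"""Minimal sketch of a post-recovery residual-slowness check.

Endpoints and the latency budget are illustrative assumptions, not
measured values from this incident.
"""
import time
import urllib.request

# Hypothetical representative endpoints and a per-endpoint latency budget.
ENDPOINTS = [
    "https://www.example.com/health",
    "https://api.example.com/v1/status",
]
LATENCY_BUDGET_SECONDS = 1.5
SAMPLES = 5


def median_latency(url: str) -> float:
    """Measure the median response time of a URL over several samples."""
    samples = []
    for _ in range(SAMPLES):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10):
            pass
        samples.append(time.monotonic() - start)
    samples.sort()
    return samples[len(samples) // 2]


def check_residual_slowness() -> bool:
    """Flag any endpoint whose median latency exceeds the budget."""
    healthy = True
    for url in ENDPOINTS:
        latency = median_latency(url)
        if latency > LATENCY_BUDGET_SECONDS:
            print(f"DEGRADED: {url} median latency {latency:.2f}s exceeds {LATENCY_BUDGET_SECONDS}s budget")
            healthy = False
        else:
            print(f"OK: {url} median latency {latency:.2f}s")
    return healthy


if __name__ == "__main__":
    check_residual_slowness()
```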
9. Lessons Learned
- Multi-data center redundancy is essential for high-availability services.
- Infrastructure monitoring must cover underlying systems, not just applications.
- Clear escalation and communication paths with vendors and city authorities improve response time.
- Routine disaster recovery planning and testing are critical to operational resilience.
- Post-recovery monitoring and validation are necessary to prevent residual performance degradation.
10. References
- Internal communications with Mr. Ali Jazayeri
- Sharzad Backend Team (Mr. Farhad Baghan, Mr. Mohammad Azimi)