📝 Postmortem Report: Data Center Outage – September 2–3, 2025
1. Executive Summary
- Incident ID/Name: 01 – Data Center Outage
- Date & Time: September 2, 2025, 15:20 – September 3, 12:40
- Duration: ~21 hours
- Severity Level: SEV1
- Systems Affected: All core services hosted in the primary data center, including website, storage systems, databases, and APIs
- Impact on Users/Business: Complete service downtime followed by degraded performance for several hours; critical business operations were disrupted.
2. Incident Timeline
- 15:20 (Sep 2) – Initial detection of service unavailability. Users unable to access website and applications.
- 15:30 – Incident escalated to DevOps team; investigation begins.
- 16:00 – Primary storage systems became partially inaccessible.
- 16:30 – External vendors (Iran FAWA) contacted; initial response received, but full investigation delayed.
- Evening (Sep 2) – Coordination with city FAWA authorities and internal teams (Mr. Arabi, Mr. Ali Peyman, Mr. Ali Jazayeri) for ongoing support.
- 04:30 (Sep 3) – Vendor confirms power restoration; services start recovering.
- 04:30 – 12:40 (Sep 3) – Users experienced slowness and degraded performance due to residual infrastructure instability.
- 12:40 (Sep 3) – All systems fully operational and stable; performance returned to normal.
3. Root Cause Analysis
- Immediate Cause: Complete loss of power at the primary data center, affecting all hosted systems.
- Underlying Cause: Lack of redundancy and failover mechanisms; no disaster recovery plan for prolonged outages.
- Detection Gap: Monitoring focused on application-level health; critical infrastructure alerts were not configured, delaying early detection (an illustrative infrastructure probe is sketched after the Five Whys below).
- Residual Performance Issue: After power restoration, services experienced degraded performance due to delayed infrastructure stabilization and partial storage/DB recovery.
Five Whys Analysis:
- Why did services go down? → Power outage at the primary data center.
- Why was there no automatic failover? → No active disaster recovery/failover plan in place.
- Why wasn’t this planned? → Limited infrastructure redundancy and lack of prior risk assessment.
- Why weren’t warnings detected? → Monitoring focused solely on applications, not underlying infrastructure.
- Why wasn’t infrastructure monitored? → Assumption that data center guarantees uninterrupted power supply.
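To make the detection gap concrete, the following is a minimal sketch of the kind of infrastructure-level reachability probe that was missing. The hostnames, ports, and check interval are illustrative assumptions rather than values from this incident, and a real deployment would route failures into the existing alerting/paging system instead of printing.

```python
"""Minimal sketch of an infrastructure-level reachability probe.

The endpoints below (UPS/PDU management, storage controller, database
host) are hypothetical placeholders, not the actual data center inventory.
"""
import socket
import time

# Hypothetical infrastructure endpoints inside the primary data center.
INFRA_ENDPOINTS = {
    "ups-mgmt.dc1.example.local": 443,      # power management interface
    "storage-ctrl.dc1.example.local": 22,   # storage controller
    "db-primary.dc1.example.local": 5432,   # primary database host
}
CHECK_INTERVAL_SECONDS = 60


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_probe() -> None:
    """Poll each infrastructure endpoint and emit an alert line on failure."""
    while True:
        for host, port in INFRA_ENDPOINTS.items():
            if not is_reachable(host, port):
                # In production this would page on-call rather than print;
                # printing keeps the sketch self-contained.
                print(f"ALERT: {host}:{port} unreachable - possible power/infrastructure failure")
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    run_probe()
```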
4. Impact
- User Impact: All users experienced full service downtime (~100% traffic failure) for ~13 hours, followed by ~8 hours of degraded performance, for roughly 21 hours of total impact. Critical operations and transactions were delayed or interrupted.
- Internal Impact: DevOps and operations teams fully engaged in triage and vendor coordination; development releases were delayed.
- Customer Communication: Updates were shared internally; no external communication was issued, given the unexpected nature of the outage.
5. Resolution & Recovery
- Actions Taken:
  - Coordinated with Iran FAWA and city FAWA authorities to restore power.
  - Sequential validation of storage systems, databases, and application services (see the sketch after this section).
  - Continuous monitoring to ensure full recovery and stabilization.
- Time to Detection (TTD): ~10 minutes after users reported the outage
- Time to Mitigation (TTM): ~13 hours until power was restored and services began recovering
- Time to Recovery (TTR): ~21 hours until all systems were fully operational, including ~8 hours of performance stabilization after power restoration
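As a companion to the recovery actions above, here is a minimal sketch of the sequential validation order (storage first, then databases, then application endpoints). The mount point, hostnames, and URLs are hypothetical placeholders; an actual recovery runbook would substitute the real inventory.

```python
"""Minimal sketch of sequential recovery validation: storage -> database -> application.

All paths, hosts, and URLs below are illustrative assumptions.
"""
import os
import socket
import urllib.request


def storage_ok(mount_point: str = "/mnt/primary-storage") -> bool:
    """Storage is considered healthy when the volume is mounted and readable."""
    return os.path.ismount(mount_point) and os.access(mount_point, os.R_OK)


def database_ok(host: str = "db-primary.dc1.example.local", port: int = 5432) -> bool:
    """Database is considered reachable when its TCP port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False


def application_ok(url: str = "https://www.example.com/health") -> bool:
    """Application is considered healthy when its health endpoint returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def validate_recovery() -> bool:
    """Run the checks in dependency order and stop at the first failing layer."""
    checks = (("storage", storage_ok), ("database", database_ok), ("application", application_ok))
    for name, check in checks:
        if not check():
            print(f"Recovery validation halted: {name} layer is not healthy yet")
            return False
        print(f"{name} layer validated")
    return True


if __name__ == "__main__":
    validate_recovery()
```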
6. What Went Well
- Prompt escalation and coordination by DevOps team.
- Effective communication with internal and external stakeholders (Mr. Ali Peyman, Mr. Arabi, Mr. Ali Jazayeri).
- Persistence and follow-up ensured full service restoration.
7. What Went Wrong
- Absence of disaster recovery and failover mechanisms.
- Monitoring gaps for critical infrastructure components.
- Initial vendor response was slow, and vendor-side incident tracking was not proactive.
- Single data center dependency created a single point of failure.
- Residual performance issues after power restoration not anticipated.
8. Action Items
- [ ] Implement disaster recovery plan across multiple data centers (Owner: SRE Team, Due: Oct 15, 2025)
- [ ] Deploy real-time monitoring for critical infrastructure, including power systems and storage (Owner: Infra Team, Due: --)
- [ ] Conduct regular disaster recovery drills to ensure failover readiness (Owner: DevOps Lead, Due: --)
- [ ] Review service SLAs and map dependencies to eliminate single points of failure (Owner: SRE Team, Due: Oct 10, 2025)
- [ ] Improve post-recovery validation to detect residual slowness, as sketched below (Owner: Performance Team, Due: --)
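For the post-recovery validation action item, a minimal sketch of a latency check against a fixed budget follows. The endpoints, sample count, and latency budget are assumptions for illustration and would need to be tuned against real pre-incident baselines.

```python
"""Minimal sketch of a post-recovery residual-slowness check.

Endpoints and the latency budget are illustrative assumptions, not
measured values from this incident.
"""
import time
import urllib.request

# Hypothetical representative endpoints and a per-endpoint latency budget.
ENDPOINTS = [
    "https://www.example.com/health",
    "https://api.example.com/v1/status",
]
LATENCY_BUDGET_SECONDS = 1.5
SAMPLES = 5


def median_latency(url: str) -> float:
    """Measure the median response time of a URL over several samples."""
    samples = []
    for _ in range(SAMPLES):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10):
            pass
        samples.append(time.monotonic() - start)
    samples.sort()
    return samples[len(samples) // 2]


def check_residual_slowness() -> bool:
    """Flag any endpoint whose median latency exceeds the budget."""
    healthy = True
    for url in ENDPOINTS:
        latency = median_latency(url)
        if latency > LATENCY_BUDGET_SECONDS:
            print(f"DEGRADED: {url} median latency {latency:.2f}s exceeds {LATENCY_BUDGET_SECONDS}s budget")
            healthy = False
        else:
            print(f"OK: {url} median latency {latency:.2f}s")
    return healthy


if __name__ == "__main__":
    check_residual_slowness()
```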
9. Lessons Learned
- Multi-data center redundancy is essential for high-availability services.
- Infrastructure monitoring must cover underlying systems, not just applications.
- Clear escalation and communication paths with vendors and city authorities improve response time.
- Routine disaster recovery planning and testing are critical to operational resilience.
- Post-recovery monitoring and validation are necessary to prevent residual performance degradation.
10. References
- Internal communications with Mr. Ali Jazayeri
- Sharzad Backend Team (Mr. Farhad Baghan, Mr. Mohammad Azimi)