📝 Postmortem Report Template (USE THIS AS TEMPLATE)

1. Summary

Incident ID/Name:
Date & Time:
Duration:
Severity Level (SEV1-SEV5):
Systems Affected:
Impact on Users/Business: (e.g., 25% of users couldn’t log in for 2 hours)

2. Incident Timeline

(Chronological log of events, Iran time)
- 10:03 – Alert fired for high error rates
- 10:05 – Engineer acknowledged alert
- 10:15 – Root cause suspected (DB latency)
- 10:45 – Mitigation applied
- 11:10 – Service fully restored

3. Root Cause Analysis

Immediate Cause: (e.g., DB connection pool exhaustion)
Underlying Cause: (e.g., config misalignment between staging & prod)
Why It Wasn’t Prevented/Detected Earlier:
Five Whys (if applicable):

4. Impact

User Impact: (# users affected, % traffic failed, lost revenue estimate)
Internal Impact: (on-call load, delayed releases, partner escalation)
Customer Communication: (status page, social, support responses)

5. Resolution & Recovery

What was done to restore service:
Time to Detection (TTD):
Time to Mitigation (TTM):
Time to Recovery (TTR):

6. What Went Well

(e.g., Alerts fired correctly, fast cross-team collaboration, rollback worked)

7. What Went Wrong

(e.g., Monitoring blind spot, lack of runbook, noisy alerts slowed triage)

8. Action Items

(Each action must have an owner + due date)
- [ ] Add DB connection monitoring (Owner: SRE1, Due: Sept 15)
- [ ] Automate rollback for service X (Owner: Eng2, Due: Sept 30)
- [ ] Run load test simulating 10x peak traffic (Owner: PerfTeam, Due: Oct 10)

9. Lessons Learned

Key takeaways to prevent recurrence.
Cultural/organizational learnings.

10. References

Slack/Teams/IRC incident channel link
Jira tickets / GitHub issues
Status page incident report
Related monitoring dashboards

✅ 3. Postmortem Checklist

Pre-Writing

[ ] Collect logs, dashboards, and monitoring data.
[ ] Confirm exact start/end times of incident.
[ ] Gather input from all responders.

Writing

[ ] Summary: Clear & non-technical for execs.
[ ] Timeline: Detailed, minute-by-minute.
[ ] Root Cause: Explain “why,” not just “what.”
[ ] Impact: Quantify in users, revenue, SLAs.
[ ] Resolution: Show detection → fix journey.
[ ] What went well: Highlight positives.
[ ] What went wrong: Honest, blameless.
[ ] Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).

Review

[ ] Share with all stakeholders (SRE, Eng, PM, Support).
[ ] Get leadership review for critical SEVs.
[ ] Publish internally (and externally if customer-facing).
[ ] Track action items until completion.