📝 Postmortem Report Template (USE THIS AS TEMPLATE)
1. Summary
- Incident ID/Name:
- Date & Time:
- Duration:
- Severity Level (SEV1-SEV5):
- Systems Affected:
- Impact on Users/Business: (e.g., 25% of users couldn’t log in for 2 hours)
2. Incident Timeline
(Chronological log of events, Iran time)
- 10:03 – Alert fired for high error rates
- 10:05 – Engineer acknowledged alert
- 10:15 – Root cause suspected (DB latency)
- 10:45 – Mitigation applied
- 11:10 – Service fully restored
3. Root Cause Analysis
- Immediate Cause: (e.g., DB connection pool exhaustion)
- Underlying Cause: (e.g., config misalignment between staging & prod)
- Why It Wasn’t Prevented/Detected Earlier:
- Five Whys (if applicable):
4. Impact
- User Impact: (# users affected, % traffic failed, lost revenue estimate)
- Internal Impact: (on-call load, delayed releases, partner escalation)
- Customer Communication: (status page, social, support responses)
5. Resolution & Recovery
- What was done to restore service:
- Time to Detection (TTD):
- Time to Mitigation (TTM):
- Time to Recovery (TTR):
6. What Went Well
- (e.g., Alerts fired correctly, fast cross-team collaboration, rollback worked)
7. What Went Wrong
- (e.g., Monitoring blind spot, lack of runbook, noisy alerts slowed triage)
8. Action Items
(Each action must have an owner + due date)
- [ ] Add DB connection monitoring (Owner: SRE1, Due: Sept 15)
- [ ] Automate rollback for service X (Owner: Eng2, Due: Sept 30)
- [ ] Run load test simulating 10x peak traffic (Owner: PerfTeam, Due: Oct 10)
9. Lessons Learned
- Key takeaways to prevent recurrence.
- Cultural/organizational learnings.
10. References
- Slack/Teams/IRC incident channel link
- Jira tickets / GitHub issues
- Status page incident report
- Related monitoring dashboards
✅ 3. Postmortem Checklist
Pre-Writing
- [ ] Collect logs, dashboards, and monitoring data.
- [ ] Confirm exact start/end times of incident.
- [ ] Gather input from all responders.
Writing
- [ ] Summary: Clear & non-technical for execs.
- [ ] Timeline: Detailed, minute-by-minute.
- [ ] Root Cause: Explain “why,” not just “what.”
- [ ] Impact: Quantify in users, revenue, SLAs.
- [ ] Resolution: Show detection → fix journey.
- [ ] What went well: Highlight positives.
- [ ] What went wrong: Honest, blameless.
- [ ] Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).
Review
- [ ] Share with all stakeholders (SRE, Eng, PM, Support).
- [ ] Get leadership review for critical SEVs.
- [ ] Publish internally (and externally if customer-facing).
- [ ] Track action items until completion.