Skip to content

📝 Postmortem Report Template (USE THIS AS TEMPLATE)

1. Summary

  • Incident ID/Name:
  • Date & Time:
  • Duration:
  • Severity Level (SEV1-SEV5):
  • Systems Affected:
  • Impact on Users/Business: (e.g., 25% of users couldn’t log in for 2 hours)

2. Incident Timeline

(Chronological log of events, Iran time)
- 10:03 – Alert fired for high error rates
- 10:05 – Engineer acknowledged alert
- 10:15 – Root cause suspected (DB latency)
- 10:45 – Mitigation applied
- 11:10 – Service fully restored


3. Root Cause Analysis

  • Immediate Cause: (e.g., DB connection pool exhaustion)
  • Underlying Cause: (e.g., config misalignment between staging & prod)
  • Why It Wasn’t Prevented/Detected Earlier:
  • Five Whys (if applicable):

4. Impact

  • User Impact: (# users affected, % traffic failed, lost revenue estimate)
  • Internal Impact: (on-call load, delayed releases, partner escalation)
  • Customer Communication: (status page, social, support responses)

5. Resolution & Recovery

  • What was done to restore service:
  • Time to Detection (TTD):
  • Time to Mitigation (TTM):
  • Time to Recovery (TTR):

6. What Went Well

  • (e.g., Alerts fired correctly, fast cross-team collaboration, rollback worked)

7. What Went Wrong

  • (e.g., Monitoring blind spot, lack of runbook, noisy alerts slowed triage)

8. Action Items

(Each action must have an owner + due date)
- [ ] Add DB connection monitoring (Owner: SRE1, Due: Sept 15)
- [ ] Automate rollback for service X (Owner: Eng2, Due: Sept 30)
- [ ] Run load test simulating 10x peak traffic (Owner: PerfTeam, Due: Oct 10)


9. Lessons Learned

  • Key takeaways to prevent recurrence.
  • Cultural/organizational learnings.

10. References

  • Slack/Teams/IRC incident channel link
  • Jira tickets / GitHub issues
  • Status page incident report
  • Related monitoring dashboards

✅ 3. Postmortem Checklist

Pre-Writing

  • [ ] Collect logs, dashboards, and monitoring data.
  • [ ] Confirm exact start/end times of incident.
  • [ ] Gather input from all responders.

Writing

  • [ ] Summary: Clear & non-technical for execs.
  • [ ] Timeline: Detailed, minute-by-minute.
  • [ ] Root Cause: Explain “why,” not just “what.”
  • [ ] Impact: Quantify in users, revenue, SLAs.
  • [ ] Resolution: Show detection → fix journey.
  • [ ] What went well: Highlight positives.
  • [ ] What went wrong: Honest, blameless.
  • [ ] Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).

Review

  • [ ] Share with all stakeholders (SRE, Eng, PM, Support).
  • [ ] Get leadership review for critical SEVs.
  • [ ] Publish internally (and externally if customer-facing).
  • [ ] Track action items until completion.