📝 Postmortem Report - SSO Login Failure
1. Summary
- Incident ID/Name: SSO-Login-Failure-2025-09-13
- Date & Time: 13 September 2025, 08:00 – 09:30 (Iran time)
- Duration: ~1 hour 30 minutes
- Severity Level: SEV2 (critical user-facing login issue)
- Systems Affected: SSO service, login across applications
- Impact on Users/Business: Users were unable to log in to the application during the incident. Exact user count unknown, but all login attempts via SSO failed for ~90 minutes.
2. Incident Timeline
(Iran time)
- 08:00 – User report received (A’rabi called Ali to report that login was not working in the app).
- 08:05 – On-call acknowledged and verified login failures.
- 08:10 – Initial assumption: no changes had been deployed, so an external/system issue was suspected.
- 08:30 – Continued user impact confirmed, issue escalated.
- 09:00 – Follow-up with Ali Jazayeri revealed one of the SSO servers was down.
- 09:15 – Server restored, login requests began succeeding again.
- 09:30 – Incident confirmed resolved, monitoring green.
3. Root Cause Analysis
- Immediate Cause: One of the SSO servers was malfunctioning.
- Underlying Cause: The SSO cluster lacked effective redundancy/failover handling, so a single unhealthy server could block login attempts.
- Why It Wasn’t Prevented/Detected Earlier: No automated health checks or alerting for individual SSO nodes; incident was detected via user reports rather than monitoring.
- Five Whys:
  - Why couldn’t users log in? → Requests were routed to a failed SSO server.
  - Why was the server failing? → The server was in a bad state (unresponsive).
  - Why wasn’t traffic rerouted? → The load balancer did not detect the failed node.
  - Why didn’t monitoring detect it? → No per-node health check was configured (see the health-check sketch after this list).
  - Why wasn’t that in place? → Redundancy was assumed to be sufficient, and node-failure scenarios had never been tested.
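To make the missing safeguard concrete, below is a minimal sketch of a per-node health probe. It assumes each SSO node exposes a hypothetical HTTP health endpoint (`/healthz` here) and that the node list is known in advance; the hostnames, port, timeout, and endpoint name are placeholders, not the actual production configuration.

```python
"""Minimal per-node SSO health probe (illustrative sketch only).

Assumes each SSO node exposes a hypothetical /healthz endpoint;
hostnames, port, and timeout below are placeholders.
"""
import urllib.request
import urllib.error

# Hypothetical SSO node addresses -- replace with the real inventory.
SSO_NODES = [
    "http://sso-node-1.internal:8080/healthz",
    "http://sso-node-2.internal:8080/healthz",
]

TIMEOUT_SECONDS = 3  # a node slower than this is treated as unhealthy


def check_node(url: str) -> bool:
    """Return True if the node answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout -> unhealthy.
        return False


def main() -> None:
    unhealthy = [url for url in SSO_NODES if not check_node(url)]
    if unhealthy:
        # In production this would page on-call and/or drain the node from
        # the load balancer; here it only prints the failing nodes.
        print("UNHEALTHY SSO nodes:", ", ".join(unhealthy))
    else:
        print("All SSO nodes healthy.")


if __name__ == "__main__":
    main()
```

Run periodically from a monitoring agent, or wired into the load balancer's own health-check mechanism so unhealthy nodes are drained automatically, a probe like this would have surfaced the failed node within minutes instead of waiting for a user call.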
4. Impact
- User Impact: All login attempts via SSO failed for ~1.5 hours. Number of affected users undetermined, but incident was business-critical.
- Internal Impact: Multiple escalations, delayed work for affected teams. On-call engineers (A’rabi, Ali) were engaged for ~90 minutes.
- Customer Communication: Users reported login issues directly; no official status page update or proactive communication was sent during the incident.
5. Resolution & Recovery
- What was done to restore service: Problematic SSO server was identified and fixed. Traffic resumed successfully.
- Time to Detection (TTD): ~5 minutes (via user call).
- Time to Mitigation (TTM): ~60 minutes (confirmation of server issue).
- Time to Recovery (TTR): ~90 minutes total.
6. What Went Well
- Quick user reporting enabled faster detection.
- Good collaboration between A’rabi and Ali.
- The root cause was identified and fixed without the need for a rollback or major intervention.
7. What Went Wrong
- Monitoring blind spot: No health check or alerting for SSO server failures.
- Detection depended on manual user reports rather than observability.
- Escalation path was unclear and caused a ~1-hour delay before the fix.
- Lack of failover handling in SSO cluster design.
8. Action Items
- [ ] Add per-node health checks for all SSO servers (Owner: SRE Team, Due: Sept 30)
- [ ] Add alerting for failed login spikes (Owner: Monitoring Team, Due: Oct 5); see the alert sketch after this list.
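To make the second action item concrete, the sketch below shows one way a failed-login spike check could work. It assumes per-window counts of SSO login attempts and failures are already available (for example from access logs or a metrics store); the thresholds and the `send_alert` hook are placeholders for whatever the Monitoring Team adopts.

```python
"""Failed-login spike check (illustrative sketch only).

Assumes per-window counts of SSO login attempts and failures are
available; thresholds and the alert hook are placeholders.
"""
from dataclasses import dataclass


@dataclass
class LoginWindow:
    """Login stats for one time window (e.g. the last 5 minutes)."""
    attempts: int
    failures: int


# Placeholder thresholds -- tune against real traffic before use.
MIN_ATTEMPTS = 20             # ignore windows with too little traffic
FAILURE_RATE_THRESHOLD = 0.5  # alert if more than 50% of logins fail


def should_alert(window: LoginWindow) -> bool:
    """Return True when the failure rate in the window looks like an outage."""
    if window.attempts < MIN_ATTEMPTS:
        return False  # not enough data to judge
    return (window.failures / window.attempts) > FAILURE_RATE_THRESHOLD


def send_alert(window: LoginWindow) -> None:
    """Placeholder alert hook; the real version would page on-call."""
    rate = window.failures / window.attempts
    print(f"ALERT: SSO login failure rate {rate:.0%} "
          f"({window.failures}/{window.attempts} in window)")


if __name__ == "__main__":
    # Example: 180 of 200 logins failing in the window triggers the alert.
    current = LoginWindow(attempts=200, failures=180)
    if should_alert(current):
        send_alert(current)
```

During this incident the SSO failure rate was effectively 100% for roughly 90 minutes, so even a coarse threshold like the one above would have paged on-call long before the 09:00 confirmation.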
9. Lessons Learned
- Redundancy is ineffective without proper health checks and automated failover.
- Relying solely on user reports delays detection and increases impact.
- Escalation procedures must be clear to reduce time-to-fix.
- Future: build resilience by designing the SSO cluster for fault tolerance.
10. References
- Call logs with Ali Jazayeri
- Incident notes (A’rabi’s call reports)
- Monitoring dashboard (SSO latency/error rates – gaps identified)