
📝 Postmortem Report - SSO Login Failure

1. Summary

  • Incident ID/Name: SSO-Login-Failure-2025-09-13
  • Date & Time: 13 September 2025, 08:00 – 09:30 (Iran time)
  • Duration: ~1 hour 30 minutes
  • Severity Level: SEV2 (critical user-facing login issue)
  • Systems Affected: SSO service, login across applications
  • Impact on Users/Business: Users were unable to log in to the application during the incident. The exact number of affected users is unknown, but all SSO login attempts failed for ~90 minutes.

2. Incident Timeline

(Iran time)

  • 08:00 – User report received (A’rabi called Ali to report that login was not working in the app).
  • 08:05 – On-call acknowledged and verified login failures.
  • 08:10 – Initial assessment: no recent changes had been deployed, so an external or system issue was suspected.
  • 08:30 – Continued user impact confirmed, issue escalated.
  • 09:00 – Follow-up with Ali Jazayeri revealed that one of the SSO servers was down.
  • 09:15 – Server restored, login requests began succeeding again.
  • 09:30 – Incident confirmed resolved, monitoring green.

3. Root Cause Analysis

  • Immediate Cause: One of the SSO servers was malfunctioning.
  • Underlying Cause: Lack of redundancy and failover handling in the SSO cluster meant a single unhealthy server could block login attempts.
  • Why It Wasn’t Prevented/Detected Earlier: No automated health checks or alerting for individual SSO nodes; incident was detected via user reports rather than monitoring.
  • Five Whys:
      • Why couldn’t users log in? → Requests were routed to a failed SSO server.
      • Why was the server failing? → The server was in a bad, unresponsive state.
      • Why wasn’t traffic rerouted? → The load balancer didn’t detect the failed node.
      • Why didn’t monitoring detect it? → No per-node health check was configured (see the sketch after this list).
      • Why wasn’t that in place? → Redundancy was assumed to be sufficient, and the node-failure scenario had never been tested.
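
The missing control above is essentially a per-node health probe. Below is a minimal sketch of what such a check could look like; the HTTP /healthz endpoint and the hostnames are illustrative assumptions and are not confirmed by the incident notes.

```python
# Minimal sketch of a per-node SSO health probe (assumptions: an HTTP
# /healthz endpoint and these hostnames; neither is confirmed by the
# incident notes).
import urllib.request

SSO_NODES = [
    "https://sso-1.internal.example.com",
    "https://sso-2.internal.example.com",
]

def probe(node: str, timeout: float = 3.0) -> bool:
    """Return True if the node answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{node}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

if __name__ == "__main__":
    for node in SSO_NODES:
        print(f"{node}: {'healthy' if probe(node) else 'UNHEALTHY'}")
```

A probe like this could be run by the load balancer or by the monitoring system; either way it turns a silent node failure into an observable signal.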

4. Impact

  • User Impact: All SSO login attempts failed for ~1.5 hours. The number of affected users is undetermined, but the incident was business-critical.
  • Internal Impact: Multiple escalations and delayed work for affected teams. On-call engineers (A’rabi, Ali) were engaged for ~90 minutes.
  • Customer Communication: Users reported login issues directly; no official status page update or proactive communication was provided during the incident.

5. Resolution & Recovery

  • What was done to restore service: The problematic SSO server was identified and restored, after which login traffic resumed successfully.
  • Time to Detection (TTD): ~5 minutes (via user call).
  • Time to Mitigation (TTM): ~60 minutes (confirmation of server issue).
  • Time to Recovery (TTR): ~90 minutes total.

6. What Went Well

  • Quick user reporting enabled faster detection.
  • Good collaboration between A’rabi and Ali.
  • The root cause was identified and fixed without the need for a rollback or major intervention.

7. What Went Wrong

  • Monitoring blind spot: No health check or alerting for SSO server failures.
  • Detection depended on manual user reports rather than observability.
  • The escalation path was unclear, adding roughly an hour of delay before the fix.
  • Lack of failover handling in the SSO cluster design (see the routing sketch after this list).
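
To illustrate the failover gap, here is a minimal routing sketch. It is an assumption about one possible approach, not the actual load balancer configuration: each login request goes to a node that currently passes its health probe, so a single bad server no longer blocks all logins.

```python
# Minimal failover sketch (illustrative assumption, not the real load
# balancer): route each login request to a randomly chosen SSO node that
# currently passes its health probe.
import random
from typing import Callable, Optional, Sequence

def pick_healthy_node(
    nodes: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return a healthy node chosen at random, or None if every probe fails."""
    healthy = [node for node in nodes if probe(node)]
    return random.choice(healthy) if healthy else None

if __name__ == "__main__":
    # Fake probe results stand in for real per-node health checks.
    status = {"sso-1": False, "sso-2": True}   # sso-1 simulates the failed server
    chosen = pick_healthy_node(list(status), lambda node: status[node])
    print(f"routing login traffic to: {chosen}")   # -> sso-2
```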

8. Action Items

  • [ ] Add per-node health checks for all SSO servers (Owner: SRE Team, Due: Sept 30)
  • [ ] Add alerting for failed login spikes, as sketched below (Owner: Monitoring Team, Due: Oct 5)
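
For the second action item, a minimal sketch of what a failed-login spike check could look like. The window size, spike factor, and minimum-failure threshold are illustrative assumptions rather than agreed values, and the real alert would live in whatever monitoring stack the team adopts.

```python
# Minimal sketch of a failed-login spike alert (thresholds are illustrative
# assumptions, not agreed values).
from collections import deque

class LoginFailureSpikeDetector:
    def __init__(self, window: int = 12, spike_factor: float = 3.0, min_failures: int = 20):
        self.history = deque(maxlen=window)   # failure counts per interval (e.g. per minute)
        self.spike_factor = spike_factor
        self.min_failures = min_failures

    def observe(self, failures: int) -> bool:
        """Record one interval's failure count; return True if it looks like a spike."""
        baseline = sum(self.history) / len(self.history) if self.history else 0.0
        self.history.append(failures)
        return failures >= self.min_failures and failures > self.spike_factor * max(baseline, 1.0)

if __name__ == "__main__":
    detector = LoginFailureSpikeDetector()
    for count in [2, 3, 1, 4, 2, 55]:   # the last interval simulates the outage
        if detector.observe(count):
            print(f"ALERT: failed-login spike ({count} failures in the last interval)")
```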

9. Lessons Learned

  • Redundancy is ineffective without proper health checks and automated failover.
  • Relying solely on user reports delays detection and increases impact.
  • Escalation procedures must be clear to reduce time-to-fix.
  • Going forward, build resilience by designing the SSO cluster for fault tolerance.

10. References

  • Call logs with Ali Jazayeri
  • Incident notes (A’rabi’s call reports)
  • Monitoring dashboard (SSO latency/error rates – gaps identified)