📝 Postmortem Report - SSO Login Failure
1. Summary
- Incident ID/Name: SSO-Login-Failure-2025-09-13
- Date & Time: 13 September 2025, 08:00 – 09:30 (Iran time)
- Duration: ~1 hour 30 minutes
- Severity Level: SEV2 (critical user-facing login issue)
- Systems Affected: SSO service, login across applications
- Impact on Users/Business: Users were unable to log in to the application during the incident. Exact user count unknown, but all login attempts via SSO failed for ~90 minutes.
2. Incident Timeline
(Iran time)
- 08:00 – User report received (A’rabi called Ali to report that login was not working in the app).
- 08:05 – On-call acknowledged and verified login failures.
- 08:10 – Initial assumption: no changes had been deployed, so an external/system issue was suspected.
- 08:30 – Continued user impact confirmed, issue escalated.
- 09:00 – Follow-up with Ali Jazayeri revealed one of the SSO servers was down.
- 09:15 – Server restored, login requests began succeeding again.
- 09:30 – Incident confirmed resolved, monitoring green.
3. Root Cause Analysis
- Immediate Cause: One of the SSO servers was malfunctioning.
- Underlying Cause: The SSO cluster lacked effective redundancy/failover handling, so a single unhealthy server could block login attempts.
- Why It Wasn’t Prevented/Detected Earlier: No automated health checks or alerting for individual SSO nodes; incident was detected via user reports rather than monitoring.
- Five Whys:
  - Why couldn’t users log in? → Requests were routed to a failed SSO server.
  - Why was the server failing? → The server was in a bad state (unresponsive).
  - Why wasn’t traffic rerouted? → The load balancer did not detect the failed node.
  - Why didn’t monitoring detect it? → No per-node health check was configured (see the health-check sketch after this list).
  - Why wasn’t that in place? → Redundancy was assumed to be sufficient, and node-failure scenarios had never been tested.
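To make the missing safeguard concrete, below is a minimal sketch of a per-node health probe. It assumes each SSO node exposes a hypothetical HTTP health endpoint (`/healthz` here) and that the node list is known in advance; the hostnames, port, timeout, and endpoint name are placeholders, not the actual production configuration.

```python
"""Minimal per-node SSO health probe (illustrative sketch only).

Assumes each SSO node exposes a hypothetical /healthz endpoint;
hostnames, port, and timeout below are placeholders.
"""
import urllib.request
import urllib.error

# Hypothetical SSO node addresses -- replace with the real inventory.
SSO_NODES = [
    "http://sso-node-1.internal:8080/healthz",
    "http://sso-node-2.internal:8080/healthz",
]

TIMEOUT_SECONDS = 3  # a node slower than this is treated as unhealthy


def check_node(url: str) -> bool:
    """Return True if the node answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout -> unhealthy.
        return False


def main() -> None:
    unhealthy = [url for url in SSO_NODES if not check_node(url)]
    if unhealthy:
        # In production this would page on-call and/or drain the node from
        # the load balancer; here it only prints the failing nodes.
        print("UNHEALTHY SSO nodes:", ", ".join(unhealthy))
    else:
        print("All SSO nodes healthy.")


if __name__ == "__main__":
    main()
```

Run periodically from a monitoring agent, or wired into the load balancer's own health-check mechanism so unhealthy nodes are drained automatically, a probe like this would have surfaced the failed node within minutes instead of waiting for a user call.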
4. Impact
- User Impact: All login attempts via SSO failed for ~1.5 hours. Number of affected users undetermined, but incident was business-critical.
- Internal Impact: Multiple escalations, delayed work for affected teams. On-call engineers (A’rabi, Ali) were engaged for ~90 minutes.
- Customer Communication: Users reported login issues directly; no official status page update or proactive communication was sent during the incident.
5. Resolution & Recovery
- What was done to restore service: Problematic SSO server was identified and fixed. Traffic resumed successfully.
- Time to Detection (TTD): ~5 minutes (via user call).
- Time to Mitigation (TTM): ~60 minutes (confirmation of server issue).
- Time to Recovery (TTR): ~90 minutes total.
6. What Went Well
- Quick user reporting enabled faster detection.
- Good collaboration between A’rabi and Ali.
- The root cause was identified and fixed without the need for a rollback or major intervention.
7. What Went Wrong
- Monitoring blind spot: No health check or alerting for SSO server failures.
- Detection depended on manual user reports rather than observability.
- Escalation path was unclear and caused a ~1-hour delay before the fix.
- Lack of failover handling in SSO cluster design.
8. Action Items
- [ ] Add per-node health checks for all SSO servers (Owner: SRE Team, Due: Sept 30)
- [ ] Add alerting for failed login spikes (Owner: Monitoring Team, Due: Oct 5); see the alert sketch after this list.
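To make the second action item concrete, the sketch below shows one way a failed-login spike check could work. It assumes per-window counts of SSO login attempts and failures are already available (for example from access logs or a metrics store); the thresholds and the `send_alert` hook are placeholders for whatever the Monitoring Team adopts.

```python
"""Failed-login spike check (illustrative sketch only).

Assumes per-window counts of SSO login attempts and failures are
available; thresholds and the alert hook are placeholders.
"""
from dataclasses import dataclass


@dataclass
class LoginWindow:
    """Login stats for one time window (e.g. the last 5 minutes)."""
    attempts: int
    failures: int


# Placeholder thresholds -- tune against real traffic before use.
MIN_ATTEMPTS = 20             # ignore windows with too little traffic
FAILURE_RATE_THRESHOLD = 0.5  # alert if more than 50% of logins fail


def should_alert(window: LoginWindow) -> bool:
    """Return True when the failure rate in the window looks like an outage."""
    if window.attempts < MIN_ATTEMPTS:
        return False  # not enough data to judge
    return (window.failures / window.attempts) > FAILURE_RATE_THRESHOLD


def send_alert(window: LoginWindow) -> None:
    """Placeholder alert hook; the real version would page on-call."""
    rate = window.failures / window.attempts
    print(f"ALERT: SSO login failure rate {rate:.0%} "
          f"({window.failures}/{window.attempts} in window)")


if __name__ == "__main__":
    # Example: 180 of 200 logins failing in the window triggers the alert.
    current = LoginWindow(attempts=200, failures=180)
    if should_alert(current):
        send_alert(current)
```

During this incident the SSO failure rate was effectively 100% for roughly 90 minutes, so even a coarse threshold like the one above would have paged on-call long before the 09:00 confirmation.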
9. Lessons Learned
- Redundancy is ineffective without proper health checks and automated failover.
- Relying solely on user reports delays detection and increases impact.
- Escalation procedures must be clear to reduce time-to-fix.
- Future: build resilience by designing the SSO cluster for fault tolerance.
10. References
- Call logs with Ali Jazayeri
- Incident notes (A’rabi’s call reports)
- Monitoring dashboard (SSO latency/error rates – gaps identified)