📝 Postmortem Report: Slow Query – September 4–9, 2025
1. Summary
- Incident ID/Name: Slow Query Impact on Backoffice Orders Panel
- Date & Time: 04-09-2025 11:42 AM – 07-09-2025 16:25 (Iran time)
- Duration: \~3 days, 4 hours, 43 minutes
- Severity Level: SEV2 (major user-facing impact, no full outage)
- Systems Affected: Django API (
GET /order/api/v1/panel/orders), Backoffice Panel - Impact on Users/Business: Backoffice panel became sluggish and partially unusable for staff. Queries consistently took >8s, causing delayed order management and frustration for \~100% of internal operators.
2. Incident Timeline
- 11:42 – Sept 4 – First reports of sluggish order panel.
- 12:05 – Sept 4 – Engineers confirmed high DB latency on order queries.
- 14:30 – Sept 4 – Initial suspicion: joins + unnecessary deduplication.
- Sept 5–6 – Debugging and query performance testing in staging.
- 15:00 – Sept 7 – Root cause identified: overuse of
distinct()in Django QuerySet. - 16:00 – Sept 7 – Code refactored: removed
distinct(), replaced with scopedprefetch_related. - 16:25 – Sept 7 – Deployment complete. API latency dropped from \~8.5s to \~2.5s. Incident closed.
3. Root Cause Analysis
- Immediate Cause: Use of
distinct()on a large QuerySet in the order list endpoint. - Underlying Cause: Incorrect query optimization.
distinct()was applied to eliminate duplication caused by joins instead of restructuring prefetch logic. - Why It Wasn’t Prevented/Detected Earlier: Query performance testing did not include production-level data volume. Monitoring focused on endpoint uptime, not latency degradation.
-
Five Whys:
-
Why was the panel slow? → Database queries took >8s.
- Why were queries so slow? →
distinct()forced full dataset sort + deduplication. - Why was
distinct()used? → To handle duplicates fromprefetch_items. - Why were duplicates generated? → Prefetch strategy was overly broad.
- Why wasn’t this caught? → No performance regression tests for large datasets.
4. Impact
- User Impact: 100% of internal operators faced degraded performance. Average response time 8.5s. Some order workflows delayed by hours.
- Internal Impact: Engineering lost \~2 days debugging and testing. On-call load increased. Release cycle delayed by 1 day.
- Customer Communication: No external customer impact. Internal team updated via Slack channel.
5. Resolution & Recovery
-
What was done:
-
Removed
distinct()from queries. - Replaced
prefetch_itemswith scopedprefetch_related(Prefetch(...)). - Deployed optimized code to production.
- TTD (Time to Detection): \~20 minutes after first report.
- TTM (Time to Mitigation): \~2.5 days (root cause confirmed Sept 7, 15:00).
- TTR (Time to Recovery): \~3 days, 4h, 43m total.
6. What Went Well
- Engineers quickly confirmed DB as bottleneck.
- Refactoring was straightforward and deployed safely.
- Sentry traces gave clear before/after performance evidence.
7. What Went Wrong
- Lack of load/performance tests with production-like data.
- Over-reliance on
distinct()instead of fixing root query design. - Lack of having Latency monitoring, so alerts did not fire until users complained.
8. Action Items
- [ ] Add DB query performance dashboards (Owner: SRE1, Due: Sept 15)
- [ ] Introduce regression tests with production-sized datasets (Owner: QA Lead)
- [ ] Define SLOs for API latency & alert if >3s (Owner: SRE2, Due: Sept 20)
9. Lessons Learned
- Query-level shortcuts (like
distinct()) are dangerous at scale—optimize the data model instead. - Performance regressions must be tested with real-world data volumes.
- Internal-facing systems deserve the same observability rigor as customer-facing endpoints.
10. References
- Technical Report: Impact of Removing
distinct()on API Performance - Acknowledgment: Special thanks to **Ms. Mana Shams ** (Backend Team) for investigation and resolution efforts