Skip to content

📝 Postmortem Report: Slow Query – September 4–9, 2025

1. Summary

  • Incident ID/Name: Slow Query Impact on Backoffice Orders Panel
  • Date & Time: 04-09-2025 11:42 AM – 07-09-2025 16:25 (Iran time)
  • Duration: \~3 days, 4 hours, 43 minutes
  • Severity Level: SEV2 (major user-facing impact, no full outage)
  • Systems Affected: Django API (GET /order/api/v1/panel/orders), Backoffice Panel
  • Impact on Users/Business: Backoffice panel became sluggish and partially unusable for staff. Queries consistently took >8s, causing delayed order management and frustration for \~100% of internal operators.

2. Incident Timeline

  • 11:42 – Sept 4 – First reports of sluggish order panel.
  • 12:05 – Sept 4 – Engineers confirmed high DB latency on order queries.
  • 14:30 – Sept 4 – Initial suspicion: joins + unnecessary deduplication.
  • Sept 5–6 – Debugging and query performance testing in staging.
  • 15:00 – Sept 7 – Root cause identified: overuse of distinct() in Django QuerySet.
  • 16:00 – Sept 7 – Code refactored: removed distinct(), replaced with scoped prefetch_related.
  • 16:25 – Sept 7 – Deployment complete. API latency dropped from \~8.5s to \~2.5s. Incident closed.

3. Root Cause Analysis

  • Immediate Cause: Use of distinct() on a large QuerySet in the order list endpoint.
  • Underlying Cause: Incorrect query optimization. distinct() was applied to eliminate duplication caused by joins instead of restructuring prefetch logic.
  • Why It Wasn’t Prevented/Detected Earlier: Query performance testing did not include production-level data volume. Monitoring focused on endpoint uptime, not latency degradation.
  • Five Whys:

  • Why was the panel slow? → Database queries took >8s.

  • Why were queries so slow? → distinct() forced full dataset sort + deduplication.
  • Why was distinct() used? → To handle duplicates from prefetch_items.
  • Why were duplicates generated? → Prefetch strategy was overly broad.
  • Why wasn’t this caught? → No performance regression tests for large datasets.

4. Impact

  • User Impact: 100% of internal operators faced degraded performance. Average response time 8.5s. Some order workflows delayed by hours.
  • Internal Impact: Engineering lost \~2 days debugging and testing. On-call load increased. Release cycle delayed by 1 day.
  • Customer Communication: No external customer impact. Internal team updated via Slack channel.

5. Resolution & Recovery

  • What was done:

  • Removed distinct() from queries.

  • Replaced prefetch_items with scoped prefetch_related(Prefetch(...)).
  • Deployed optimized code to production.
  • TTD (Time to Detection): \~20 minutes after first report.
  • TTM (Time to Mitigation): \~2.5 days (root cause confirmed Sept 7, 15:00).
  • TTR (Time to Recovery): \~3 days, 4h, 43m total.

6. What Went Well

  • Engineers quickly confirmed DB as bottleneck.
  • Refactoring was straightforward and deployed safely.
  • Sentry traces gave clear before/after performance evidence.

7. What Went Wrong

  • Lack of load/performance tests with production-like data.
  • Over-reliance on distinct() instead of fixing root query design.
  • Lack of having Latency monitoring, so alerts did not fire until users complained.

8. Action Items

  • [ ] Add DB query performance dashboards (Owner: SRE1, Due: Sept 15)
  • [ ] Introduce regression tests with production-sized datasets (Owner: QA Lead)
  • [ ] Define SLOs for API latency & alert if >3s (Owner: SRE2, Due: Sept 20)

9. Lessons Learned

  • Query-level shortcuts (like distinct()) are dangerous at scale—optimize the data model instead.
  • Performance regressions must be tested with real-world data volumes.
  • Internal-facing systems deserve the same observability rigor as customer-facing endpoints.

10. References

  • Technical Report: Impact of Removing distinct() on API Performance
  • Acknowledgment: Special thanks to **Ms. Mana Shams ** (Backend Team) for investigation and resolution efforts