📝 Postmortem Report: Slow Query – September 4–9, 2025

1. Summary

Incident ID/Name: Slow Query Impact on Backoffice Orders Panel
Date & Time: 04-09-2025 11:42 AM – 07-09-2025 16:25 (Iran time)
Duration: \~3 days, 4 hours, 43 minutes
Severity Level: SEV2 (major user-facing impact, no full outage)
Systems Affected: Django API (GET /order/api/v1/panel/orders), Backoffice Panel
Impact on Users/Business: Backoffice panel became sluggish and partially unusable for staff. Queries consistently took >8s, causing delayed order management and frustration for \~100% of internal operators.

11:42 – Sept 4 – First reports of sluggish order panel.
12:05 – Sept 4 – Engineers confirmed high DB latency on order queries.
14:30 – Sept 4 – Initial suspicion: joins + unnecessary deduplication.
Sept 5–6 – Debugging and query performance testing in staging.
15:00 – Sept 7 – Root cause identified: overuse of distinct() in Django QuerySet.
16:00 – Sept 7 – Code refactored: removed distinct(), replaced with scoped prefetch_related.
16:25 – Sept 7 – Deployment complete. API latency dropped from \~8.5s to \~2.5s. Incident closed.

Immediate Cause: Use of distinct() on a large QuerySet in the order list endpoint.
Underlying Cause: Incorrect query optimization. distinct() was applied to eliminate duplication caused by joins instead of restructuring prefetch logic.
Why It Wasn’t Prevented/Detected Earlier: Query performance testing did not include production-level data volume. Monitoring focused on endpoint uptime, not latency degradation.
Five Whys:
Why was the panel slow? → Database queries took >8s.
Why were queries so slow? → distinct() forced full dataset sort + deduplication.
Why was distinct() used? → To handle duplicates from prefetch_items.
Why were duplicates generated? → Prefetch strategy was overly broad.
Why wasn’t this caught? → No performance regression tests for large datasets.

User Impact: 100% of internal operators faced degraded performance. Average response time 8.5s. Some order workflows delayed by hours.
Internal Impact: Engineering lost \~2 days debugging and testing. On-call load increased. Release cycle delayed by 1 day.
Customer Communication: No external customer impact. Internal team updated via Slack channel.

Lack of load/performance tests with production-like data.
Over-reliance on distinct() instead of fixing root query design.
Lack of having Latency monitoring, so alerts did not fire until users complained.

Query-level shortcuts (like distinct()) are dangerous at scale—optimize the data model instead.
Performance regressions must be tested with real-world data volumes.
Internal-facing systems deserve the same observability rigor as customer-facing endpoints.

Technical Report: Impact of Removing distinct() on API Performance
Acknowledgment: Special thanks to **Ms. Mana Shams ** (Backend Team) for investigation and resolution efforts