On August 6th, 2019, our system was under heavy load for about 3 hours, between 8:45 AM and 11:45 AM PST.
At the peak of the incident period, about 3% of all requests made to the system were failing. Service was degraded for about a dozen sites, and some customers experienced errors indicating that requests to their Shotgun sites were being throttled. These errors would have appeared as an "HTTP 503 Error".
To ensure new features and services are suitable, adequately provisioned, and resilient to production conditions, Shotgun engineering follows the common practice of introducing them to production in "shadow" mode: online and functioning, but with new capabilities hidden from end users.
During this incident, our alarms for high numbers of 503 errors were triggered. We identified that scaling of a recently introduced "shadow" service was causing an unexpected spike in the volume of upload progress indicator requests from the Web UI. Browser polling of background job completion status grew from 1% to 63% of all requests.
We reduced system load by completely disabling the new service.
After resolving user access issues, we released code updates to prevent users' browsers from needlessly polling for background process status updates. We also optimized the number of jobs generated by the new service to further protect the performance of the system, and we have since been able to re-shadow the new service with no adverse effects.
Finally, we added alerts that will allow us to catch similar issues more rapidly.
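As an illustration only, one kind of alert that could catch a spike like the one described above is a check on each request type's share of total traffic. The sketch below is not Shotgun's actual alerting. The function name, the request-kind labels, the 10% limit, and the minimum window size are all hypothetical, chosen to mirror the jump from 1% to 63% seen during the incident.

```python
from collections import Counter


def share_alerts(request_kinds, max_share_by_kind, min_requests=100):
    """Return {kind: share} for each watched request kind whose share of
    the window's traffic exceeds its configured limit.

    request_kinds: iterable of labels, one per request in the window
        (hypothetical labels such as "job_status_poll").
    max_share_by_kind: assumed per-kind limits, e.g. {"job_status_poll": 0.10}.
    min_requests: skip windows too small for the ratio to be meaningful.
    """
    requests = list(request_kinds)
    total = len(requests)
    if total < min_requests:
        return {}

    counts = Counter(requests)
    alerts = {}
    for kind, limit in max_share_by_kind.items():
        share = counts[kind] / total  # Counter returns 0 for unseen kinds
        if share > limit:
            alerts[kind] = share
    return alerts


# A window resembling the incident: polling makes up 63 of 100 requests.
incident_window = ["job_status_poll"] * 63 + ["page_view"] * 37
print(share_alerts(incident_window, {"job_status_poll": 0.10}))
```

A ratio-based check of this kind fires on a change in the traffic mix even when the absolute error rate is still low, which is why it can complement an alarm on 503 counts.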
Please be assured we are taking this degradation of service very seriously and that measures will be taken to prevent similar incidents.