Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On August 6th, 2019, our system was under heavy load for about 3 hours, between 8:45 AM and 11:45 AM PST.

Scope of impact

At the peak of the incident period, about 3% of all requests made to the system were failing. Service was degraded for about a dozen sites, and some customers experienced errors that indicated requests to their Shotgun sites were being throttled. This would have appeared as an "HTTP 503 Error".

What happened?

To ensure new features and services are suitable, adequately provisioned, and resilient under production conditions, Shotgun engineering follows the common practice of introducing them to production in "shadow" mode: online and functioning, but with the new capabilities hidden from end users.
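As a rough illustration of this pattern (all function names here are hypothetical, not Shotgun's actual code), a shadowed service runs alongside the existing one under real traffic, but only the existing path's output ever reaches the user:

```python
def handle_request(payload, shadow_enabled=True):
    """Serve a request, optionally exercising a new service in shadow mode."""
    result = current_service(payload)   # existing, user-visible path
    if shadow_enabled:
        try:
            new_service(payload)        # exercised under real production load...
        except Exception:
            pass                        # ...but its failures stay invisible to users
    return result                       # users only ever see the existing path's output

def current_service(payload):
    # stand-in for the existing production service
    return {"status": "ok", "echo": payload}

def new_service(payload):
    # stand-in for the recently introduced shadow service
    return {"status": "shadow", "echo": payload}
```

The value of this approach is that the new service sees genuine production load before it is ever user-visible; the incident below shows the flip side, where a shadowed service can still consume real capacity.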

During this incident, our alarms for high numbers of 503 errors were triggered. We identified that scaling of a recently introduced "shadow" service was causing an unexpected spike in the volume of upload progress indicator requests from the Web UI. Browser polling of background job completion status grew from 1% to 63% of all requests.

We reduced system load by completely disabling the new service.

What will be done to prevent this incident from happening again?

After resolving user access issues, we released code updates to prevent users' browsers from needlessly polling for background process status updates. We also optimized the number of jobs generated by the new service to further protect the performance of the system, and have been able to successfully re-shadow the new service with no adverse effects.
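A minimal sketch of the kind of client-side change described above, assuming a polling loop that backs off exponentially and stops as soon as the background job reports a terminal state (function names are illustrative, not Shotgun's actual API):

```python
import time

def poll_job_status(check_status, base_delay=1.0, max_delay=30.0,
                    max_polls=20, sleep=time.sleep):
    """Poll a background job, backing off between checks and stopping early.

    check_status -- callable returning the job's current status string
    sleep        -- injectable for testing; defaults to time.sleep
    """
    delay = base_delay
    for _ in range(max_polls):
        status = check_status()
        if status in ("complete", "failed"):
            return status                   # terminal state: stop polling entirely
        sleep(delay)
        delay = min(delay * 2, max_delay)   # exponential backoff caps request volume
    return "timeout"                        # give up rather than poll forever
```

Compared with fixed-interval polling, this bounds the request rate each browser can generate, so a surge in background jobs no longer translates into a proportional surge in status requests.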

Finally, we added additional alerts that will allow us to catch similar issues more rapidly.

Please be assured we are taking this degradation of service very seriously and that measures will be taken to prevent similar incidents.

Posted Aug 08, 2019 - 18:54 UTC

Resolved
Things have stabilized since we pushed a configuration change to the system. We are still monitoring the system and will push additional fixes to further improve things.
Posted Aug 07, 2019 - 01:08 UTC
Monitoring
We pushed a change and things are slowly improving. We are monitoring.
Posted Aug 06, 2019 - 18:59 UTC
Update
We are observing high load on our application servers, which triggers 503 errors for our heaviest users. We have identified a possible culprit and are working on mitigating the issue.
Posted Aug 06, 2019 - 18:19 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. Email notifications may also be delayed. This issue is under investigation.
Posted Aug 06, 2019 - 18:13 UTC
This incident affected: Flow Production Tracking and Notification Service.