Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On Tuesday, November 26, 2019, from 00:00 UTC to 12:30 UTC, Shotgun was partially unavailable to some clients.

What happened?

Clients coming online during the incident experienced intermittent page loading timeouts and subsequent "Server under heavy load" (503) errors.

The incident followed the deployment of a new version of Shotgun the previous night, an operation that clients do not usually notice. The root cause was new frontend page restore logic introduced in this version, which was intended to improve performance when restoring browser sessions. This logic relied on an existing health endpoint that could not handle the volume of concurrent calls made when many clients restored their browser sessions at once. The resulting long timeouts filled the site processing queues.
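
As a rough illustration of why long timeouts fill processing queues, Little's law says the average number of requests in flight equals the arrival rate multiplied by the time each request is held. The sketch below uses made-up figures, not measurements from this incident; the function name and all constants are hypothetical.

```python
def in_flight_requests(arrival_rate_per_s: float, seconds_per_request: float) -> float:
    """Little's law: average concurrent requests = arrival rate * time in system."""
    return arrival_rate_per_s * seconds_per_request


# Hypothetical figures for illustration only (not taken from the incident).
ARRIVALS_PER_S = 50.0   # health calls per second as browser sessions restore
FAST_REPLY_S = 0.05     # a healthy endpoint answering in 50 ms
TIMEOUT_S = 30.0        # an overloaded endpoint holding each call until it times out

healthy = in_flight_requests(ARRIVALS_PER_S, FAST_REPLY_S)
overloaded = in_flight_requests(ARRIVALS_PER_S, TIMEOUT_S)

print(f"healthy: {healthy:.1f} requests in flight")
print(f"overloaded: {overloaded:.1f} requests in flight")
```

With the same arrival rate, the number of requests held open grows in proportion to the time each one waits, so a slow endpoint can occupy far more queue slots than a fast one.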

Scope of impact

Some clients experienced page loading timeouts when coming online during the incident, whilst other clients with many users coming online at around the same time experienced a high volume of 503 errors.

What will be done to prevent this incident from happening again?

We have reverted the change that caused the incident and will work on an alternative solution that improves performance without relying on the health endpoint. The health endpoint will also be made more robust so that it replies faster and does not time out.
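
One common way to make a health endpoint reply quickly under load is to cache the result of the underlying check for a short period, so that a burst of calls triggers at most one real probe. The following is a minimal sketch of that general technique, not a description of the actual Shotgun fix; `CachedHealthCheck`, `probe`, and the TTL value are all hypothetical names and assumptions.

```python
import threading
import time
from typing import Callable, Optional


class CachedHealthCheck:
    """Serve a recent health result instead of probing on every request."""

    def __init__(
        self,
        probe: Callable[[], bool],           # hypothetical: the expensive real check
        ttl_seconds: float = 5.0,            # assumed cache lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._cached: Optional[bool] = None
        self._expires_at = 0.0

    def status(self) -> bool:
        now = self._clock()
        # Callers arriving during a refresh wait for that single probe
        # rather than each starting their own.
        with self._lock:
            if self._cached is None or now >= self._expires_at:
                self._cached = self._probe()
                self._expires_at = now + self._ttl
            return self._cached
```

The trade-off is staleness: a status change is reported up to one TTL late, in exchange for bounding the number of real checks per interval regardless of how many clients call at once.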

To ensure we catch this and similar problems in the future, we will introduce improved monitoring and high-priority alerts.

Posted Nov 28, 2019 - 16:23 UTC

Resolved
This incident has been resolved.
Posted Nov 26, 2019 - 15:01 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 26, 2019 - 13:41 UTC
Update
We are continuing to investigate this issue.
Posted Nov 26, 2019 - 13:13 UTC
Update
We are continuing to investigate this issue.
Posted Nov 26, 2019 - 11:36 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. Email notifications may also be delayed. This issue is under investigation.
Posted Nov 26, 2019 - 10:14 UTC
This incident affected: Flow Production Tracking and Notification Service.