On Tuesday, November 26th, 2019, from 00:00 UTC to 12:30 UTC, Shotgun was partially unavailable to some clients.
Clients coming online during the incident experienced intermittent page loading timeouts and subsequent "Server under heavy load" (503) errors.
The incident occurred after the deployment of a new version of Shotgun the previous night, an operation that clients usually do not notice. The root cause was new frontend page restore logic introduced in this version, which was intended to improve performance when restoring browser sessions. This logic relied on an existing health endpoint that could not handle the volume of concurrent calls made as many clients restored their browser sessions at once. The resulting long timeouts filled the site processing queues.
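As a rough illustration of why slow health calls can fill a processing queue, the sketch below applies Little's law (average requests in flight = arrival rate x time each request spends in the system). All numbers, names, and the worker-slot model are hypothetical and chosen only for illustration; they are not measurements from the incident or a description of Shotgun's actual architecture.

```python
def in_flight_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of requests being held at once."""
    return arrival_rate_per_s * latency_s


def saturates(arrival_rate_per_s: float, latency_s: float, worker_slots: int) -> bool:
    """True when the held requests exceed the available processing slots."""
    return in_flight_requests(arrival_rate_per_s, latency_s) > worker_slots


# Hypothetical values, not taken from the incident.
WORKER_SLOTS = 100      # assumed processing slots available to a site
ARRIVAL_RATE = 50.0     # assumed health calls per second during session restores
FAST_LATENCY_S = 0.05   # assumed latency of a healthy endpoint
SLOW_LATENCY_S = 30.0   # assumed latency when calls run until a long timeout

fast_load = in_flight_requests(ARRIVAL_RATE, FAST_LATENCY_S)
slow_load = in_flight_requests(ARRIVAL_RATE, SLOW_LATENCY_S)

print(f"fast endpoint: {fast_load:.1f} requests in flight")
print(f"slow endpoint: {slow_load:.1f} requests in flight")
print("queue saturated (fast):", saturates(ARRIVAL_RATE, FAST_LATENCY_S, WORKER_SLOTS))
print("queue saturated (slow):", saturates(ARRIVAL_RATE, SLOW_LATENCY_S, WORKER_SLOTS))
```

Under these assumed values the call rate is the same in both cases. Only the time each call is held changes, and that alone is enough to exceed the available slots.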
Some clients experienced page loading timeouts when coming online during the incident, whilst other clients with many users coming online at around the same time experienced a high volume of 503 errors.
We have reverted the change that caused the incident and will work on an alternative solution to improve performance without relying on the health endpoint. The health endpoint will also be improved to make it more robust and to reply faster without timing out.
To ensure we catch this and similar problems in the future, we will introduce improved monitoring and high-priority alerts.