On Friday, November 20, 2020, from 04:00 UTC to 10:45 UTC, a small number of Shotgun sites were partially unavailable.
The incident was caused by a faulty application component, which returned errors instead of the expected responses.
A small number of clients (under 5%) whose sites were associated with the failed app component experienced HTTP 502 responses for one in every few requests made to Shotgun during the incident.
We are investigating why the affected app component entered a bad state and how to prevent a recurrence.
Our monitoring tools have already been improved to alert us to this type of failure more urgently in the future.
To further improve our resiliency, we are considering enhanced health checks that identify faulty components and remove them from service automatically.
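
As an illustration of the health-check idea only, here is a minimal Python sketch of removing a component from rotation after repeated failed probes. It does not describe Shotgun's actual infrastructure: the `ComponentPool` class, the `probe` callable, the component names, and the threshold of three consecutive failures are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComponentPool:
    """Hypothetical pool of app components that receive traffic."""

    probe: Callable[[str], bool]   # assumed check: True means the component is healthy
    failure_threshold: int = 3     # assumed: consecutive failed probes before removal
    in_rotation: List[str] = field(default_factory=list)
    removed: List[str] = field(default_factory=list)
    _failures: Dict[str, int] = field(default_factory=dict)

    def run_health_checks(self) -> None:
        """Probe each component once and eject those that keep failing."""
        for name in list(self.in_rotation):  # copy, since we may remove while iterating
            if self.probe(name):
                self._failures[name] = 0     # a healthy probe resets the failure count
                continue
            self._failures[name] = self._failures.get(name, 0) + 1
            if self._failures[name] >= self.failure_threshold:
                self.in_rotation.remove(name)
                self.removed.append(name)


if __name__ == "__main__":
    # "app-2" stands in for a component stuck in a bad state.
    pool = ComponentPool(
        probe=lambda name: name != "app-2",
        in_rotation=["app-1", "app-2", "app-3"],
    )
    for _ in range(3):
        pool.run_health_checks()
    print(pool.in_rotation)  # ['app-1', 'app-3']
```

Requiring several consecutive failures before removal is one way to avoid ejecting a component over a single transient error; the right threshold and probe interval would depend on the real system.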