Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On Thursday, January 21, 2021, from 07:00 UTC to 10:05 UTC, a small number of Shotgun sites were partially unavailable.

What happened?

The incident was caused by a faulty application component. Instead of returning the expected responses, this component returned errors to incoming requests.

Scope of impact

Fewer than 5% of clients were affected: those whose sites were associated with the failed application component. During the incident, these clients received HTTP 502 responses for one in every few requests made to Shotgun.

What will be done to prevent this incident from happening again?

We are investigating why the affected application component entered a bad state and how to prevent this from recurring.

To further improve our resiliency, we plan to implement enhanced health checks that identify and remove faulty components automatically.
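
The general idea of a health check that takes faulty components out of rotation can be sketched as follows. This is only an illustrative sketch under stated assumptions, not a description of the actual Shotgun implementation: the names `Component`, `HealthChecker`, `probe`, and `failure_threshold` are hypothetical, and a real probe would typically issue an HTTP request to a health endpoint rather than call a local function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    """A hypothetical application component behind a load balancer."""
    name: str
    probe: Callable[[], int]      # assumed to return an HTTP status code
    consecutive_failures: int = 0
    in_rotation: bool = True


class HealthChecker:
    """Removes a component from rotation after repeated failed probes."""

    def __init__(self, components: List[Component], failure_threshold: int = 3) -> None:
        self.components = components
        self.failure_threshold = failure_threshold

    def run_once(self) -> None:
        """Probe every component once and update its rotation status."""
        for component in self.components:
            try:
                healthy = 200 <= component.probe() < 300
            except Exception:
                # A probe that raises (timeout, refused connection) counts as a failure.
                healthy = False

            if healthy:
                # A single successful probe restores the component.
                component.consecutive_failures = 0
                component.in_rotation = True
            else:
                component.consecutive_failures += 1
                if component.consecutive_failures >= self.failure_threshold:
                    component.in_rotation = False

    def routable(self) -> List[str]:
        """Names of the components that should currently receive traffic."""
        return [c.name for c in self.components if c.in_rotation]
```

Requiring several consecutive failures before removal is one possible design choice: it avoids ejecting a component because of a single transient error, at the cost of a slightly slower reaction to a genuine fault.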

Posted Jan 22, 2021 - 16:36 UTC

Resolved
This incident has been resolved.
Posted Jan 21, 2021 - 10:46 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 21, 2021 - 10:19 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 21, 2021 - 10:12 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. Email notifications may also be delayed. This issue is under investigation.
Posted Jan 21, 2021 - 08:41 UTC
This incident affected: Flow Production Tracking and Notification Service.