Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On Wednesday, July 22nd, 2020, from 6:25 to 7:10 PDT, Shotgun suffered a partial outage that made some clients' sites intermittently unavailable.

What happened?

The incident occurred during a routine release of Shotgun. The root cause was a spike in database utilisation, which prevented some sites from initializing as expected.

Scope of impact

Affected clients' Shotgun sites were intermittently unavailable for up to 45 minutes.

What will be done to prevent this incident from happening again?

We have upscaled our database components to make them more resilient to unexpected connection spikes.

We are also exploring ways to improve our database load distribution and deployment methods to further mitigate the risk of this type of incident in the future.

We are revising our monitoring in light of this incident so that we can identify this type of issue earlier.

Posted Aug 07, 2020 - 15:26 UTC

Resolved
This incident has been resolved.
Posted Jul 22, 2020 - 15:22 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 22, 2020 - 14:13 UTC
Identified
A few sites are down after an unexpected database restart. We are working on getting those back online ASAP.
Posted Jul 22, 2020 - 13:55 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. This issue is under investigation.
Posted Jul 22, 2020 - 13:35 UTC
This incident affected: Flow Production Tracking.