Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On Wednesday, April 1st, 2020, between 13:03 UTC and 15:19 UTC, some of our hosted clients experienced issues reaching their Shotgun site.

What happened?

After a scheduled maintenance ending at 13:00 UTC, one of our database cluster master nodes suddenly became unresponsive, probably due to a hardware failure. AWS RDS failed over to the secondary instance as expected. However, the Shotgun backend processes that were handling requests at the time of the failover did not detect the failure and remained stuck waiting for responses from the dead master. Some affected sites became unresponsive for end users because all of their backends were waiting on the database.
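
As an illustration only (this is not the Shotgun backend code, and the function names below are hypothetical), the failure mode can be sketched in Python. A client that waits on a connection with no read timeout blocks indefinitely when the peer stops answering. A bounded timeout lets the client notice the silence and reconnect, at which point it would reach the new primary.

```python
import socket
import threading


def query_with_timeout(host, port, payload, timeout_s):
    """Send payload and wait for a reply, giving up after timeout_s seconds.

    Returns the reply bytes, or None if the peer stayed silent.
    Hypothetical helper, not a real database driver call.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(payload)
        try:
            # Without a timeout, this recv() would block forever
            # if the peer has died without closing the connection.
            return conn.recv(4096)
        except socket.timeout:
            return None


def start_silent_server():
    """Stand-in for an unresponsive master: accepts connections, never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(5)
    held = []  # keep accepted sockets open so the client sees no reset

    def accept_loop():
        try:
            while True:
                conn, _ = srv.accept()
                held.append(conn)
        except OSError:
            pass  # listening socket was closed

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, held


if __name__ == "__main__":
    srv, _held = start_silent_server()
    port = srv.getsockname()[1]
    reply = query_with_timeout("127.0.0.1", port, b"SELECT 1", timeout_s=0.5)
    print("no reply, reconnect needed" if reply is None else reply)
    srv.close()
```

In a real deployment the equivalent safeguard would more likely be a driver-level statement or socket timeout, or TCP keepalive settings; which of these applies here is not stated in this report.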

The Shotgun team responded by manually restarting the affected sites as we identified them, in order to unblock clients.

At around 14:43 UTC, we performed a manual database failover, which unblocked the remaining backend processes that had not yet been restarted.

Scope of impact

Some hosted sites were inaccessible for a limited time during the incident.

What will be done to prevent this incident from happening again?

The following actions are in progress:

  • We are implementing a procedure to reduce mitigation time should this situation happen again.
  • We are making changes to our backend to prevent this specific situation from recurring.
Posted Apr 07, 2020 - 20:10 UTC

Resolved
This incident has been resolved.
Posted Apr 01, 2020 - 16:57 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 01, 2020 - 15:20 UTC
Update
We are noticing outages on some sites. We are still investigating the issue. Thank you for your patience.
Posted Apr 01, 2020 - 14:35 UTC
Update
We are continuing to investigate this issue.
Posted Apr 01, 2020 - 13:51 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. Email notifications may also be delayed. This issue is under investigation.
Posted Apr 01, 2020 - 13:05 UTC
This incident affected: Flow Production Tracking and Notification Service.