On Wednesday, April 1st 2020, between 13:03 UTC and 15:19 UTC, some of our hosted clients experienced issues reaching their Shotgun site.
After a scheduled maintenance that ended at 13:00 UTC, one of our database cluster master nodes suddenly became unresponsive, most likely due to a hardware failure. AWS RDS failed over to the secondary instance as expected. However, the Shotgun backend processes that were handling requests at the time of the failover did not detect the failure and remained stuck waiting for responses from the dead master. Some affected sites became unresponsive to end users because all of their backend processes were waiting on the database.
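A common mitigation for this failure mode is to configure connection-level timeouts and TCP keepalives so that a client notices a dead primary instead of blocking indefinitely. Below is a minimal sketch assuming a PostgreSQL database accessed through psycopg2; the endpoint, credentials, and timeout values are illustrative, not our production configuration.

```python
# Minimal sketch: connection settings that let a client detect a dead
# primary rather than hang forever on a half-open TCP connection.
import psycopg2

conn = psycopg2.connect(
    host="db-primary.example.internal",  # hypothetical endpoint
    dbname="shotgun",
    user="app",
    password="secret",
    connect_timeout=5,       # fail fast if the primary is unreachable
    keepalives=1,            # enable TCP keepalives on the connection
    keepalives_idle=10,      # seconds of idleness before the first probe
    keepalives_interval=5,   # seconds between probes
    keepalives_count=3,      # drop the connection after 3 failed probes
    options="-c statement_timeout=30000",  # abort queries stuck > 30s
)
```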
The Shotgun team reacted by manually restarting the affected sites to unblock clients as we identified them.
Around 14:43 UTC, we performed a manual database failover, which unblocked the last backend processes that had not yet been restarted manually.
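For context, a manual failover on a Multi-AZ RDS instance can be triggered by rebooting the primary with failover forced. The sketch below uses boto3 under that assumption; the region and instance identifier are hypothetical.

```python
# Minimal sketch: forcing a manual RDS failover to the standby via boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is illustrative
rds.reboot_db_instance(
    DBInstanceIdentifier="shotgun-db-primary",  # hypothetical identifier
    ForceFailover=True,  # reboot through a failover to the standby instance
)
```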
Some hosted sites were inaccessible for a limited time during the incident.
The following actions are in progress: