Some hosted sites are down
Incident Report for Flow Production Tracking
Postmortem

On Monday September 11th 2017, from 15h15 PDT to 18h00 PDT, we experienced a partial outage that caused downtime for some clients. The root cause of the incident was faulty routing equipment at our data center. A switch was behaving strangely, and all servers behind that switch were periodically unavailable during that period. A regression in the application, triggered by our failover procedures, caused further downtime for all sites between 18h00 PDT and 18h30 PDT.

What happened?

One of our application servers started failing at 15h15 PDT. This was quickly caught by our monitoring, and the sites on this server were back up 15 minutes later on a failover server. During that time, we notified our data center, which started to look into the issue.

At 16h00 PDT, another server started to periodically go offline. We proceeded to move the sites on this second server to our failover server. At that point, we started to suspect that a wider issue, probably related to networking, was going on. This was confirmed at 16h28 PDT by the data center: a switch was displaying strange behavior. At 16h35 PDT, we lost a third application server. We then quickly reconfigured another server as a third failover application server and moved the affected sites to it. Just before 17h00 PDT, the sites on that third server were back up.

Finally, the data center confirmed the issue with the switch, and all the Shotgun servers behind that switch were identified. The data center informed us that the switch would first be rebooted; if the reboot didn't resolve the issue, the servers would be moved behind another switch. A fourth server was served by that switch: our memory object caching system (MOCS). Before the reboot, at 17h30 PDT, we switched our MOCS to its hot failover to avoid the performance degradation that would be observable without it. This is when the second phase of the incident started.

The failover procedure for our MOCS didn't perform as expected. Like all our failover procedures, it is something we execute fairly often, every time we perform maintenance on our servers. But some time after the switch, around 18h00 PDT, we started to observe unexpected errors in the system. The error was not obvious, and while we suspected that the MOCS switch was the cause, we couldn't explain it, as the failover MOCS was handling requests and was healthy. At 18h15 PDT we finally got a clear call stack, pointing to a recent optimization in Shotgun that prevented us from hot-switching our MOCS. We then proceeded with a quick reboot of all web servers to clear their cache, and all sites were back up at 18h30 PDT.
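To make the nature of the regression more concrete, here is a minimal, hypothetical sketch of the kind of per-process caching optimization that can break a hot failover of a caching backend. The module, function, and host names are illustrative only and are not Shotgun's actual code.

```python
import socket

# Placeholder endpoint for the memory object caching system (MOCS).
_MOCS_HOST = "mocs.internal.example"
_MOCS_PORT = 11211

# The "optimization": resolve the MOCS address once and keep it for the
# lifetime of the web server process.
_cached_address = None


def mocs_address():
    """Return the MOCS address, resolving it only on the first call."""
    global _cached_address
    if _cached_address is None:
        _cached_address = (socket.gethostbyname(_MOCS_HOST), _MOCS_PORT)
    return _cached_address


# If the MOCS is hot-switched to its failover (for example by repointing a
# DNS record or a virtual IP), every long-lived process that has already
# populated _cached_address keeps contacting the old server. Restarting the
# web servers discards the cached value, which is why rebooting them
# restored service.
```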

Scope of impact

In the first phase of the incident, a small percentage of our sites were down for up to 60 minutes. During that period, Shotgun was inaccessible for these clients.

In the second phase of the incident, for a period of 27 minutes, all sites were available, but requests going through the faulty code path were failing.

What will be done to prevent this incident from happening again?

  • We will make sure that our disaster recovery procedures are more efficient for the case where multiple servers fail at the same time. We had to fall back to manual re-balancing, which increased the incident duration for the third server that went unavailable.
  • The issue in the back-end code has already been resolved, and the fix will be deployed shortly.
  • We will make sure that the dependency on the memory object caching system is tested as part of our Quality Assurance processes (a sketch of such a test follows this list).
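As an illustration of the kind of check we have in mind, below is a minimal, hypothetical pytest-style test that exercises a cache client across a simulated MOCS failover. The MocsClient class and its endpoints are stand-ins for illustration, not part of Shotgun's code.

```python
class MocsClient:
    """Toy stand-in for a memory-object-caching client that must keep
    working across a hot failover of its backend."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.active = self.endpoints[0]

    def failover(self):
        # Switch to the next endpoint, as a maintenance procedure would.
        i = self.endpoints.index(self.active)
        self.active = self.endpoints[(i + 1) % len(self.endpoints)]

    def get(self, key):
        # A real client would issue a network call to self.active here; we
        # return the endpoint so the test can observe where requests go.
        return f"{self.active}:{key}"


def test_requests_follow_mocs_failover():
    client = MocsClient(["mocs-primary:11211", "mocs-failover:11211"])
    assert client.get("session").startswith("mocs-primary")
    client.failover()
    # A regression like the one described above would keep requests pinned
    # to the primary after the switch and fail this assertion.
    assert client.get("session").startswith("mocs-failover")
```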

We are very sorry for the disruption. Be assured that we are continuously working to improve the reliability of our service.

The Shotgun Team

Posted Sep 14, 2017 - 12:46 UTC

Resolved
The system has been stable for many hours now. Thanks for your patience.
Posted Sep 12, 2017 - 13:42 UTC
Update
Shotgun is back under its normal configuration.
Posted Sep 12, 2017 - 01:36 UTC
Update
After spending the last couple of hours in failover configuration, we are now confident enough to put the system back in its normal configuration. The servers that were impacted by the faulty equipment have been moved behind a different switch. The reconfiguration should be transparent to users.
Posted Sep 12, 2017 - 12:18 UTC
Update
After the redirection, some sites kept stale connections to our object caching server. We restarted all our web servers, which resolved the issue.
Posted Sep 12, 2017 - 01:36 UTC
Monitoring
Traffic redirection is completed. In the meantime, our data center has rebooted the faulty switch. We will monitor for a while before coming out of failover mode.
Posted Sep 12, 2017 - 00:48 UTC
Identified
We are currently doing some reconfiguration to redirect traffic to servers that are not behind the faulty switch. The data center is also actively working on the faulty equipment, which is affecting many of our servers.
Posted Sep 12, 2017 - 00:29 UTC
Update
The issue is back. We are investigating with our data center provider.
Posted Sep 12, 2017 - 00:04 UTC
Monitoring
A faulty switch at the data center was causing issues; the issue is resolved for now. We are actively monitoring the situation.
Posted Sep 11, 2017 - 23:59 UTC
Update
Multiple app servers are unresponsive; we are working with our data center provider.
Posted Sep 11, 2017 - 23:27 UTC
Identified
We lost an app server. Traffic has been redirected to another app server.
Free trials are not yet working, and email notifications will be delayed for affected sites.
Posted Sep 11, 2017 - 22:38 UTC
Investigating
We are investigating the issue.
Posted Sep 11, 2017 - 22:25 UTC