On Monday, September 11th, 2017, from 15h15 PDT to 18h00 PDT, we experienced a partial outage that caused downtime for some clients. The root cause of the incident was faulty routing equipment at our data center: a switch was behaving erratically, and all servers behind that switch were periodically unavailable during that period. A regression in the application, triggered by our failover procedures, caused further downtime for all sites between 18h00 PDT and 18h30 PDT.
One of our application servers started failing at 15h15 PDT. This was quickly caught by our monitoring, and the sites on this server were back up 15 minutes later on a failover server. During that time, we notified our data center, which started to look into the issue.
At 16h00 PDT, another server started to go offline periodically. We proceeded to move the sites on this second server to our failover server. At that point, we started to suspect that a wider issue, probably related to networking, was going on. This was confirmed at 16h28 PDT by the data center: a switch was displaying strange behavior. At 16h35 PDT, we lost a third application server. We then quickly reconfigured another server into a third failover application server and moved sites to it. Just before 17h00 PDT, the sites on that third server were back up.
Finally, the data center confirmed the issue with the switch, and all the Shotgun servers behind that switch were identified. The data center informed us that the switch would first be rebooted; if the reboot didn't resolve the issue, servers would be moved behind another switch. We had a fourth server served by that switch: our memory object caching system (MOCS). Before the reboot, at 17h30 PDT, we switched our MOCS to its hot failover to avoid the performance impact that would be observable without the MOCS. This is when the second phase of the incident started.
The failover procedure for our MOCS didn't perform as expected. Like all our failover procedures, it is something we execute fairly often, every time we perform maintenance on our servers. But sometime after the switchover, around 18h00 PDT, we started to observe unexpected errors in the system. The error was not obvious, and while we suspected the MOCS switchover was the cause, we couldn't explain it: the failover MOCS was handling requests and was healthy. At 18h15 PDT we finally got a clear call stack, pointing to a recent optimization in Shotgun that prevented us from hot switching our MOCS. We then proceeded with a quick reboot of all web servers to clear their cache, and all sites were back up at 18h30 PDT.
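To illustrate the general failure mode described above (not Shotgun's actual code), here is a minimal sketch assuming the optimization cached the resolved MOCS endpoint once per process: a hot failover then changes the active endpoint, but already-running processes keep talking to the old one until they are restarted. All names (MOCS_ENDPOINTS, CacheClient) are hypothetical.

```python
# Hypothetical sketch of a process-level cache defeating a hot failover.
# MOCS_ENDPOINTS, ACTIVE and CacheClient are illustrative names only.

MOCS_ENDPOINTS = {"primary": "10.0.0.5:11211", "failover": "10.0.0.6:11211"}
ACTIVE = "primary"

class CacheClient:
    """Resolves the MOCS endpoint once, at construction time."""
    def __init__(self):
        # The "optimization": resolve once and keep the result for the
        # lifetime of the process -- this is what breaks hot switching.
        self.endpoint = MOCS_ENDPOINTS[ACTIVE]

    def get(self, key):
        # Stand-in for a real network fetch.
        return "fetch %s from %s" % (key, self.endpoint)

client = CacheClient()                 # web server process starts
ACTIVE = "failover"                    # operators flip MOCS to the hot standby

stale = client.get("session")          # still talks to the old endpoint
fresh = CacheClient().get("session")   # a restarted process sees the new one
```

In this model, rebooting the web servers (as described above) is exactly what replaces the stale `CacheClient` instances with fresh ones pointing at the failover endpoint.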
In the first phase of the incident, a small percentage of our sites were down for a period of up to 60 minutes. During that period, Shotgun was temporarily inaccessible for these clients.
In the second phase of the incident, for a period of 27 minutes, all sites were available, but requests going through the faulty code path were failing.
We are very sorry for the disruption. Be assured that we are continuously working to improve the reliability of our service.
The Shotgun Team