On Thursday, February 13th, 2020, between 9:05 UTC and 10:25 UTC, many of our hosted clients experienced intermittent issues reaching their Shotgun site.
During a scheduled maintenance starting at 9:00 UTC, a number of 500 error spikes occurred within the first 15 minutes and initially appeared to be settling. As the maintenance continued, larger spikes began occurring at around 9:20 UTC and load on the application servers increased drastically. We received our first reports from clients experiencing site accessibility issues at around 9:30 UTC.
By 9:35 UTC the maintenance was complete. We continued investigating client reports while monitoring our infrastructure. Although some metrics had improved, we decided that rolling back the maintenance changes would be the safest course of action.
We experienced a number of new 500 error spikes during the rollback, which was completed successfully at 10:10 UTC. Some sites failed to come back online after the rollback and required further intervention. All sites were back online by 10:25 UTC.
During another recent maintenance, an infrastructure component was updated. The update introduced a security feature that limits the number of threads our virtualised software containers are permitted to spawn. This feature was not well documented by the vendor, so we were not aware of the impact it would have on our particular workloads. In particular, some of our maintenance operations involve automated failover processes which spawn additional processes within partner containers. Because of the newly applied limit, these processes were unable to spawn, so a site's ability to respond to queries was diminished or, in some cases, completely absent.
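As an illustration of this kind of limit, the following is a minimal sketch only. The component and feature involved are not named here, so the sketch assumes a Linux container whose task limit is exposed through the cgroup v2 pids controller files `pids.max` and `pids.current`. The mount path and all function names are hypothetical and do not describe our actual tooling. It shows how a failover step could check whether enough headroom remains before trying to spawn extra processes.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 unified hierarchy, mounted at this path inside the container.
CGROUP_DIR = Path("/sys/fs/cgroup")


def parse_limit(raw: str) -> Optional[int]:
    """Parse the contents of pids.max; None means no limit is set ("max")."""
    value = raw.strip()
    return None if value == "max" else int(value)


def spawn_headroom(limit: Optional[int], current: int) -> Optional[int]:
    """Tasks (threads and processes) that can still be created; None if unlimited."""
    if limit is None:
        return None
    return max(limit - current, 0)


def can_spawn(limit: Optional[int], current: int, needed: int) -> bool:
    """True when at least `needed` more tasks fit under the limit."""
    headroom = spawn_headroom(limit, current)
    return headroom is None or headroom >= needed


def read_headroom(cgroup_dir: Path = CGROUP_DIR) -> Optional[int]:
    """Read the live values for the current container (hypothetical helper)."""
    limit = parse_limit((cgroup_dir / "pids.max").read_text())
    current = int((cgroup_dir / "pids.current").read_text())
    return spawn_headroom(limit, current)
```

With a limit of 512 and 500 tasks already running, `can_spawn(512, 500, 20)` returns `False`: a failover that needs 20 more tasks would fail to start them, which matches the behaviour described above.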
We have determined a configuration for the new security feature that suits our workloads and have already applied it during a recent maintenance.
We also continue to make headway in migrating all of our infrastructure to AWS, where our architecture will be more resilient to these types of issues.