Service Degradation - Shotgun
Incident Report for Flow Production Tracking
Postmortem

On Thursday, February 13th, 2020, between 9:05UTC and 10:25UTC, many of our hosted clients experienced intermittent issues reaching their Shotgun sites.

What happened?

During a scheduled maintenance that started at 9:00UTC, a number of 500 error spikes occurred within the first 15 minutes and initially appeared to be settling. As the maintenance continued, larger spikes began occurring at around 9:20UTC and load on the application servers increased drastically. We received our first reports from clients experiencing site accessibility issues at around 9:30UTC.

By 9:35UTC the maintenance was complete. We continued investigating client reports while monitoring our infrastructure. Whilst some metrics had improved, we decided that rolling back the maintenance changes would be the safest course of action.

We experienced a number of new 500 error spikes during the rollback, which completed successfully at 10:10UTC. Additionally, some sites failed to come back online after the rollback and required further intervention. All sites were back online by 10:25UTC.

What caused the incident?

During another recent maintenance, an infrastructure component was updated. The update introduced a security feature that limits the number of threads our virtualised software containers are permitted to spawn. This feature was not well documented by the vendor, so we were not aware of the impact it would have on our workloads. In particular, some of our maintenance operations involve automated failover processes which spawn additional processes within partner containers. Because of the newly applied limit, these processes were unable to spawn, so a site's ability to respond to queries was diminished or, in some cases, lost entirely.

What will be done to prevent this incident from happening again?

We have determined the optimal configuration for the new security feature that affected our infrastructure, and we have already applied it during a recent maintenance.

We also continue to make headway in migrating all of our infrastructure to AWS. On AWS, our architecture will be more resilient to these types of issues.

Posted Feb 24, 2020 - 09:29 UTC

Resolved
This incident has been resolved.
Posted Feb 13, 2020 - 10:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 13, 2020 - 10:23 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 13, 2020 - 09:46 UTC
Investigating
We are observing a high number of failed requests to the Shotgun service which may impact site availability for some clients. Email notifications may also be delayed. This issue is under investigation.
Posted Feb 13, 2020 - 09:38 UTC
This incident affected: Flow Production Tracking and Notification Service.