Degraded transcoding performance
Incident Report for Flow Production Tracking
Postmortem

What happened?

On Friday, May 21, 2021, from 20:26 UTC to 22:36 UTC, the ShotGrid transcoding service experienced an outage that caused transcodes of uploaded Versions to fail, along with thumbnail generation for other, non-Version media, including Files, Note attachments, and annotations.

The cause of the issue was a codepath silently failing under a rare set of circumstances.

Scope of impact

The incident affected all hosted ShotGrid sites. During the outage window, media that uploaded successfully to ShotGrid was not transcoded, and the associated thumbnails were unavailable. Because of the unusual nature of the failure, the impact extended beyond Versions to Files, Note attachments, and annotations.

No media was lost. All media uploaded to ShotGrid during the incident was saved in the storage location for its ShotGrid site and held there until it could be transcoded after the transcoding service recovered.

What will we do to prevent this incident from happening again?

Following our investigation into the root cause, we are taking a number of measures to detect and prevent this type of outage in the future.

  • We'll be reviewing the code for our transcoding service to ensure the codepath that led to the incident is updated to prevent similar issues.
  • We're updating our monitoring tooling to ensure this type of outage is caught earlier.
Posted May 28, 2021 - 15:16 UTC

Resolved
This incident has been resolved.
Posted May 21, 2021 - 22:56 UTC
Monitoring
A fix has been implemented and our transcoding service is back online. Failed transcode jobs that were submitted during the incident will be reprocessed as the system catches up.
Posted May 21, 2021 - 22:46 UTC
Update
Our transcoding service is currently down. We are investigating the root cause.
Posted May 21, 2021 - 22:20 UTC
Investigating
The Shotgun Transcoding service is currently degraded. Users may experience longer than usual transcode times. We are currently investigating this issue.
Posted May 21, 2021 - 22:13 UTC
This incident affected: Transcoding Service.