Some background jobs (including running checks and ingesting monitoring updates) could not be processed for approximately 14 hours, from 17:40 UTC on June 10th to 07:30 UTC on June 11th. This affected business-as-usual (BAU) operations and delayed customer workflows.
The cause was a large number of non-critical background jobs being created and placed in one of the job queues. Because this queue lacked effective prioritisation, more time-critical jobs were delayed.
The non-critical jobs were manually delayed and spread out over a number of days to allow the time-critical jobs to be processed.
We have now deployed a number of changes to prevent this from happening again, including better job prioritisation and improved retry logic for certain jobs.
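As an illustration of the kind of prioritisation described above, here is a minimal sketch of a queue that always serves time-critical jobs before non-critical ones, while keeping first-in, first-out order within each priority level. It is not the deployed implementation: the class name `PriorityJobQueue`, the priority constants, and the job names are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels for this sketch; a lower number is served first.
TIME_CRITICAL = 0
NON_CRITICAL = 10


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that preserves enqueue order within a priority
    name: str = field(compare=False)


class PriorityJobQueue:
    """In-memory queue that dequeues by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, name, priority):
        heapq.heappush(self._heap, Job(priority, next(self._counter), name))

    def dequeue(self):
        # Removes and returns the name of the highest-priority, oldest job.
        return heapq.heappop(self._heap).name

    def __len__(self):
        return len(self._heap)
```

With a queue like this, a time-critical job enqueued after a large batch of non-critical jobs is still the next one to be processed, rather than waiting behind the whole batch.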
We have also enhanced our alerting infrastructure to detect these types of delays earlier.
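One common way to detect this kind of delay is to alert on the age of the oldest job still waiting in a queue. The sketch below shows that idea only; it does not describe the actual alerting setup. The function names and the 15-minute threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; a real value would depend on how time-critical the queue is.
DEFAULT_THRESHOLD = timedelta(minutes=15)


def oldest_job_age(enqueued_at_times, now):
    """Return how long the oldest waiting job has been queued (zero if the queue is empty)."""
    if not enqueued_at_times:
        return timedelta(0)
    return now - min(enqueued_at_times)


def should_alert(enqueued_at_times, now, threshold=DEFAULT_THRESHOLD):
    """Return True when the oldest waiting job has been queued longer than the threshold."""
    return oldest_job_age(enqueued_at_times, now) > threshold
```

Measuring the oldest job's age rather than the queue length means a small number of stuck time-critical jobs still triggers an alert, even when the queue itself looks short.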