Running actions and timeout errors
Incident Report for HelloID
Postmortem

On Wednesday, October 18, 2023, we received multiple customer complaints about endlessly running actions in the HelloID provisioning platform that required manual intervention to resolve. Our investigation identified two distinct scenarios. The first occurred on PowerShell target systems with concurrent actions set to 1, and resulted in a two-hour waiting period between action executions. The second involved actions that remained stuck in a running state and were only marked as not running after a manual retry. Note that these actions were actually executed in the next enforcement run, as the action history shows.

After thorough investigation, we determined that the issue originated from a change in the connections with our event system that was introduced in the September release. The issue manifested only when a specific service was scaled down.

How did we respond?

Tools4ever engineers started the investigation at 08:00 UTC. Because the issue was both complex and rare, the investigation took considerable time.

On October 20 at 15:00 UTC, we began to suspect a correlation between occurrences of the issue and the timing of a service being scaled down. As a precaution, we disabled scaling for a single service and kept it at a fixed number within performance limits.

By October 23 at 07:00 UTC, we had confirmed that the issue was indeed caused by the downscaling of a specific service. This led us to two follow-up actions: removing all open actions that were stuck in the running state, and preparing a permanent fix for the issue.

On October 30 at 07:00 UTC, we ran a migration to remove all open actions that were still in the running state. A permanent fix is scheduled for the November release on November 13, at which point we expect the issue to be fully resolved.

Posted Nov 02, 2023 - 14:28 UTC

Resolved
This incident has been resolved.
Posted Oct 30, 2023 - 07:20 UTC
Update
A permanent fix will be released in the November release.
All running tasks that were stuck have now been cleaned up.
Posted Oct 30, 2023 - 07:20 UTC
Monitoring
We have implemented a temporary fix to prevent actions from getting stuck in the running status. Actions that currently have the running status will be cleaned up at a later date.

We will continue to monitor the situation.
Posted Oct 23, 2023 - 09:24 UTC
Investigating
In some cases, a lock-related issue with PowerShell actions may unintentionally persist in connectors. As a result, the queue backs up and actions time out.
We are investigating this issue.
Posted Oct 18, 2023 - 08:45 UTC
This incident affected: HelloID West Europe (Provisioning) and HelloID West America (Provisioning).