On Wednesday, the 18th, we received multiple customer complaints about actions in the HelloID provisioning platform that ran endlessly and required manual intervention to resolve. Our investigation identified two distinct scenarios. The first occurred on PowerShell Target Systems with concurrent actions set to 1 and resulted in a two-hour wait between action executions. In the second, actions remained stuck in a running state and were only marked as not running after a manual retry. Note that these actions were in fact executed in the next enforcement run, as the action history shows.
We determined that the issue originated from a change, introduced in the September release, to the connections with our event system. The issue manifested only when a specific service scaled down.
How did we respond?
Tools4ever engineers began the investigation at 08:00 UTC. Because the issue was complex and occurred rarely, the investigation took considerable time.
On the 20th at 15:00 UTC, we began to suspect a correlation between occurrences of the issue and the timing of a service downscaling. As a precaution, we disabled scaling for a single service and kept it at a fixed size within performance limits.
By the 23rd at 07:00 UTC, we confirmed that the issue was indeed caused by the downscaling of a specific service, which led us to take two immediate actions. The first was to remove all running open actions, and the second was to prepare a permanent fix for the issue.
On the 30th at 07:00 UTC, we executed a migration to remove all running open actions. A permanent fix is scheduled for the November release on the 13th, at which point we expect the issue to be fully resolved.