During the incident window, some customers experienced intermittent failures when initiating new on-demand calls (including audio, IVR, and custom routing flows). The impact was primarily limited to call setup and routing: ongoing calls were largely unaffected, but customers using more complex routing logic reported higher failure rates and incomplete call initiations.
The issue was caused by excessive growth and load in a call-routing datastore and its related services, which crossed an operational threshold and degraded routing performance. Under this elevated load, dependent components (including SignalR hubs and callback handling) began reconnecting and failing intermittently, preventing new routing entries from being processed reliably. Our engineering team validated the health of the underlying Azure infrastructure and stabilized the affected routing components; service recovered as load normalized and the volume of routing entries decreased.
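As an illustration of the reconnection behavior described above, the sketch below shows a SignalR client configured with bounded automatic reconnect using the public @microsoft/signalr TypeScript client. The hub endpoint (/hubs/callRouting) and the log messages are hypothetical stand-ins, not our production code.

```typescript
import { HubConnectionBuilder, LogLevel } from "@microsoft/signalr";

// Hypothetical hub endpoint; the actual routing hub is internal.
const connection = new HubConnectionBuilder()
  .withUrl("/hubs/callRouting")
  // Retry after 0s, 2s, 10s, and 30s, then stop. Bounded backoff like this
  // limits the extra pressure reconnecting clients put on an already
  // degraded backend, which is the failure mode described above.
  .withAutomaticReconnect([0, 2000, 10000, 30000])
  .configureLogging(LogLevel.Warning)
  .build();

connection.onreconnecting((err) =>
  console.warn("routing hub reconnecting:", err?.message),
);
connection.onreconnected((connectionId) =>
  console.info("routing hub reconnected:", connectionId),
);
connection.onclose((err) =>
  console.error("routing hub closed after retries were exhausted:", err?.message),
);

connection.start().catch((err) => console.error("initial connect failed:", err));
```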
To prevent recurrence, we are enhancing monitoring and alerting on datastore growth and service load thresholds, implementing routing-data lifecycle management so entries cannot grow without bound, and improving observability to detect routing degradation earlier, before it affects customers.
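As a minimal sketch of what time-based lifecycle management for routing entries can look like (an in-memory stand-in, assuming a configurable TTL; RoutingEntry, ROUTING_TTL_MS, and sweepExpired are illustrative names, not our datastore's API):

```typescript
// A routing entry keyed by call ID, stamped with its creation time.
interface RoutingEntry {
  callId: string;
  target: string;
  createdAt: number; // epoch milliseconds
}

// Assumed TTL: entries older than 15 minutes are considered stale.
const ROUTING_TTL_MS = 15 * 60 * 1000;

const entries = new Map<string, RoutingEntry>();

function addEntry(entry: RoutingEntry): void {
  entries.set(entry.callId, entry);
}

// Periodically delete expired entries so the store's size tracks active
// calls rather than accumulating every call ever routed.
function sweepExpired(now: number = Date.now()): number {
  let removed = 0;
  for (const [callId, entry] of entries) {
    if (now - entry.createdAt > ROUTING_TTL_MS) {
      entries.delete(callId);
      removed++;
    }
  }
  return removed;
}

// Example usage: record an entry, then sweep once a minute. The removal
// count is the kind of metric the enhanced alerting above would watch
// (growth rate versus cleanup rate).
addEntry({ callId: "call-123", target: "agent-7", createdAt: Date.now() });
setInterval(() => {
  const removed = sweepExpired();
  console.log(`routing-entry sweep removed ${removed} stale entries`);
}, 60 * 1000);
```

In a production datastore the same idea is typically expressed as a native TTL or a scheduled cleanup job rather than an in-process sweep.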