Connectivity issues in North America
Incident Report for Boostlingo
Postmortem

Between 12:30 UTC and 15:00 UTC on July 23rd, 2023, a third-party managed websocket service used by Boostlingo went down. The third party determined that an automated OS kernel deployment introduced configuration issues.

The third party mitigated the issue by rolling back to the previous OS image version containing a healthy configuration.

Boostlingo uses this websocket service for one-to-many as well as one-to-one communication, namely from the Boost servers to web and mobile applications. The websocket service is an ideal fit for our on-demand calling functionality because clients don't need to "poll" an API for data; instead, the server is able to "push" updates down to clients.
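To illustrate the difference between polling and pushing, here is a minimal in-memory sketch. It does not use the actual websocket service; `UpdateHub`, `events_since`, and the `incoming_call` event name are hypothetical stand-ins chosen only for this example.

```python
from typing import Callable, List


class UpdateHub:
    """Toy in-memory stand-in for a server that can push updates (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []
        self._log: List[str] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        # A "push" client registers once and is then called for every event.
        self._subscribers.append(handler)

    def publish(self, event: str) -> None:
        self._log.append(event)
        for handler in self._subscribers:
            handler(event)

    def events_since(self, cursor: int) -> List[str]:
        # A "poll" client must keep asking for anything newer than its cursor.
        return self._log[cursor:]


def demo() -> tuple:
    hub = UpdateHub()

    pushed: List[str] = []
    hub.subscribe(pushed.append)  # push client: one registration, no repeated requests

    polled: List[str] = []
    poll_requests = 0
    cursor = 0
    for tick in range(5):
        if tick == 3:
            hub.publish("incoming_call")
        # The polling client asks on every tick, whether or not anything happened.
        new_events = hub.events_since(cursor)
        poll_requests += 1
        polled.extend(new_events)
        cursor += len(new_events)

    return pushed, polled, poll_requests
```

In this sketch both clients end up with the same event, but the polling client makes a request on every tick while the push client makes none after subscribing.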

That said, an issue with this service impacts most real-time functionality in the web and mobile apps. This incident affected the ability of interpreters using our apps to receive calls and of requestors using our apps to place calls, and it also affected push notifications to web and mobile. The last of these could affect things like the async notifications for call log file downloads and in-web toast appointment notifications.

At this time we have support tickets open with this third party for deeper root cause analysis and for information on how they intend to prevent disruptions like this in the future. We have also begun an initial investigation into alternatives to this service, in case issues like this recur. Because most Boost functionality was still responsive and the incident (fortunately) fell during our most "off-peak" hours, monitoring triggers did not fire as quickly as we would have liked. We have refined our triggers based on this type of error so that we can be more proactive in reaching out with workarounds.
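As a general illustration of why low off-peak volume can delay an alert, the sketch below compares a trigger based on an absolute failure count with one based on the failure ratio. This is an assumption made for illustration only: the function names and thresholds are hypothetical and do not describe the actual monitoring triggers or how they were refined.

```python
def count_trigger(failures: int, min_failures: int = 50) -> bool:
    """Fire only once the absolute number of failures is high (hypothetical threshold)."""
    return failures >= min_failures


def ratio_trigger(
    failures: int,
    attempts: int,
    max_ratio: float = 0.5,
    min_attempts: int = 5,
) -> bool:
    """Fire when a large share of attempts fail, even if traffic is low.

    min_attempts guards against alerting on one or two stray failures.
    All thresholds here are hypothetical.
    """
    if attempts < min_attempts:
        return False
    return failures / attempts >= max_ratio


# Off-peak: only 8 connection attempts in the window, and all of them fail.
off_peak_count = count_trigger(failures=8)
off_peak_ratio = ratio_trigger(failures=8, attempts=8)

# Peak: 1000 attempts with 60 unrelated failures (a 6% failure ratio).
peak_count = count_trigger(failures=60)
peak_ratio = ratio_trigger(failures=60, attempts=1000)
```

With these illustrative numbers, the count-based trigger stays silent during the off-peak outage but fires on ordinary peak-hour noise, while the ratio-based trigger does the opposite.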

It's important to note that all IVR, Direct Dial, and SIP calls could still be placed. If the BPIN was enabled for those accounts, they were also serviced by integrations that do not depend on the websocket service (i.e., the calls would have successfully reached an interpreter). We did not consider DAP, since the functionality still available in the platform far exceeded the capabilities of DAP. Additionally, all onsite appointment functionality and most of the web portal functionality (other than the caller and push notifications mentioned above) were up the whole time.

We are sorry for the inconvenience this caused, especially to interpreters who were attempting to work and service calls during these hours. We will update this postmortem when we receive any additional information from the third-party managed service.

Thanks

Boost Team

Posted Jul 26, 2023 - 20:14 PDT

Resolved
This incident has been resolved.
Posted Jul 23, 2023 - 07:30 PDT
Monitoring
The third party has fixed the issue and we are continuing to monitor the situation.
Posted Jul 23, 2023 - 07:12 PDT
Update
We are continuing to work on a fix for this issue.
Posted Jul 23, 2023 - 07:11 PDT
Identified
An issue has been identified with a 3rd party provider who is working to address the issue. The most recent update from the 3rd party is below.

23 minutes ago
Impact Statement: Starting at 12:30 UTC on 23 Jul 2023, you've been identified as a customer using Azure SignalR Service in West US 2 who may experience connectivity issues and failures with service management operations.



Current Status: The third party determined that a recent deployment introduced a configuration error that caused backend components of Azure SignalR to become unhealthy. To address this issue, we have rolled back to a recent deployment containing a successful configuration and we are currently monitoring to ensure this will recover the service. The next update will be provided in 60 minutes, or as events warrant.

1 hour ago
Impact Statement: Starting at 12:30 UTC on 23 Jul 2023, you've been identified as a customer using Azure SignalR Service in West US 2 who may experience connectivity issues and failures with service management operations.

Current Status: We are aware of this issue and are actively investigating. The next update will be provided within 60 minutes, or as events warrant.

1 hour ago
We are actively investigating a service event for Azure SignalR Service in West US 2. More details will be provided shortly.
Posted Jul 23, 2023 - 06:46 PDT
This incident affected: Boostlingo Voice (Boostlingo Voice IVR, Boostlingo Network Traversal Service, Boostlingo Speech Recognition) and Boostlingo Video (Boostlingo Group Rooms, Boostlingo Communication REST API).