The Day a Restart Saved Us

A restart in time saves nine

On the 8th of May 2023, one of our customers reported that a significant percentage of their users were connecting to calls but unable to see each other. The on-call engineer triaged the issue and realized it was limited to users connecting to one specific region among Dyte’s multiple regions across the globe. Our infrastructure team got involved at this point, as we had recently performed routine maintenance on unrelated systems in the same region. Further analysis of our logs confirmed that most of the affected users’ clients were hitting ICE state errors during their calls, and that every affected call was hosted on a single node from our fleet.

In the middle of an active incident, our highest priority is to prevent the issue from further affecting our users. We realized that the IP address advertised as part of ICE differed from the IP address that the media service had been assigned, which came as a surprise to us. The fastest path to recovery was restarting the service, as the affected node was stuck in an unexpected state. As soon as we completed the restart, we saw the error count for ICE state failures return to the normal range, and clients were successfully connecting to each other. Restarts are a valuable tool in an operator’s toolbox, but they don’t scale well and often cover up a deeper issue. With the issue mitigated, we donned our detective hats and started digging into what caused this particular incident.

Hold on, let's back up a bit!

Dyte’s calling features are powered by WebRTC, a standard that allows various devices, operating systems, and applications to interoperate for audio/video calling. The original design for WebRTC expected all users on a call to have direct network routes to each other. However, the internet is much more complex, and due to various reasons, most consumers of Dyte’s services do not have public IP addresses assigned to their devices. They are serviced by Network Address Translation (NAT) devices. To simplify the network discovery via Interactive Connectivity Establishment (ICE) in this scenario, Dyte chose to expose our media nodes on public IP addresses.

We use Kubernetes to host a majority of our application services, and for the media service, we ensure a 1:1 mapping exists between the service’s pod and a worker node. This guarantees a dedicated public IP address for each media service pod. When the media service pod starts up, it detects the associated public IP and sends it to every client that needs to connect to calls running on that pod.

In some highly regulated and firewalled networks, a server with a dedicated public IP address does not guarantee a successful WebRTC connection, so we need to relay traffic through TURN servers, which are also exposed over dedicated public IP addresses. Dyte’s clients connect to TURN servers very early in the life cycle of a call. To make this connection predictable and performant, we decided to use Elastic IP addresses (EIP) from AWS rather than relying on DNS-based approaches. On start-up, the TURN pod looks up the EIP provisioned for its region and associates that EIP with the instance on which the pod is running.

The dominoes that had to fall

Now that we’ve understood the various parts of the system, we can continue digging into the incident and make sense of what happened.

To recap, at the moment of the issue affecting our users, the IP address the media service advertised to clients differed from the address that was actually assigned to it. Restarting the service forced it to detect the new IP address it was available on, and clients could connect to it.

Earlier that day, we upgraded the Kubernetes version for a staging cluster in the same region. The upgrade required provisioning new worker nodes and redeploying applications so that they were scheduled on the new nodes. This included a redeploy of our TURN service and, with it, a run of the TURN start-up script. The script's job is fairly simple: find the network interface attached to the worker node the TURN pod is scheduled on, and tell AWS to associate the right EIP with that network interface. This time, it accidentally assigned the EIP allocated for TURN usage to a worker node in a different cluster, one that was running an active production media service pod.

The script had not been modified in a long time, but we had changed how our components authenticate with AWS’s services. As part of an unrelated effort to improve the compliance and security posture of our components, we disabled the older, less secure version of the Instance Metadata Service (IMDSv1) in favor of the newer, more secure IMDSv2. This caused a curl command that fetched the network interface’s ID in the start-up script to misbehave without reporting an error: curl received an error response but still exited with code 0. As a result, even though the start-up script was set to fail on error (set -e in bash), it continued executing with faulty data and assigned the EIP to the wrong worker node.
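To make the failure mode concrete, here is a minimal sketch of how such a look-up could be written against IMDSv2 so that a bad response stops the script. It is not our actual start-up script: the function names, the two-second timeout, and the 60-second token lifetime are assumptions chosen for illustration. The functions check each step explicitly instead of relying on set -e, because bash does not propagate errexit into command substitutions by default.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an IMDSv2-aware ENI look-up; names and timeouts are assumptions.

IMDS_BASE="http://169.254.169.254/latest"

# Succeed only if the argument looks like an ENI ID: "eni-" followed by hex digits.
is_valid_eni_id() {
  case "$1" in
    eni-*[!0-9a-f]*) return 1 ;;  # a non-hex character after the prefix
    eni-?*)          return 0 ;;  # prefix plus at least one hex digit
    *)               return 1 ;;  # empty string, error page, anything else
  esac
}

# Request a short-lived IMDSv2 session token. With --fail, curl exits with
# code 22 on an HTTP error instead of printing the error body and exiting 0.
imds_token() {
  curl --fail --silent --show-error --max-time 2 \
    -X PUT "${IMDS_BASE}/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

# Read one metadata path ($2) using the session token ($1).
imds_get() {
  curl --fail --silent --show-error --max-time 2 \
    -H "X-aws-ec2-metadata-token: $1" \
    "${IMDS_BASE}/meta-data/$2"
}

# Print the ENI ID of the primary interface, or return non-zero.
primary_eni_id() {
  local token mac eni
  # Assign separately from `local`, and check each step explicitly:
  # `local x=$(cmd)` would hide the exit status of cmd.
  token=$(imds_token) || return 1
  mac=$(imds_get "$token" "mac") || return 1
  eni=$(imds_get "$token" "network/interfaces/macs/${mac}/interface-id") || return 1
  if ! is_valid_eni_id "$eni"; then
    echo "unexpected ENI ID from IMDS: '${eni}'" >&2
    return 1
  fi
  printf '%s\n' "$eni"
}
```

A caller would then write something like eni=$(primary_eni_id) || exit 1 before touching any EIP. The format check is a second line of defence: even if a curl call were to exit 0 with an error page as its output, the script would stop before passing that output to AWS.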

The light bulb moments

This journey through our system included a few surprises:

When we associate an Elastic IP address with an instance that has an existing public IP address, the existing IP is disassociated and goes back to AWS’s pool of available addresses. From that point on, the instance is reachable only on the Elastic IP address, which becomes its new public IP address.
If curl receives an HTTP 4XX error (like an authorization failure), it still exits with code 0 by default. To coerce curl into returning a non-zero exit code, we should use --fail or --fail-with-body, adding --fail-early when a single invocation performs several transfers. Using these parameters in all our curl invocations is a no-brainer, and we should all just adopt them.
AWS CloudTrail was super helpful in piecing together the exact set of steps that led to this issue and defining a timeline. Security-focused engineers often use this tool, and infrastructure engineers should use it more often to better understand how their scripts interact with AWS APIs.
When troubleshooting AWS IAM policies, the Policy Simulator is excellent for understanding why a particular AWS API call is denied. Also, aws sts decode-authorization-message is a great companion in this process.
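For readers who have not used these tools from the command line, the CloudTrail and decode-authorization-message look-ups mentioned in this list might look roughly like the sketch below. The function names are hypothetical and these are not the exact commands we ran during the incident. The sketch assumes an AWS CLI with credentials that are allowed to read CloudTrail events and to call sts:DecodeAuthorizationMessage.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for post-incident digging; names are assumptions.

# List AssociateAddress calls recorded by CloudTrail since the given
# timestamp ($1, e.g. 2023-05-08T00:00:00Z), to see who moved an EIP and when.
recent_eip_associations() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AssociateAddress \
    --start-time "$1" \
    --max-items 50
}

# Decode the encoded authorization failure message ($1) that some AWS APIs
# attach to an access-denied error.
explain_denial() {
  aws sts decode-authorization-message \
    --encoded-message "$1" \
    --query DecodedMessage \
    --output text
}
```

The decoded message is a JSON document describing the denied action, the resource, and the request context, which is usually enough to tell which part of a policy needs attention.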

The fixes

We updated the IAM policy associated with the start-up script for the TURN service so that it only has access to EIPs within the same environment (development/staging/production). The policy also limits which network interfaces the EIP can be associated with by relying on some helpfully placed tags.
We verified that the curl version used in the pod is recent enough, added the --fail parameter to all invocations, and now check the return code of every curl invocation, in particular that it is not 22, the code curl returns with --fail when the server responds with an HTTP error.
We added alerting for a specific ICE state error in our logging system, as it was a strong signal for this incident and would also serve well for other error scenarios we might run into in the future.
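The curl-related fix can be sketched as a small wrapper. This is an illustration under stated assumptions, not our production code: the helper names are hypothetical, the minimum version shown is the curl release that introduced --fail-with-body (pick whatever your scripts actually need), and the version comparison relies on sort -V, which assumes GNU coreutils or a compatible sort.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for hardening curl calls; names are assumptions.

MIN_CURL_VERSION="7.76.0"  # curl release that added --fail-with-body

# True when dotted version $1 is greater than or equal to $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print the version number from the first line of `curl --version`.
curl_version() {
  curl --version | head -n 1 | cut -d ' ' -f 2
}

# Run curl with --fail and log why it failed; the exit code is passed through.
checked_curl() {
  local rc=0
  curl --fail --silent --show-error "$@" || rc=$?
  case "$rc" in
    0)  return 0 ;;
    22) echo "curl: server responded with an HTTP error (status >= 400)" >&2 ;;
    *)  echo "curl: transfer failed with exit code ${rc}" >&2 ;;
  esac
  return "$rc"
}
```

A start-up script could begin with version_ge "$(curl_version)" "$MIN_CURL_VERSION" || exit 1 and then route every request through checked_curl, so that an HTTP error stops the script rather than feeding an error page into the next command.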

And this brings us to the end of this short story of how we discovered and fixed a particularly pesky ICE state error. We are extremely sorry for the impact this incident had on our customers and users, and hope this post sheds light on a corner case of working in public cloud services.

If you have any thoughts or feedback, please get in touch with me on Mastodon or LinkedIn. Stay tuned for more related blog posts in the future!

If you haven't heard about Dyte yet, head over to dyte.io to learn how we are revolutionizing communication through our SDKs and libraries and how you can get started quickly on your 10,000 free minutes, which renew every month. You can reach us at support@dyte.io or ask our developer community if you have any questions.
