Anycast Resolution Latency and Our Commitment to Transparency

Early today at 11:40 a.m. UTC, we detected degraded performance across the DNS2 anycast network. Our team immediately escalated the issue to our hosting provider and took action to implement a fix by 1:00 p.m. UTC. Performance was fully restored by 1:44 p.m. UTC, and our team continued to monitor the situation. You can review the updates on our status page.

In the interest of transparency, I wanted to write this article to detail exactly what we experienced and give our customers additional information about this somewhat unique incident.

The complete incident details

At 11:49 a.m. UTC we detected degraded performance on part of our DNS2 anycast network. One of our hosting providers had stopped announcing our secondary prefixes, pushing the majority of DNS2 traffic onto our DNS1 anycast network, which was the initial cause of the degradation.
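
For readers who want to see what a withdrawal like that looks like from the outside, here is a minimal sketch of an external check against RIPEstat's public routing-status endpoint, which reports how widely a prefix is currently seen in the global routing table. The prefix shown is a documentation placeholder, not one of our production anycast prefixes, and the exact response fields may vary.

```python
# A sketch of an external visibility check, using RIPEstat's public
# routing-status endpoint. The prefix below is a documentation placeholder
# (TEST-NET-3), not one of our production anycast prefixes, and the exact
# fields in the response may differ from what this prints.
import json
import urllib.request

PREFIX = "203.0.113.0/24"  # placeholder; substitute the prefix you want to watch
URL = f"https://stat.ripe.net/data/routing-status/data.json?resource={PREFIX}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    payload = json.load(resp)

# How many route collectors currently see the prefix; a drop toward zero means
# the announcement has effectively been withdrawn and traffic will drain elsewhere.
visibility = payload.get("data", {}).get("visibility", {})
print(f"Routing visibility for {PREFIX}:")
print(json.dumps(visibility, indent=2))
```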

During the shift from DNS2 to DNS1, much of that traffic landed on nodes in Copenhagen, Prague, Marseille, and Stockholm. Those nodes could not absorb the entire surge from DNS2, so traffic was rerouted again to Sydney and Miami. While this failover mechanism kept DNS resolution working for our customers, it also added latency, primarily for users in the central and eastern US. DNS resolution times peaked at roughly 300ms (3/10 of a second), though the average response time in that window was 11ms.
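
To make those numbers concrete, here is a minimal sketch of the kind of probe that surfaces elevated resolution latency: it times repeated lookups of a test name against a set of resolver addresses. It uses the dnspython library, and the resolver IPs and query name are placeholders rather than our production anycast addresses.

```python
# A sketch of the kind of latency probe that surfaces resolution times like the
# ones above (~300 ms at the peak vs. an ~11 ms average). Requires dnspython
# (pip install dnspython); the resolver IPs and query name are placeholders,
# not our production anycast addresses.
import statistics
import time

import dns.exception
import dns.resolver  # third-party: dnspython

RESOLVERS = ["203.0.113.53", "198.51.100.53"]  # placeholder anycast addresses
QNAME = "example.com"
SAMPLES = 20

for ip in RESOLVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    timings_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        try:
            resolver.resolve(QNAME, "A", lifetime=2.0)
        except dns.exception.DNSException:
            continue  # count only successful lookups in this simple sketch
        timings_ms.append((time.perf_counter() - start) * 1000)
    if timings_ms:
        print(f"{ip}: avg={statistics.mean(timings_ms):.1f} ms, "
              f"max={max(timings_ms):.1f} ms")
    else:
        print(f"{ip}: no successful responses")
```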

Since we use our own service internally, we also experienced this incident firsthand. While you might not have noticed the impact if you were browsing a news site at the time, sites that load many dynamic resources may have felt slow because of the knock-on effects of slower resolution.

Because we saw this incident occur in real time, we immediately escalated the issue to our provider and collaborated to resolve the problem. Our hosting provider is also conducting a further root cause analysis (RCA) to understand what led to the routing interruption of our secondary prefixes.

Our fully redundant architecture allowed DNS resolution to continue, despite the increased resolution latency.

Changes we’re making

As mentioned above, we are still investigating this incident with our hosting provider. One area we want to improve is decreasing the MTTR (mean time to recovery) for these types of situations, so that we resolve issues like this significantly faster even when the impact is low.

We are also reviewing internal processes and how we’ve structured our architecture to determine what changes we can make to reduce the impact surface area if an anycast node goes down.

When we built our anycast network, we purposefully created two parallel BGP networks so that if one network had any failures or latencies, the other network would pick up the slack. In one way, this incident was a testament to the success of that strategy; in another, it will allow us to build further improvements to account for the infinite landscape of problems that come with running a complex global anycast network.
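
As a simplified, client-side illustration of that parallel-network idea, the sketch below prefers one resolver address and falls back to a second on timeout. The real redundancy described above lives at the BGP/anycast layer rather than in client code, and the addresses here are placeholders, but it shows the same principle: if one path fails, the other picks up the query.

```python
# A simplified, client-side illustration of the two-network idea: prefer one
# resolver address and fall back to the other on timeout. The real redundancy
# described above lives at the BGP/anycast layer, not in client code, and the
# IPs here are placeholders. Requires dnspython.
import dns.exception
import dns.resolver  # third-party: dnspython

PRIMARY = "203.0.113.53"     # stand-in for a DNS1 anycast address
SECONDARY = "198.51.100.53"  # stand-in for a DNS2 anycast address

def resolve_with_fallback(qname: str, rdtype: str = "A"):
    """Try the primary resolver network first, then the secondary."""
    for ip in (PRIMARY, SECONDARY):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            return resolver.resolve(qname, rdtype, lifetime=2.0)
        except dns.exception.Timeout:
            continue  # this network is slow or unreachable; try the other one
    raise RuntimeError(f"Both resolver networks timed out for {qname}")

if __name__ == "__main__":
    answer = resolve_with_fallback("example.com")
    print([rr.to_text() for rr in answer])
```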

I keep saying transparency

I often compare the service we provide to oxygen. If we’re controlling the oxygen flow for other companies like ours, we need all of the gauges to report accurately and every tank has to be filled.

Providing our customers with a reliable, high-performance service remains a core value of ours. We know that we are an integral part of your technology stack—one that you need to simply work. That’s why we take incidents like this very seriously.

But I also recognize the need to share information when things like this occur. I’m a software user, too. I get impacted by incidents, too. As a technical user, I want answers to why these things occur. That is what we strive to do here: Be honest and responsive when incidents of this type do occur.

We are committed to our customers beyond the product itself. Each of you has chosen to partner with DNSFilter as your DNS resolution and filtering provider, deploying security to your organization via DNS through us. Thank you for choosing us, and we will continue to work hard to ensure that oxygen levels are at full capacity. And if the readings are ever off, we will always let you know.

 

Visit DNSFilter’s status page for details on this incident.
