It’s been over 2 years since we made the switch from InfluxDB to TimescaleDB at DNSFilter. You can read the original blog for all the details on why we transitioned to TimescaleDB in the first place, but the main thing we were after was reliability. We’re still using TimescaleDB, and unsurprisingly we’ve made a lot of changes to our infrastructure since that original post as our total users have continued to grow. Over the last 2 years we’ve worked to optimize TimescaleDB performance, and we’ve done it all without clustering.
In 2018, I made the prediction that we could get to 3B queries per day without major structural changes to our setup. I wasn’t quite wrong, but I wasn’t totally right either.
After going roughly 18 months without issues (with daily queries steadily growing), we hit over 1.2B queries for the first time in October of 2019. It was the first sign that the hardware supporting TimescaleDB would have trouble getting to 3B daily queries: the server couldn’t keep up with Kafka, and the lag stretched to hours. We weren’t losing queries like we had previously with InfluxDB, but the disk I/O of our TimescaleDB server couldn’t keep pace with the write volume.
We tweaked the RAM usage in PostgreSQL and started to plan for a new approach.
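We haven’t published the exact values we landed on, but the tuning centered on the usual PostgreSQL memory knobs. The snippet below is a sketch of those knobs for a hypothetical dedicated 64 GB box, not our production config:

```ini
# Hypothetical postgresql.conf values for a dedicated 64 GB TimescaleDB box.
# These are illustrative starting points, not our actual settings.
shared_buffers = 16GB          # ~25% of RAM is the common starting point
effective_cache_size = 48GB    # ~75% of RAM; hints the planner about OS cache
work_mem = 64MB                # per-sort/hash memory; raise carefully
maintenance_work_mem = 2GB     # speeds up index builds and vacuums
max_wal_size = 8GB             # fewer forced checkpoints under heavy writes
```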
At the time, this was just a spike in daily queries, but we knew we were approaching the moment where we would have to sustain over 1B queries daily. To help us out, we looked at setting up a virtual machine with Digital Ocean.
The original intention wasn’t to replace that server, it was just to take the load off of it. We wound up setting up 2 additional servers with Digital Ocean.
In February, our original server hit another wall and by March the Digital Ocean servers were officially online. For a few months, we had 3 TimescaleDB servers running (all receiving the same information) until it became clear the original server was no longer necessary. It wasn’t carrying much of the load anymore, and it was not as performant as the Digital Ocean servers. At that point, it was just deadweight and extra costs, so we deprecated it in June 2020.
As of fall 2020, our daily requests have skyrocketed compared to where we were this time last year (even with that October spike in requests). To handle this sustained surge in requests, we’ve done something a little different: We now have one bare metal TimescaleDB server set up to handle just DNS requests. Our 2 Digital Ocean servers are currently still going strong, handling unique queries from our app and a portion of daily requests.
This change was prompted by the need to accommodate an integration partner, but the cool thing is that we’ve actually doubled my original projection with this move. That bare metal server now processes about 6B requests per day. Granted, it does not handle the full load the Digital Ocean servers do: it serves DevOps monitoring of our infrastructure and far fewer read queries than our primary Digital Ocean server. Even so, it processes a large number of requests daily at only 5% CPU utilization.
Meanwhile, the main Digital Ocean server rarely goes over 30% CPU utilization. Though as I’ll get into later, CPU isn’t the best metric for monitoring the health of these servers.
In the past, we were renting servers. But we didn’t see that being sustainable for our business long-term.
We saw 3 possible options, and we ran costs on all of them: continue renting servers, move to AWS, or colocate our own hardware. Colocation was the clear winner.
While you need to put money down upfront, the costs begin to go down month over month. After 3 years, our server costs will be half of what they would be if we continued renting—and nearly 1/17 of what a switch to AWS would be.
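The arithmetic behind that conclusion is simple total-cost-of-ownership math. The figures below are hypothetical placeholders chosen only to reproduce the ratios stated above (colocation ending up around half of renting and roughly 1/17 of AWS over 3 years); they are not our actual invoices:

```python
# Toy 3-year break-even comparison with HYPOTHETICAL monthly/upfront figures.
# Only the resulting ratios (~2x vs renting, ~17x vs AWS) reflect the post.

def three_year_cost(upfront, monthly):
    """Total cost of ownership over 36 months."""
    return upfront + 36 * monthly

rented = three_year_cost(upfront=0, monthly=4_000)       # keep renting
aws    = three_year_cost(upfront=0, monthly=34_000)      # includes bandwidth
colo   = three_year_cost(upfront=25_000, monthly=1_300)  # hardware + rack space

print(f"rent: ${rented:,}  aws: ${aws:,}  colo: ${colo:,}")
print(f"colo is {rented / colo:.1f}x cheaper than renting")
print(f"colo is {aws / colo:.0f}x cheaper than AWS")
```

The upfront hardware spend is what makes colocation look worse in month one and better every month after; the crossover against renting arrives well inside the first year with these placeholder numbers.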
While the monetary benefits are pretty obvious, the drive performance of the colocated server would also be 10 times better than that of the AWS servers. On top of that, our setup included 648TB/month of data transfer, which alone would have cost us $37,000 with AWS.
With the performance we’re currently seeing on our bare metal server (and the cost savings), we plan on migrating everything from Digital Ocean to bare metal. I just don’t see anything but colocation being able to meet our needs into the future.
Using colocated servers allows us to have the fastest, newest machines at a lower cost. I told NetActuate the specs I was looking for in a server, and then their team built everything, shipped it to the data center, and racked it for us.
It’s a happy medium. I don’t want DNSFilter to be in the business of running its own datacenter. Going this route still allows me to be hands-on while not having to worry about the actual hardware day-to-day.
The main challenge for me now is hiring DevOps staff to take over the ongoing infrastructure and performance management that I’ve been doing the majority of myself.
Before putting any hardware into production, I first use ServerScope to understand how performant that hardware can possibly be. It also flags anything that might indicate a defect, something that could limit the hardware’s life expectancy.
To monitor all of our servers, across both our Timescale database and our anycast network, we use Site24x7. Here, I’m able to check on KPIs, with the most important one in my day-to-day being Disk I/O.
The image above represents the Disk I/O of our main production server over Q3 2020. You can see it held steadily around 60 MB/sec for disk writes until the end of September. That’s around the time we had a huge uptick in daily queries, causing additional strain on the server. After seeing this, we decided to add our bare metal server to start handling a large portion of DNS queries. This took some of the load off of our primary server that is still handling user interface queries (in addition to a portion of DNS requests).
However, IOPS (Input/Output Operations Per Second) is arguably the best metric to look at, since databases do short bursts of access, in contrast to the sustained, uninterrupted transfers a file server might perform when writing information. When using ServerScope, IOPS is the most significant metric I check prior to putting a server in production: it’s valuable to know what IOPS that server is capable of handling. To get an idea of the capacity we’re looking for, one recent storage drive we put into production has a random read of 654k IOPS and random write of 210k IOPS.
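The relationship between the MB/sec numbers above and IOPS is just block-size arithmetic. A back-of-the-envelope sketch, assuming PostgreSQL’s default 8 KB page size (an assumption, since actual I/O sizes vary):

```python
# Rough throughput <-> IOPS conversion, assuming every operation moves one
# 8 KB page (PostgreSQL's default page size). Envelope math, not a benchmark.

PAGE_SIZE_KB = 8

def iops_from_throughput(mb_per_sec):
    """How many 8 KB operations/sec a given transfer rate corresponds to."""
    return int(mb_per_sec * 1024 / PAGE_SIZE_KB)

def throughput_from_iops(iops):
    """MB/sec if every operation moves one 8 KB page."""
    return iops * PAGE_SIZE_KB / 1024

# ~60 MB/sec of steady disk writes maps to roughly 7,680 page writes/sec...
print(iops_from_throughput(60))               # 7680
# ...while a drive rated at 210k random-write IOPS could in theory push
# ~1,640 MB/sec of 8 KB writes. That headroom matters because database
# access arrives in short bursts rather than one smooth stream.
print(int(throughput_from_iops(210_000)))     # 1640
```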
CPU utilization is useful to know, but really you just want to make sure that your server isn’t hitting over 80% for days or weeks at a time. That’s a sign it’s working too hard and it needs some help.
Another aspect we monitor is Kafka lag. What we saw in October 2019 (and later in March 2020) was TimescaleDB suddenly falling behind. In fact, it was hours behind. Once we saw the lag, we could investigate Disk I/O to see what the root cause of the problem was. The issue here was (again) that huge influx of new queries: TimescaleDB was suddenly asked to write far more data from Kafka than it was accustomed to and could not keep up. Without the ability to discover what caused the lag and then work to fix it, we would have fallen further and further behind.
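The lag number itself is simple: the broker’s latest offset minus the consumer group’s committed offset, summed across partitions. A minimal sketch with hard-coded stand-in offsets (in production these come from Kafka’s admin tooling, e.g. `kafka-consumer-groups.sh`):

```python
# Minimal consumer-lag calculation: messages written to the topic that the
# consumer (TimescaleDB's ingest pipeline) has not yet processed.
# Offsets below are hypothetical stand-ins, not real production values.

def consumer_lag(end_offsets, committed_offsets):
    """Total unprocessed messages across all partitions of a topic."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )

# Hypothetical snapshot for a 3-partition topic feeding TimescaleDB:
log_end   = {0: 1_500_000, 1: 1_480_000, 2: 1_510_000}
committed = {0: 1_200_000, 1: 1_190_000, 2: 1_205_000}

lag = consumer_lag(log_end, committed)
print(lag)  # 895000 messages behind; at ~50k writes/sec that's ~18s of lag
```

When that figure grows steadily instead of oscillating near zero, the consumer can’t keep up, and that’s the cue to go look at Disk I/O.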
We are still running on open source TimescaleDB after 2 years. It’s been able to handle all of the additional queries we’re now getting with our increase in users after the hardware changes I’ve talked about above. And on top of that, we haven’t used clustering at all.
With the announcement of TimescaleDB 2.0 and the option to use clustering for multi-node deployments in the open source version, we can continue to use open source TimescaleDB without an issue.
We’re still running a single node instance, but we do plan on implementing clusters at some point. We’ve tested clustering in the past, but there were bugs, so we never committed fully. We also haven’t used partitioning. If we try clustering again and we still run into issues or if it doesn’t allow us to scale the way we want to, partitioning is the next thing on the list for us to test.
One recent optimization we’ve made is to have chunks fit in RAM. This is a recommendation I discovered recently in TimescaleDB’s documentation around best practices.
For a long time, we had our chunk interval set as a single day (24 hours). When we weren’t even breaking 200M queries a few years ago, that was fine. But we’ve grown so much that each of our TimescaleDB servers now has its own chunk interval. Our primary server is set at 4 hours, and our dedicated query server (the one handling 6B queries daily) is set to 20 minutes. This is a recent change, so it will take some time to get a clear idea of how much it has benefited our infrastructure.
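The sizing logic behind those intervals can be sketched in a few lines. TimescaleDB’s guidance is that recent chunks (including indexes) should fit in memory, with roughly 25% of RAM as the budget; the row size, insert rate, and machine size below are hypothetical, not measurements from our servers:

```python
# Back-of-the-envelope chunk-interval sizing per TimescaleDB's guideline
# that recent chunks should fit in ~25% of RAM. All inputs are hypothetical;
# measure real chunk sizes with pg_total_relation_size() before tuning.

def chunk_interval_hours(ram_gb, rows_per_sec, bytes_per_row, ram_fraction=0.25):
    """Largest interval (in hours) whose chunk still fits the RAM budget."""
    budget_bytes = ram_gb * 1024**3 * ram_fraction
    bytes_per_hour = rows_per_sec * 3600 * bytes_per_row
    return budget_bytes / bytes_per_hour

# ~70k inserts/sec (about 6B/day) at ~150 bytes/row on a 128 GB box:
print(round(chunk_interval_hours(128, 70_000, 150), 2))  # 0.91 hours
```

Applying the result is a one-line SQL call, `SELECT set_chunk_time_interval('my_hypertable', INTERVAL '20 minutes');` (hypertable name hypothetical), which affects newly created chunks going forward.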
After 2 years of scaling TimescaleDB, I can make more accurate estimates and plan better. Now I know what type of stress 1B queries will put on our current servers and what changes need to be made to accommodate those new queries. We’ve also carefully mapped out our costs for the next 3 years, which will only benefit us when we run into the inevitable hiccup.
There is a lot more we plan on testing with TimescaleDB going forward, but we have a good idea of what the future of our server infrastructure will look like.