Our results: 10x better resource utilization, even with 30% more requests
In the early days of DNSFilter, we spent some cycles thinking towards the future scalability of the business. I’m an advocate of early-stage startups not over-optimizing, but we also needed to understand the unit-economics of our business. That meant understanding how performant we could be, along with what bandwidth, storage and reporting costs would be.
A large component of our architecture is logging our customers’ requests. We provide analytics to our customers. This means we need to account for each DNS request, because any lost request is lost data for our customers.
Back in late 2015 when we started the company, and into early 2016 as we began internally testing our MVP, we researched options in the market for storing and analyzing large sums of data. A new buzz-word at the time was ‘time series databases’. Our data seemed to fit the pattern, and I began working with the hot new startup InfluxDB.
After spending months experimenting, looking at their performance tests, implementing InfluxDB into query-processor, and setting up a fast and powerful server to run InfluxDB — There were still more hurdles. Unfortunately I wasted weeks of time on the learning curve of understanding in what ways InfluxDB differed from traditional SQL, limitations in their schema, query language, indexing, and more. I emailed influx, read forums, and github issues, and battled through issue after issue. But finally we had a working solution. For a while.
Unfortunately over time we ran into a number of issues with InfluxDB 1.3.
While in NYC, meeting up with one of our vendors, Packet.net, I searched through LinkedIn and saw that an old friend, Ajay Kulkarni was now in New York, and working on an IoT startup. As we reconnected, I learned that they were in the process of pivoting the business based on a component they’d developed to process time series data; and store it in PostgreSQL. As I shared some of my challenges with InfluxDB, they shared their status with releasing TimescaleDB. I left New York that week excited and rejuvenated that maybe there was another option that could fit our needs.
Switching was not easy. Not because of TimescaleDB — but because we’d chosen to integrate so tightly with InfluxDB. Our query-processor code was directly sending data to our InfluxDB instance. We explored the option of sending data to both InfluxDB and TimescaleDB directly, but in the end realized an architecture utilizing Kafka would be far superior.
After learning the ropes of Kafka, setting it up, coding query-processor to send data to Kafka, writing our own golang Kafka consumer to process the Kafka queue, and send data to both InfluxDB and TimescaleDB — we were finally ready to compare the solutions.
What a difference! From the get-go, we noticed marked improvements:
With upcoming GDPR regulations, it also seems TimescaleDB is better able to handle compliance as compared to InfluxDB. With InfluxDB, it is not possible to delete data, only to drop it based on a retention policy. TimescaleDB in comparison, offers a more efficient drop_chunks() function, but also allows for the deletion of individual records. This ability to delete particular records allows for compliance with deleting user data if requested.
The one area TimescaleDB natively lacked, was storage requirements. We saw that it was far less efficient as compared to InfluxDB. For us and our volume, this was not a big deal.
The current solution to this is filesystem level compression. The TimescaleDB folks recommend ZFS compression. After setting it up, we get about a 4x compression ratio on our production server, running a 2TB NVMe.
Here’s a snapshot of system CPU load with InfluxDB from October 2017, compared to February 2018 with TimescaleDB. Also keep in mind that daily processed requests increased 30% between these time periods, and TimescaleDB is run on a slightly slower server.¹
As you can see from the charts, even while processing 30% more requests, we’re using 96% less CPU, with far lower maximum load spikes.If this scales perfectly, and the NVMe can keep up, this means we could potentially handle over 3B requests/day (CPU-wise… We’ll need to add a 2TB NVMe every 1B). This is relevant and helpful for cost-estimating, as we recently quoted a potential customer who would send us a little over 3B requests a day.
The most surprising improvement in switching to TimescaleDB was in recovered development cycles. As myself and our backend developer were already very familiar with standard SQL, it has become faster to add new features, because it’s very clear-cut how to add columns, add indexes, and make the changes necessary to roll new features into production. We use a master-master architecture, with Kafka streaming to two production timescale instances. We can shift our API traffic between the instances when doing any database changes, so we can be sure we’re not impacting customers. Many PostgreSQL operations are non-blocking, so we wouldn’t even need to do this, but it’s also just great to have a hot failover.
We went from InfluxDB’s performance as a daily worry and development headache, to being able to treat TimescaleDB as a reliable component of our architecture which is a joy to work with.
I’m eternally thankful to the team at TimescaleDB for their help over the last year. Our business would not be the same today if it were not for our switch to TimescaleDB, which they helped make very painless. We’re doing 150M queries a day, and at less than 5% CPU usage. Our global anycast network currently has the capacity to process 20B queries a day if equally distributed. I feel confident we can scale to 3B queries a day or more without much change — and look forward to continuing to grow with TimescaleDB.
 These are two different dedicated servers with two different providers. Both have high performance NVMe drives.