Over the last week or so I’ve been helping Steve from dataloop.io with some ranking and benchmarking of different time series databases. Of course I’m biased on that topic given I’ve written DalmatinerDB and believe it’s awesome, but I’ve tried my best to be objective and scientific in the approach.
One of the more interesting parts is the benchmarks. As part of filling in the ranking sheet I’ve read and analyzed about a dozen different benchmarks. They ranged from good ones that I would call scientific to bollocks marketing claims.
The truth and nothing but the truth
The truth is, all benchmarks lie, and that includes the ones I’ve made for DalmatinerDB. Not necessarily intentionally, don’t get me wrong (even though some are pure marketing). It’s simply that workloads differ so drastically across real production systems that it’s close to impossible to design a benchmark that covers them all.
The lie of omission
The most common problem with benchmarks is missing information. Personally I believe that a benchmark that does not publish steps detailed enough to be reproduced by anyone willing to try is pretty much worthless.
Sadly, that means that to give users a fair comparison, benchmarks need to be run on cloud hardware. That degrades performance, and it is something most companies publishing benchmarks don’t want. Of course running on a dedicated $120k server will be faster than Amazon/Google/Joyent, but how would a mere mortal be able to verify those results? I’ll just put a number here: if the total cost of the benchmark is over $100, it’s beyond the reach of someone wanting to verify the numbers. And that means they can just be made up. This differs slightly for companies with a big budget.
Worse yet, most benchmarks don’t specify hardware at all. This leaves you in the dark as to what investment would be required to achieve those numbers. It could have been run on the developer’s 5-year-old notebook or on the above-mentioned super server. You will never know.
Another often-omitted detail is the load profile. There is a huge difference there! The load profile a benchmark was run with might not fit your use case at all.
The same goes for application settings. Without the full settings it’s hard, if not impossible, to evaluate whether they are sensible for a production environment.
The lie of batching
Sadly, it’s too common to use batch workloads without being clear about it. That’s fine for long-term analytics, cases where you want to load tons of historical data and then process it. However, it’s an entirely useless number when doing something close to real time. There is no way you want to wait a week before pushing the metrics of your production system just to watch its load.
Why are batch loads so evil? Because it’s very easy to optimize for them. Let’s take Postgres as an example. I’m picking it because it’s somewhat of a non-competitor in these benchmarks and it illustrates the problem well. It’s blazingly fast at loading batch data compared to handling the frequent but small writes a near-real-time load generates.
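To make that concrete, here is a minimal sketch of the gap, assuming a hypothetical metrics table and connection string and using psycopg2 purely for illustration; the exact numbers will vary from setup to setup, but the shape of the difference won’t.

```python
# Sketch: batch loading vs. near-real-time inserts into Postgres.
# Assumes a hypothetical table:
#   CREATE TABLE metrics (ts bigint, name text, value double precision);
import time
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=bench")  # placeholder connection string
cur = conn.cursor()

points = [(i, "cpu.load", 0.5) for i in range(100_000)]

# Batch style: many rows in one round trip -- what batch benchmarks measure.
start = time.time()
execute_values(cur, "INSERT INTO metrics VALUES %s", points)
conn.commit()
print("batch load of 100k points:", time.time() - start)

# Near-real-time style: one small insert and commit per incoming point.
start = time.time()
for p in points[:1000]:  # only a slice; this path is dramatically slower
    cur.execute("INSERT INTO metrics VALUES (%s, %s, %s)", p)
    conn.commit()
print("per-point load of 1k points:", time.time() - start)
```

The first path is what a batch benchmark measures; the second is much closer to what a monitoring system actually does all day.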
I mentioned this before with the load profiles; this really is a big issue.
Personally I like haggar. It generates metrics for a set number of servers at a given frequency, which comes close to real workloads in my eyes – but still only close.
The lie of time and space
This is a tricky one! Most benchmarks are run over a rather short time, I assume mostly due to resource constraints (both time- and money-wise). However, this does not reflect a production system well. Even the DalmatinerDB benchmarks start with a near-empty database. Yet most metric systems will run for months if not years at a stretch.
During some benchmarks (not DalmatinerDB) I’ve started to see serious performance degradation even when only a month of data was in the database. Benchmarks that simulate an hour or a day of data hide this problem entirely.
Size goes into the same category. Just because a system works well with 100MB of data doesn’t mean it works well with 100TB of data or more. Many benchmarks are run against absurdly small datasets.
The same goes for query performance. Reading 10 seconds from one metric is very different from reading a day from 10 metrics, or 100, or 200, or a week’s worth of data.
The average lie
Even when everything above is done nicely and correctly, this one can still bite you. Too many benchmarks report averages. For some things I believe the average is the correct aggregation (data ingress with a small enough aggregation window, for example), but in most cases it’s really a bad one.
Query times especially fall into this category. The average query time over X queries doesn’t really tell you much. It could still mean that some requests take absurdly long, especially for systems with a high spread.
Personally I prefer the 95th or 99th percentile, depending on how important the query speed is. For the query benchmark we only had min, mean and max to begin with; those came with the query benchmark tool InfluxDB used for their publications, and the results looked rather nice. However, after adding the 95th and 99th percentiles it suddenly looked very different.
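As a small illustration of why, here is a sketch with made-up latency numbers where roughly one request in ten is slow; the mean still looks moderate while the percentiles give the game away.

```python
# Sketch: why min/mean/max can hide a slow tail that p95/p99 expose.
# The latency numbers below are made up purely for illustration.
import statistics

latencies_ms = [12, 14, 13, 15, 11, 16, 14, 13, 12, 15] * 9 + \
               [480, 510, 450, 530, 470, 490, 505, 520, 460, 515]

def percentile(values, p):
    """Nearest-rank percentile; good enough for a benchmark report."""
    ordered = sorted(values)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

print("min :", min(latencies_ms))
print("mean:", round(statistics.mean(latencies_ms), 1))
print("max :", max(latencies_ms))
print("p95 :", percentile(latencies_ms, 95))
print("p99 :", percentile(latencies_ms, 99))
```

The mean alone suggests a somewhat slow but usable system; the p95 and p99 show that one query in ten is an order of magnitude slower than the rest.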
The one lie
Here is another one: concurrency. This heavily depends on the use case again, yet many benchmarks are run with a very low concurrency. There are some cases where that reflects reality, but more often than not it doesn’t.
Let’s take the query benchmarks as an example again. The default setting is to run with a concurrency of 1. That might or might not be the use case you have in mind, but without providing results for higher concurrencies it’s impossible to tell how the system behaves.
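Measuring this doesn’t have to be fancy. Here is a sketch of running the same workload at different concurrency levels; run_query() is a hypothetical stand-in for whatever client issues a single query against the system under test.

```python
# Sketch: the same query workload at concurrency 1, 10 and 50.
# run_query() is a placeholder; swap in a real query for the system under test.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(i):
    start = time.time()
    time.sleep(0.05)          # pretend the query took ~50 ms
    return time.time() - start

def bench(concurrency, total_queries=500):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(run_query, range(total_queries)))
    elapsed = time.time() - start
    print(f"concurrency={concurrency:3d}  "
          f"throughput={total_queries / elapsed:7.1f} q/s  "
          f"worst latency={max(latencies) * 1000:.0f} ms")

for clients in (1, 10, 50):
    bench(clients)
```

With a real backend the interesting part is how throughput and tail latency move as the client count rises.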
To be fair, as of writing this, running the benchmark with a higher concurrency is still on the to-do list. So, as I said before, we all lie…
The lie that wasn’t even told
I’ll keep this short: very few benchmarks are published that aren’t favorable for the author. I know I have done that. Partially because I don’t want to look bad, partially because bad results can have good reasons and explaining them is a lot harder. It gets worse as few people are willing to sit through the explanation to understand them.
But why?!?
This is a bit of a theoretical part. I can’t see into other people’s minds (or would never admit it?). So all I can do is talk about my own experience and make an educated guess.
One factor is time. Making good benchmarks takes a lot of time. A hell of a lot of time! And knowing that they are never universally true kind of makes it feel like a futile effort.
Another factor is resources. Not every project is backed by a VC or has near-unlimited resources like a huge company. So we have to get by with what we can scrape together, which is often less than optimal.
Marketing and malice. Now that might be a bit harsh, but I think it’s true. Intentionally misleading people is evil, and marketing sometimes uses benchmarks to do that. I call that malice, and honestly, shame on those people. This seems to be a much more frequent practice when there is lots of money involved.
Not caring. That’s a fair reason! If benchmarks are not published for evaluating a project but just out of the happy heart of a developer, they might not live up to semi-scientific standards. I think that’s totally OK.
Lack of knowledge. Yep, that happens! I have done that and I’m pretty sure I still do; sometimes we just don’t know better.
What can we do
For those of us creating benchmarks, I’d say: be honest, that’s the key. I think everything we do can be valid if we are upfront about it and explain the reasoning.
Let’s take average vs. percentiles as an example (again …). I said earlier that I prefer percentiles but am OK with averages for ingress. That’s a bit of a contradiction, right?
So let me be upfront about why. For ingress I think the exact value at any point in time doesn’t matter as much as the average over the lifeline interval we want. If we demand 15s lifelines of our system, it doesn’t matter much whether the ingress is 1k metrics at t0 and 5k at t15, or whether we say the average is 3k and we’re fine with that number. On the other hand, response times to queries can heavily impact UX, so having 5% or 10% of them be 10x slower will be noticed severely.
As readers of a benchmark we can question it, asking the hard question of “why was it done this way?”. And if it’s not reproducible or detailed enough, perhaps just ignore it.
In the end the only way to look at benchmarks, if they really matter to you, is to model them yourself. Take your use case and see how different systems perform. That’s a lot of work, so yes, it’d be nice to have a starting point, but in the end it’s your responsibility to verify whether it works for you or not.
What’s next
I think it’d be really nice to start having some basic procedure for this kind of benchmark. And here I’m a bit biased towards some systems, since I’m mostly interested in close-to-real-time metrics, not batch jobs.
Haggar seems to be a good starting point for ingress benchmarks. It models a mostly realistic load (# of servers * # of metrics) and slowly ramps up the numbers, allowing you to observe the progression.
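For readers who haven’t used it, the load model looks roughly like the sketch below. This is not haggar itself, just an assumed approximation of its behavior; the endpoint, metric names and all numbers are placeholders.

```python
# Rough sketch of a haggar-style load model (not haggar itself): N agents,
# each flushing M metrics at a fixed interval, with agents ramped up over time.
# Endpoint, metric names and all numbers below are placeholder assumptions.
import random
import socket
import time

CARBON_ADDR = ("metrics.example.com", 2003)  # hypothetical Graphite/Carbon endpoint
METRICS_PER_AGENT = 100
FLUSH_INTERVAL = 10     # seconds between flushes
SPAWN_INTERVAL = 30     # seconds before another simulated server is added
MAX_AGENTS = 50

def flush(sock, agent_id, now):
    """Send one flush worth of metrics for one agent in Graphite line protocol."""
    lines = (
        f"bench.agent{agent_id}.metric{m} {random.random():.4f} {now}\n"
        for m in range(METRICS_PER_AGENT)
    )
    sock.sendall("".join(lines).encode())

sock = socket.create_connection(CARBON_ADDR)
agents, last_spawn = 1, time.time()
while True:
    now = int(time.time())
    for a in range(agents):
        flush(sock, a, now)
    if agents < MAX_AGENTS and time.time() - last_spawn >= SPAWN_INTERVAL:
        agents += 1                 # ramp up: one more simulated server
        last_spawn = time.time()
    time.sleep(FLUSH_INTERVAL)
```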
Influx’s benchmarking tool seems good for query benchmarks. On the other hand, I’m not convinced about the ingress benchmarks as they measure batch loading only. They still work well as a data generator for the query benchmarks. It also desperately needs percentiles.
I think all benchmarks should be run on a public cloud. I’m not partial to which one. It’s about reproducibility: if you just want to quickly verify, it’s affordable. The DalmatinerDB ingress benchmarks were run on the biggest IO box the JPC offers, and even then it was only $40 to set it all up, run it a few times, work out the kinks and grab the data. Reproducing them could be done for half that.
It might be a bit much to ask in some cases, but having benchmarks with different settings would be interesting. Let’s go back to my favorite query benchmarks: it’d be nice to have the results for a concurrency of 1, 10 and 50 clients, wouldn’t it?
I’m really open to other ideas and thoughts so if you have something I’ve missed feel free to leave a comment.