Monday, June 23, 2014

Benchmark(et)ing

Benchmarking and benchmarketing both have a purpose. Both also have a bad reputation. A frequently expressed opinion is that benchmark results are useless. I usually disagree. I don't mind benchmarketing and think it is a required part of product development, but I am not fond of benchmarketing disguised as benchmarking.

Benchmarketing is a common activity for many DBMS products whether they are closed or open source. Most products need new users to remain viable and marketing is part of that process. The goal of benchmarketing is to show that A is better than B. Either by accident or on purpose, good benchmarketing results focus on the message that A is better than B rather than A is better than B in this context. Note that the context can be critical and includes the hardware, the workload, whether both systems were properly configured and some attempt to explain why one system was faster.

I spend a lot of time running benchmarks. They are useful when the context for the result is explained and the reader has sufficient expertise. Many benchmark results don't explain the context and not everyone has the time and ability to understand the results. Thus many benchmark results are misunderstood and perhaps benchmarks deserve a bad reputation. Another problem is that benchmarks usually focus on peak performance while peak efficiency is more important in the real world. Our services have a finite demand and we want to improve quality of service and reduce cost while meeting that demand. But that is hard to measure in a benchmark result.

One obvious result that a benchmark can provide is the peak rate of performance that won't be exceeded in the real world. This is useful when doing high level capacity planning or when debugging performance problems in production. Narrow benchmark tests, like read-only with point-lookups, help to provide simple performance models to which more complex workloads can be mapped.
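
For example, the peak rate from a narrow test can feed a back-of-the-envelope capacity model. Below is a minimal sketch in Python with made-up numbers; the linear model ignores interference between reads and writes, so treat the estimate as optimistic.

  # Peak rates measured by narrow benchmarks (hypothetical numbers).
  peak_point_lookups_per_sec = 200_000   # read-only point-lookup test
  peak_writes_per_sec = 30_000           # write-only test

  # Expected production demand (also hypothetical).
  demand_reads_per_sec = 80_000
  demand_writes_per_sec = 10_000

  # Simple linear model: each class of work consumes a fraction of the peak.
  # This ignores read/write interference, so a real server saturates sooner.
  utilization = (demand_reads_per_sec / peak_point_lookups_per_sec +
                 demand_writes_per_sec / peak_writes_per_sec)

  print(f"estimated utilization: {utilization:.0%}")
  if utilization >= 1.0:
      print("demand exceeds peak rates -- more servers or tuning needed")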

Another lesson is that performance comparisons are less likely to be useful as the number of systems compared increases. It takes a lot of time and expertise to explain benchmark results and to confirm that best practices were used for all of the systems compared. In my experience results that compare more than two systems tend to be benchmarketing rather than benchmarking.

For a few years my tests were limited to MySQL with InnoDB, but recently I have run tests to compare different products including MySQL/InnoDB, WiredTiger, RocksDB, LevelDB, TokuDB, TokuMX and MongoDB. I am comfortable publishing results for new releases of MySQL & InnoDB and they can be compared to results I previously published. I am less willing to publish results that compare products from multiple vendors, especially when small vendors are involved. I don't want to be a jerk to a small vendor, and those comparisons take more time to evaluate.

I have some advice for people who run benchmarks even though I don't always follow all of it. Some of the advice takes a lot of time to follow.
  • Explain the results. This is the most important suggestion and I try to ignore results that are not explained. Why was one system faster than another (better algorithm, less code bloat, simple perf bug, etc)? Simple monitoring tools like vmstat and iostat are a start but you will eventually need to look at code. Run PMP to understand where threads are busy or waiting (see the sketch after this list). Run Linux perf to see what consumes CPU time.
  • Explain whether you made any attempt to properly configure the systems tested. The quality of a benchmark result is inversely related to the number of systems tested because it is less likely that the people doing the test have expertise in all of the systems. Publish the configuration files.
  • Explain whether you have expertise in using the benchmark client.
    • Explain the context for the test. The result can be described as A is faster than B in this context, so you need to explain that context.
    • What was the workload? 
    • What hardware was used (CPU description, #sockets, cores per socket, clock rate, amount of RAM, type of storage)? What rate can the storage sustain independent of the DBMS for both IOPs and MB/second?
    • What product versions were used? Comparing your beta versus their GA release that runs in production might be bogus. It can also be bogus to compare production systems with research systems that have no overhead from monitoring, error logging, optimization, parsing and other features required in the real world.
    • How was the test deployed? Did clients share the same server as the DBMS? If not, what was the network distance & throughput between them?
    • Were the file structures and storage devices aged? A b-tree fragments from random updates over time. Write-optimized databases and flash devices accumulate garbage that must be collected. Doing a sequential load into a b-tree or LSM and then immediately running tests means the file structure isn't in a steady state.
    • How full was the storage device? Flash and spinning disk perform very differently when full than when empty. 
    • Was buffered or direct IO used? If buffered IO was used, are you sure that...
      • a good value was used for filesystem readahead?
      • the right posix_fadvise calls were done to enable or disable readahead? (see the sketch after this list)
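
PMP is little more than gdb plus some text processing. Here is a minimal Python sketch of the idea, assuming gdb is installed and you are allowed to attach to the server process; the real tool is a small shell script and the parsing below is approximate.

  # Aggregate thread stacks PMP-style: identical stacks are counted so the
  # most common place where threads are busy or waiting floats to the top.
  import collections
  import re
  import subprocess
  import sys

  def sample_stacks(pid):
      out = subprocess.run(
          ["gdb", "-batch", "-ex", "set pagination 0",
           "-ex", "thread apply all bt", "-p", str(pid)],
          capture_output=True, text=True).stdout
      stacks, frames = [], []
      for line in out.splitlines():
          line = line.strip()
          if line.startswith("Thread "):
              if frames:
                  stacks.append(",".join(frames))
              frames = []
          else:
              # Frame lines look like "#0  0x... in func (...)" or "#1  func (...)".
              m = re.match(r"#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\S+)", line)
              if m:
                  frames.append(m.group(1))
      if frames:
          stacks.append(",".join(frames))
      return stacks

  if __name__ == "__main__":
      counts = collections.Counter(sample_stacks(int(sys.argv[1])))
      for stack, n in counts.most_common(10):
          print(n, stack)

Run it as "python3 pmp_sketch.py $(pidof mysqld)" while the benchmark is running, and take several samples because one sample can be misleading.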
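
For the readahead questions, a benchmark client or storage engine can set the advice per file when buffered IO is used. This is a minimal Python sketch; the file name is hypothetical and posix_fadvise is only available on Unix-like systems. With direct IO (O_DIRECT) the page cache is bypassed so kernel readahead does not apply, and device-level readahead can be inspected with blockdev --getra.

  import os

  path = "/data/bench/table.ibd"   # hypothetical data file
  fd = os.open(path, os.O_RDONLY)

  # Random point lookups: readahead wastes IO and cache, so turn it off.
  os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

  # Scan-heavy workload: aggressive readahead usually helps.
  # os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

  # ... run the read workload against fd ...
  os.close(fd)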