Tuesday, September 20, 2016

MyRocks and InnoDB with large objects and compression

I ran tests to explain the difference between MyRocks and InnoDB when storing large objects and data with varying amounts of compressibility.

Compression in MyRocks is simpler than in InnoDB. You should expect the database to use about 1.1X the size of the compressed output. When rows are 10kb before compression and compress to 60% of their original size, each row should use about 6.6kb in the database. The 1.1X adjustment is for space-amplification from leveled compaction.
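
The arithmetic is simple enough to sketch. This is only the rule of thumb from above expressed as a hypothetical helper, not a measured model:

  # Rough estimate of MyRocks bytes per row on disk: compressed size times the
  # space-amplification factor from leveled compaction (assumed to be 1.1X).
  def myrocks_bytes_per_row(row_bytes, compressed_fraction, space_amp=1.1):
      return row_bytes * compressed_fraction * space_amp

  # 10kb rows that compress to 60% of their original size -> about 6.6kb each
  print(myrocks_bytes_per_row(10 * 1024, 0.60) / 1024)  # ~6.6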

Predicting the space used by InnoDB is harder. First, large LOB columns are not stored inline and overflow pages are not shared. Second, disk pages have a fixed size and you risk using too much space or getting too many page splits when searching for a good value for key_block_size. More details are here.
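
To illustrate the first problem, here is a back-of-envelope sketch. It is a simplification, not how InnoDB actually allocates pages; it only shows that when overflow pages are not shared, each off-page value is rounded up to a whole number of pages, so a value that barely spills past a page boundary wastes most of an extra page:

  # Simplified estimate of overflow pages consumed by one off-page column
  # value, assuming each value gets its own chain of overflow pages.
  def overflow_pages(col_bytes, page_size=16 * 1024):
      return (col_bytes + page_size - 1) // page_size

  # A 1,000 byte value still uses a full page; a 17,000 byte value uses 2
  print(overflow_pages(1000), overflow_pages(17000))  # 1 2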

I ran two tests for two types of data. The first test is an insert only workload in PK-order for data with varying amounts of compressibility. The second test determined how fast point queries could be done on that data while rate-limited inserts were in progress. By varying amounts of compressibility I mean that there was one large varchar column per row and that 20%, 45%, 75% or 95% of the data in the column was random and the remainder was constant and easily compressed. Both tests used one connection for inserts. The query test also used one connection for queries.

The test pattern was run twice. In both cases the large column was a varchar. In the first case it had a length between 10,000 and 20,000 characters. In the second case it had a length between 100 and 1000 characters. The database block size was 16kb for MyRocks and InnoDB.
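
This sketch shows how a value with a given percentage of random (incompressible) data can be generated. The sizes mirror the test description, but the function is hypothetical and is not the benchmark code:

  import random
  import string

  # Build a value where pct_random percent of the characters are random and
  # the remainder is a constant, easily compressed filler.
  def make_value(length, pct_random):
      n_random = length * pct_random // 100
      rand_part = ''.join(random.choice(string.ascii_letters) for _ in range(n_random))
      return rand_part + 'x' * (length - n_random)

  # One row for the first test (column between 10,000 and 20,000 characters)
  # with 45% random data
  value = make_value(random.randint(10000, 20000), 45)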

Insert only


For the insert-only workload the space used by MyRocks can be predicted from the compressibility of the data. That is much less true for InnoDB. For example, compressed InnoDB uses about the same amount of space for pctRand values of 20, 45 and 75.

MyRocks used the least amount of space. InnoDB used much more space when the column was larger (10,000 to 20,000 vs 100 to 1000). Overflow pages are the root cause.

Insert rates were better for MyRocks than for InnoDB. They were also stable for MyRocks and uncompressed InnoDB, independent of the compressibility. Rates for uncompressed InnoDB were better than for compressed InnoDB. While this wasn't a performance benchmark, it matches many other results I get: it is hard to get both performance and compression from InnoDB. The CPU overhead per insert was similar for MyRocks and uncompressed InnoDB, and mostly larger for compressed InnoDB.

Legend for the data:
  • ips - inserts per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per insert
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

column up to 20,000       column up to 1000
ips     size    Mcpu      ips     size    Mcpu    pctRand engine
5489      7.7    1090     34468   11      151     20      rx.zlib-6
5540     16      1127     34824   19      149     45
5532     24      1307     34517   27      166     75
5523     30      1467     34701   33      160     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3995     87       933     23470   66      173     20      i6n
3981     87       928     23704   66      174     45
3981     86       917     23487   66      175     75
3995     88       914     23658   66      176     95

ips     size    Mcpu      ips     size    Mcpu    pctRand engine
3339     36      1064     13429   33      262     20      i6c
2779     32      1278     13124   33      271     45
2133     35      1750      8767   30      392     75
1757     50      2061      7228   38      461     95

Point queries


MyRocks provides the best compression, the best query throughput, and the least CPU overhead per query. My conclusions for InnoDB space consumption are similar to those from the insert-only workload.

Legend for the data:
  • qps - queries per second
  • size - database size in GB at test end
  • Mcpu - microseconds of CPU per query
  • pctRand - percentage of random data in large column
  • engine - rx.zlib-6 is MyRocks with zlib level 6 compression. i6n is InnoDB in MySQL 5.6.26 without compression. i6c is InnoDB in MySQL 5.6.26 with compression.

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 984      9.3    4308     2214    11      1585    20      rx.zlib-6
 910     19      4532     2113    19      1627    45
 846     30      4952     2102    27      1601    75
 795     37      5598     2051    33      1691    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 628    113      6240     1302    62      2527    20      i6n
 624    110      6226     1300    63      2501    45
 624    114      6312     1302    63      2536    75
 628    115      6218     1305    66      2474    95

qps     size    Mcpu      qps     size    Mcpu    pctRand engine
 708     38      5560      770    34      4450    20      i6c
 629     39      6643      687    34      4895    45
 513     44      8494      589    30      6046    75
 418     57     10619      576    39      6599    95


Thursday, September 15, 2016

Peak benchmarketing season for MySQL

Maybe this is my XKCD week. With Oracle Open World and Percona Live Amsterdam we are approaching peak benchmarketing season for MySQL. I still remember when MySQL 4.0 was limited to about 10k QPS on 4 and 8 core servers back around 2005, so the 1M QPS results we see today are a reminder of the great progress that has been made thanks to investments by upstream and the community.

In General


But getting 1.5M QPS today compared to 1M QPS last year isn't at the top of the list for many (potential) users of MySQL. I use performance, usability, manageability, availability and efficiency to explain what matters for web-scale DBMS users. My joke is that each of these makes a different group happy: performance -> marketing, usability -> developers, manageability -> operations, availability -> end users, efficiency -> management.

The benchmarketing results mostly focus on performance. Whether InnoDB does a bit more QPS than Amazon Aurora isn't going to make Aurora less popular. Aurora might have excellent performance but I assume people are deploying it for other reasons. I hope we make it easier to market usability, manageability, availability and efficiency in the MySQL community. MongoDB has gone a long way by marketing and then delivering usability and manageability.

Even when limited to performance we need to share more than peak QPS. Efficiency and quality-of-service (QoS) are equally important. QPS without regard to response time is frequently a bogus metric. I can get more IOPs from a disk by using a queue depth that is too large, but more IOPs at the cost of 100 millisecond disk read response times is an expensive compromise. Even when great QPS is accompanied by a good average response time, I want to know whether there is lousy QoS from frequent stalls leading to lousy 99th percentile response times. Percona has built their business in part by being excellent at documenting and reducing stalls in InnoDB that occur on benchmarks and real workloads.

I have been guilty of sharing too many benchmark reports in the past that ignored efficiency and QoS. I have been trying to change that this year and hope that other providers of MySQL performance results do the same. This is an example of a result that includes performance, efficiency and QoS.

MyRocks and RocksDB


A lot of the RocksDB marketing message has been about performance. Database access is faster with an embedded database than client/server because you avoid network latency. The MyRocks message has been about efficiency. The target has been better compression and less write amplification than InnoDB so you can use less SSD and lower-endurance SSD. For a workload I care about we see 2X better compression and 1/10 the write rate to storage. This is a big deal.

When starting the project we had many discussions about the amount of performance loss (reduced QPS, higher response time) we could tolerate to get more efficiency. While we were vague, the initial goal was to get QPS and response time similar to InnoDB for real workloads, and we were willing to accept some regressions. It turned out that there was no regression, and similar performance with much better efficiency is a big deal.

But benchmarks aren't real workloads and there will soon be more benchmark results. Some of these will repeat what I have claimed, others will not. I don't expect to respond to every result that doesn't match my expectations. I will consult when possible.

One last disclaimer. If you care about read-mostly/in-memory workloads then InnoDB is probably an excellent choice. MyRocks can still be faster than InnoDB for in-memory workloads. That is more likely when the bottleneck for InnoDB is page write-back performance. So write-heavy/in-memory can still be a winner for MyRocks.

Seriously, this is the last disclaimer. While we are bickering about benchmark results others are focusing on usability and manageability and getting all of the new deployments.

Wednesday, September 14, 2016

Excited about Percona Live Amsterdam

I look forward to attending and speaking at Percona Live Amsterdam. There are presentations and people with expertise on MySQL, MongoDB and PostgreSQL. This is a great chance to learn from people with different experience and different expertise. I hope you make the most of it.

A great way to learn is to ask questions like:
  • Why are you using X?
  • What problems are you trying to solve?

A great way to end the conversation early and avoid learning anything is to ask questions like:
  • Why aren't you using Y? 
  • You should be using Z! (this isn't a question)



zlib vs zstd for MyRocks running Linkbench

I used an IO-heavy configuration to determine the impact of zstandard vs zlib compression for MyRocks. There was about 1 read from SSD per transaction, and decompression is done for every page read, whether the page comes from the OS page cache or from storage.

The results are impressive. Zstandard compresses like zlib level 1 but uses much less CPU.
  • zstandard reduces CPU by 45% vs zlib level 1 for the load test
  • zstandard reduces CPU by 11% vs zlib level 1 for the query test
  • zstandard gets 8% more TPS vs zlib level 1 for the query test

Configuration


Configuration for MyRocks is still complex. The templates for the MyRocks my.cnf files for Linkbench and general usage are explained on the wiki. I used no compression for L0, L1 and L2, lz4 for the middle levels, and one of zlib level 1, zlib level 6 or zstd for the max level. The tests used an Aug 5 build of MyRocks, so zstd here means kZSTDNotFinalCompression because the build preceded the 1.0 release of zstandard.
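
As a sketch, this is what that per-level layout looks like for a hypothetical 7-level LSM tree, using RocksDB's compression type names. Treat the option string and the way it is fed to MyRocks as assumptions and use the wiki templates for the real settings:

  # No compression for L0-L2, lz4 for the middle levels, and a stronger codec
  # for the max level (kZlibCompression, or kZSTDNotFinalCompression for zstd
  # with the Aug 5 build). The zlib level (1 vs 6) is set separately.
  num_levels = 7
  max_level = "kZSTDNotFinalCompression"  # or "kZlibCompression"

  per_level = (["kNoCompression"] * 3
               + ["kLZ4Compression"] * (num_levels - 4)
               + [max_level])
  print("compression_per_level=" + ":".join(per_level))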

The test host has 50G of RAM available to userland, fast storage (5TB of NVMe MLC) and 24 CPU cores with 48 HW threads. The RocksDB block cache was set to 10G, the binlog was enabled, and sync-on-commit was disabled for both the binlog and RocksDB. Linkbench is run with maxid1=1B, the load test uses 2 clients and the query tests use 16 clients. Query tests are run as 24 1-hour loops and I report metrics from the 24th hour. I used my branch of linkbench and support scripts.

Results

The results for zstandard are impressive. I look forward to using this in production. Thanks Yann.

Legend:
  • ips/tps - inserts & transactions per second
  • r/i, r/t - iostat reads per insert and per transaction
  • rKB/i, rKB/t - iostat KB read per insert and per transaction
  • wKB/i, wKB/t - iostat KB written per insert and per transaction
  • Mcpu/i, Mcpu/t - usecs of CPU time per insert and per transaction
  • size - database size in GB
  • rss - mysqld RSS size in GB
  • un, gn, ul, gl - p99 response time in milliseconds for the most frequent transactions (Update Node, Get Node, Update Link, Get Link List)

Results for the load

ips     r/i     rKB/i   wKB/i   Mcpu/i  size    rss     engine
61543   0       0       0.98     81     324     3.1     zstd
61504   0       0       0.98    146     331     2.0     zlib-1
61457   0       0       0.97    153     312     2.2     zlib-6

Results for the 24th hour of the query test

tps    r/t   rKB/t   wKB/t  Mcpu/t  size  rss   un    gn   ul  gl   engine
39366  1.00  10.38   2.36    878    377   12.2  0.6   0.6  1   0.8  zstd
36524  1.00  10.47   2.45    992    381   12.1  0.7   0.6  1   0.9  zlib-1
37233  0.97   9.76   2.30   1002    360   12.0  0.7   0.7  1   0.9  zlib-6

Wednesday, September 7, 2016

Tuning the RocksDB block cache

I spent many years using InnoDB with direct IO and I didn't worry about buffered IO performance. Well, I didn't worry until Domas told me to worry. My focus has switched to RocksDB and now I worry about buffered IO performance. Fortunately, another co-worker (Jens Axboe) promises to make buffered writeback much better.

With direct IO, InnoDB stores compressed and uncompressed pages in the InnoDB buffer pool. It has a clever algorithm to determine how much memory to use for each based on whether the workload appears to be IO or CPU bound. My vague memory is that we tune my.cnf to keep it from being too clever.

With buffered IO, RocksDB manages a block cache for uncompressed blocks and then depends on the OS page cache for compressed blocks. While I think there is an opportunity to be more efficient in that area, that is not the topic for today.

The question today is how to divide memory between the RocksDB block cache and the OS page cache. I have read tuning advice for other buffered IO databases that suggests giving as much RAM as possible to the database. I disagree and my advice is:
  1. If the uncompressed working set fits in the RocksDB block cache then give as much RAM as possible to the block cache.
  2. Else if the compressed working set fits in the OS page cache then give most RAM to the OS page cache by using a small RocksDB block cache.
  3. Else give the RocksDB block cache about 20% of host RAM.
This is a rule of thumb. Sometimes in rule 3 I suggest giving 25% or 30% to the block cache, but I hope you get the point. The goal is to avoid reads from storage by caching more data in RAM. I assume that decompressing a block is much faster than reading it from storage, which is more likely when you use zstandard.
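
A minimal sketch of those rules follows. The working-set sizes are inputs you have to estimate yourself, and the exact fractions used for rules 1 and 2 are my assumptions, not part of the rule of thumb:

  # Suggest a RocksDB block cache size in GB following the three rules above.
  def block_cache_gb(ram_gb, uncompressed_ws_gb, compressed_ws_gb):
      if uncompressed_ws_gb <= ram_gb:
          # Rule 1: give as much RAM as possible to the block cache
          return 0.9 * ram_gb
      if compressed_ws_gb <= ram_gb:
          # Rule 2: keep the block cache small so the OS page cache gets most RAM
          return 0.1 * ram_gb
      # Rule 3: give the block cache about 20% of host RAM
      return 0.2 * ram_gb

  # The Linkbench test below: 50G of RAM, ~350G database -> rule 3 -> 10G
  print(block_cache_gb(50, 500, 350))  # 10.0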

This isn't proven unless you accept proof by anecdote. I ran a test with Linkbench on a host with 50G of RAM and a ~350G database. The test was repeated with the RocksDB block cache set to 5G, 10G, 20G and 35G. Using a smaller block cache reduced the storage read cost per transaction, measured by iostat r/s and iostat rKB/s, by between 10% and 20%. My advice might not work for you, but it might help you consider your choices before following tuning advice you read on the web.