Wednesday, February 8, 2017

MyRocks, InnoDB and in-memory insert performance

This is yet another result to understand an insert-only workload. In this case the database is small enough to be cached by the storage engine. Previous results were explained here, here and here. From several tests it is apparent that MyRocks and InnoDB from MySQL 5.6 have similar throughput while InnoDB from MySQL 5.7 does much better for cached & insert-only when the server has fast SSD.

In this result I changed a few options to reduce the chance that compaction stalls will occur with MyRocks. I increased level0_slowdown_writes_trigger to 20 and level0_stop_writes_trigger to 30. I also ran tests with universal compaction. We probably have some work to do here -- improving default option values and making compaction throttling less spiky.

tl;dr
  • MyRocks was faster than InnoDB in MySQL 5.6 and 5.7 at 1 and 2 concurrent clients
  • MyRocks and InnoDB/MySQL-5.6 have similar throughput at 4+ concurrent clients
  • InnoDB/MySQL-5.7 is fastest at 8+ concurrent clients
  • Disabling the binlog helps MyRocks more than InnoDB. We have work in progress to make binlog processing faster for MyRocks.
  • MyRocks with leveled compaction and the binlog is limited by compaction stalls in moving data from level 0 to level 2.
Note that MyRocks did great when able to avoid the bottlenecks from binlog processing and compaction stalls. That is apparent in the result for universal compaction with the binlog disabled. 

Test

For this test I inserted 400M rows using the insert benchmark. The test was run for 1, 2, 4, 8, 12 and 16 concurrent clients. There was a table per client and each table has 3 secondary indexes. The inserts are in PK order. The database cache was large enough to cache the database.

Tests were repeated with the binlog enabled and disabled. For MyRocks I used a build from November 21 to match what was used in previous tests. For MyRocks I also ran tests with leveled and universal compaction. This is the first time I tried universal compaction with MyRocks.

Results

The numbers in the table below are the average insert rate, larger is better.

1       2       4       8       12      16      concurrency
34465   61843   106185  131536  131926  121988  innodb-5.6.26, binlog
35217   63857   108666  133200  133023  125196  innodb-5.6.26, no binlog
31461   60855   114058  201918  258398  281690  innodb-5.7.10, binlog
32962   62490   117028  206292  263678  282486  innodb-5.7.10, no binlog

leveled compaction
38465   71467   113186  125392  125865  125984  myrocks, binlog
44092   81950   132275  142298  139519  134998  myrocks, no binlog

universal compaction
37474   71162   115042  133556  135181  134003  myrocks, binlog
43006   79888   134590  198511  238379  252207  myrocks, no binlog

Graphs

These graphs show the row insert rate per 5 second interval for the test with 16 clients. The graph for InnoDB-5.7 ends earliest because it finishes the load first. The graph for MyRocks with leveled compaction has the most variance when ingest is faster than compaction from level 0 to level 2. Ignoring the variance, MyRocks throughput declines less over time than InnoDB.

The next graph is for InnoDB-5.7. While throughput declines over time there is not much variance.
The next graph is for InnoDB-5.6. It has more variance and loses more throughput over time compared to InnoDB-5.7. MySQL has been busy making InnoDB better.
The next graphs are for MyRocks with leveled and then universal compaction. There are more stalls with leveled compaction. The average throughput is similar and binlog processing is the bottleneck. We have work in progress to improve leveled compaction and the overhead from the binlog.


Thursday, February 2, 2017

Why MyRocks?

This is yet another attempt to explain why MyRocks is interesting. MySQL continues to get more popular and should soon become #1 on the db-engines ranking. Popularity isn't the only thing about MySQL that is improving -- performance, manageability and availability are also much better and cloud vendors, especially Amazon, are doing remarkable work to make MySQL in the cloud much easier than it used to be.
While Oracle has been a great owner of MySQL we still benefit from external contributions to the core DBMS. But new storage engines are unlikely, even ones that had great promise like TokuDB and PBXT, because the storage engine API is hard to implement. I am surprised and thrilled by the progress we are making with MyRocks. It is in production for us today and I expect it to be in production elsewhere within the next 12 months. I am grateful to work with talented teams (MyRocks, MySQL & RocksDB) and understanding management.

Efficient performance is the reason for MyRocks. We want to provide performance similar to InnoDB but with much better storage efficiency. The technical message is that MyRocks has less space and write amplification than InnoDB without sacrificing too much read amplification. For a workload I care about it uses 1/2 the space and writes at 1/10 the rate of InnoDB. The less technical message is that with MyRocks a deployment needs less SSD and the SSD will last longer.

The other important question is when to use MyRocks. My goal is for MyRocks to be an alternative to InnoDB for any workload where the database is larger than RAM. While we have optimizations for in-memory workloads with RocksDB I don’t focus on that today. I also assume that too many workloads use too much RAM today. If you have fast storage with SSD and the database fits in RAM then you probably have too much RAM. So my pitch is to use more fast storage and less RAM and use less space, less power and possibly fewer servers.

Monday, January 30, 2017

Compaction stalls: something to make better in RocksDB


In previous results that I shared for the insert benchmark it was obvious that MyRocks throughput is steady when the workload transitions from in-memory to IO-bound. The reason is that non-unique secondary index maintenance is read-free for MyRocks so there are no stalls for storage reads of secondary index pages. Even with the change buffer, InnoDB eventually is slowed by storage reads and by page writeback.

It was less obvious that MyRocks has more variance on both the in-memory and IO-bound insert benchmark tests. I try to be fair when explaining storage engine performance so I provide a few more details here and results for InnoDB in MySQL 5.7.10 & 5.6.26 along with MyRocks from our fork of MySQL 5.6. The binlog was enabled for all tests, fsync-on-commit was disabled and 16 clients inserted 500m or 2b rows into 16 tables in PK order. Each table has 3 secondary indexes. For MyRocks I made one configuration change from the earlier result. I changed level0_slowdown_writes_trigger from 10 to 20 to reduce compaction stalls. This has a potential bad side effect of making queries slower, but this test was insert-only.

For both graphs the y-axis is the average insert rate per 5-second interval and the x-axis is the interval number. I used mstat to collect the data.

The goal is to match InnoDB in MySQL 5.7 in quality of service while providing much better throughput. We have some work to do. On the bright side, this is an opportunity for someone to make RocksDB better.

In-memory

The first graph is for the in-memory workload where all data is cached by the storage engine, nothing is read from storage, but many storage writes are done to persist the changes. InnoDB with MySQL 5.7 is much faster than with 5.6. It also has the least variance. MyRocks has the most variance and that is largely from compaction stalls when there are too many files in level 0 of the LSM tree.


IO-bound

The second graph is for the IO-bound workload. The graph ends first for MyRocks because it sustains the highest average insert rate. But it also has a thick line because of variance. InnoDB from MySQL 5.6 also has a lot of variance.

Monday, January 23, 2017

The value of write efficiency for the insert benchmark

I shared results last week for the insert benchmark. The test server for that result had fast SSD courtesy of several NVMe NAND flash devices. I repeated tests on a server with slower SSD and the results are interesting. Regardless of the storage device, MyRocks did better when the database was larger than RAM and InnoDB 5.7.10 did better when the database fit in RAM. But the difference between MyRocks and InnoDB 5.7.10 for the in memory workload is much smaller on slow SSD than it is on fast SSD. It is always risky to ignore the context from benchmark results and the important context here is the performance of the storage device.

Slower SSD hurts InnoDB more than it hurts MyRocks because MyRocks is more efficient for writes. Spending less IO on writes saves IO to do more writes or more reads as I previously demonstrated for Linkbench.

When the database fits in RAM there are no reads from storage. But there are many writes to storage whether or not the database is larger than RAM and the insert benchmark workload is always IO heavy. From looking at PMP stacks InnoDB throughput is limited by the rate at which the redo log is written and InnoDB in 5.6.26 has more CPU overhead from mutex contention than in 5.7.10. Note that the binlog was enabled but sync-on-commit was disabled so there was more stress on redo log throughput.

Workload

The workload is fully described in my previous post. 500m rows are inserted for the in memory test and 2b rows for the larger than memory test. All tests use 16 clients, each client inserts to a different table and each table has 3 secondary indexes to maintain.

The slower SSD provides about 10k page reads second versus more than 100k/second provided by the fast SSD server.  Both servers have similar CPUs and RAM.

By in memory I really mean that the database working set fits in memory but it is easier to ignore that and just claim the database fits in memory. Sorry for the confusion.

Throughput

The throughput listed below is the average insert rate during the test. Things that I notice in the results include:
  • InnoDB loses more throughput when going from fast to slow SSD and from in memory to larger than memory workloads.
  • Restating the previous point, MyRocks throughput is more stable when going from fast to slow SSD and from in memory to larger than memory workloads.
  • InnoDB in MySQL 5.7.10 scales much better than 5.6.26 for in memory write-heavy workloads

Throughput for in memory database
            slow-SSD  fast-SSD
MyRocks        83766    102712
InnoDB-5.7    124782    268873
InnoDB.5.6     66251    111111

Throughput for larger than memory database

            slow-SSD  fast-SSD
MyRocks        86401   115234
InnoDB-5.7     38207    65604
InnoDB-5.6     14784    52812

IO Efficiency

Things that I notice in the results include:
  • the value for wKB/i for MyRocks is somewhat smaller than InnoDB for the in memory database and much smaller than InnoDB for the larger than memory database. I expected this.
  • InnoDB in MySQL 5.7 is able to sustain higher IO rates than in 5.6 thanks to many performance improvements in InnoDB.
  • Assuming the slow SSD saturates around 400 MB/s of IO, then writing less per insert for MyRocks means it gets more inserts/second when IO saturates.

Legend:
* IPS - average inserts/second
* rMB/s, wMB/s - read & write MB/s via iostat
* rKB/i, wKB/i - read & write KB per insert via iostat
* note that iostat usually overstates bytes written by 2X

  for RocksDB because it counts bytes trimmed as bytes written

IO metrics for in memory database
             IPS     rMB/s   wMB/s   rKB/i   wKB/i
- slow SSD
MyRocks       83766    0      280     0       3.34
InnoDB-5.7   124782    0      533     0       4.27
InnoDB-5.6    66251    0      281     0       4.24
- fast SSD
MyRocks      102712    0      339     0       3.30
InnoDB-5.7   268873    0     1163     0       4.35
InnoDB-5.6   111111    0      451     0       4.05

IO metrics for larger than memory database
             IPS     rMB/s   wMB/s   rKB/i   wKB/i
- slow SSD
MyRocks       86401   86      346     1.00    4.00
InnoDB-5.7    38207   42      385     1.21   10.08
InnoDB-5.6    14784   36      169     2.44   11.41
- fast SSD
MyRocks      115234  118      457     1.03    3.96
InnoDB-5.7    65604   75      878     1.14   13.38
InnoDB-5.6    52812   71      525     1.34    9.94

Tuesday, January 17, 2017

Insert benchmark, MyRocks and InnoDB

I haven't been sharing many performance results on my blog as I have been saving results for my conference talks. Later this year I expect to blog more and travel less. Today I will share two results from the insert benchmark to compare MyRocks and InnoDB.

tl;dr - both are expected
  • MyRocks insert performance doesn't degrade when the database is larger than memory
  • InnoDB in MySQL 5.7 scales much better than in MySQL 5.6 for in-memory workloads
  • InnoDB in MySQL 5.7 wins for in-memory, MyRocks wins for io-bound

Configuration

This test does inserts in PK order into a table with 3 secondary indexes. Without the secondary indexes the workload is friendly to a b-tree. But secondary index for a b-tree is read-modify-write and the reads from and writes to storage will make InnoDB slower when the working set doesn't fit into RAM. InnoDB is likely to do better than other b-trees because of the insert buffer. With MyRocks, non-unique secondary index maintenance is read-free so performance doesn't drop much when the working set no longer fits in RAM.

I ran the insert benchmark using this script with two configurations. Both configurations used 16 clients, 16 tables and a client per table. For the io-bound configuration there was 50gb of RAM for the database and OS and 2B rows were inserted.  For the in-memory configuration there was 256gb of RAM for the database and OS and all data was cached by the database cache.

The test server has 48 cores with HT enabled and fast NVMe flash. For MyRocks I used the FB MySQL repo. For InnoDB I used MySQL 5.6.26 and 5.7.10. I tried to use a good my.cnf for all engines and the MyRocks settings are similar to what we suggest. For the io-bound configurations I used a 10gb RocksDB block cache and 35gb InnoDB buffer pool. For the in-memory configurations all database caches were set to 190gb. The binlog was enabled for all tests, but sync on commit was disabled for the database redo log & binlog.

I disabled compression for all engines in this test. InnoDB still uses much more space on disk, probably because of fragmented leaf pages.

Average insert rates

The average insert rates for each configuration. It is odd that the insert rate for io-bound MyRocks is better than for in-memory and I am not sure I will debug it. There isn't much difference between InnoDB from MySQL 5.6 and 5.7 for the io-bound configuration. There is a huge difference between them for the in-memory configuration. I assume the difference is all of the performance work done upstream (thank you). MyRocks is much faster than InnoDB for the io-bound configuration. InnoDB in MySQL 5.7 is much faster than MyRocks for the in-memory configuration. I expect MyRocks to do better on the in-memory setup in the future.

Legend for the table:
  • io-bound - average inserts/second for the io-bound configuration
  • in-memory - average inserts/second for the in-memory configuration
  • size - database size in GB at test end
  • engine - database engine

io-bound  in-memory  size   engine
115234    103088     226    MyRocks

 65604    267523     362    InnoDB-5.7.10
 52812    111259     362    InnoDB-5.6.26

IO-bound results

I have two graphs to show the insert rates over time. This is for data from one of the 16 clients rather than the global rate because my test script forgot to collect that. The first graph uses log scale for the x-axis. The second graph does not. The first graph makes it easier to see how throughput drops for InnoDB as the database gets larger. While it is odd that the throughput rate for InnoDB 5.7 picks up at test end, that can occur if the thread I monitored ran longer than other threads and had less competition for HW resources near test end. The graph for MyRocks stops earlier than the InnoDB graphs because it finishes the insert of 2B rows much earlier.

As a bonus for MyRocks, it writes about 1/4 the amount to storage per inserted row when compared to InnoDB. That makes SSD last longer and using less IO for writes saves more IO for reads.



Tuesday, January 3, 2017

MyRocks, MongoRocks and RocksDB at Percona Live 2017

Percona Live 2017 is a great place to learn about MyRocks, MongoRocks and RocksDB. The MyRocks community continues to grow as one of the talks is from Alibaba. I am excited that they help make MyRocks better.

Talks about RocksDB

Talks about !RocksDB
There are two other talks that I want to attend given their focus on storage engines:

Friday, December 30, 2016

The MyRocks community in 2016

The MyRocks community grew in 2016 with significant contributions from early evaluators. I appreciate their help and have been busy documenting the problems that they reported. I look forward to more contributions in 2017.

MariaDB Corporation and Percona announced plans to include MyRocks in their distributions. You will be able to get MyRocks with expert support. Equally big have been the contributions from Sergey Pertunya to MyRocks.

The MySQL team at Alibaba reported poor performance from range scans with sysbench. The problem was memory system contention on a performance counter and the fix made long range scans 2X faster on concurrent workloads.

I have been using sysbench more and created a script to run a sequence of tests with it. During my tests I noticed too much variance during the read-only tests and the problem is non-determinism from the state of the memtable and level 0 of the LSM tree. The overhead for a search of the LSM tree depends on the amount of data in the memtable and the number of files in level 0. This state is not deterministic when a read-only test follows a read-write test. I filed issue 427 to see if we can get RocksDB to adapt and flush data earlier than normal from the memtable and level 0 when a workload becomes read-only.

Sysbench is extremely useful even though the workloads provided by it are synthetic -- especially the read-only tests with a small (multi-GB) database and most tests use a uniform distribution for keys. There is a talk on sysbench at FOSDEM 2017 and I look forward to using it more next year.

Justin Swanhart pointed out the lousy performance MyRocks provided with a default configuration. As a result we improved the default configuration by increasing the memtable to 64M and block cache to 512M and I will share results at FOSDEM. See issues 369375, and 441.

Justin Swanhart and Rick Pizzi reported problems with too-large transactions leading to lousy performance and OOM. MyRocks buffers in memory all changes from a transaction until commit and at commit time the changes are briefly double buffered when copied into the memtable. Until we make this better the workaround is to use rocksdb_commit_in_the_middle for bulk loads. Even after we make this better that option will be useful for bulk loads. Today we have a poor limit on the max size of a transaction via rocksdb_max_row_locks. This is a limit on the number of locked rows rather than on the memory used by a transaction. Even worse, the default is 1B. We are changing the limit to be on the amount of memory used by a transaction. Issues for this include 341, 347, 348 and 365.

The MySQL team at Alibaba reported a performance regression from MyRocks group commit when running sysbench. They also reported a correctness problem that I ran into with sysbench. Auditing the code found another bug. A related problem is that rocksdb_disable_2pc was enabled by default, meaning engine & binlog crash safety wasn't guaranteed. We have changed the default configuration to be crash safe and renamed the option to avoid double negatives. We also used YSCB and sysbench to reproduce the performance regression and have work in progress to make this better. That requires collaboration between the MyRocks and RocksDB teams. It is fun to watch the teams solve problems. Issues for this problem include 474481, 482488.