Tuesday, October 11, 2016

Making the case for MyRocks. It is all about efficiency.

I gave two talks at Percona Live: one on MyRocks and another on web-scale. The talk links include the slides, but slides lose a lot of context. But first, the big news is that MyRocks will appear in MariaDB Server and Percona Server. That is great for the community, because getting MyRocks into supported distributions makes it usable for many more people.

Efficiency is the reason for MyRocks. The RUM Conjecture explains the case in detail. The summary is that MyRocks has the best space efficiency, better write efficiency and good read efficiency compared to other storage engines for MySQL. The same is true of MongoRocks compared to other MongoDB storage engines. Better efficiency is a big deal. Better compression means you need less SSD capacity. Better write efficiency means you get better SSD endurance, or that you can switch from MLC to TLC NAND flash. Better write efficiency also means that more IO capacity is available to handle reads from user queries.

But performance in practice has nuance that theory can miss. While I expect read performance to suffer with MyRocks compared to InnoDB, I usually don't see that when evaluating production and benchmark workloads. I spent most of this year doing performance evaluations for MyRocks and MongoRocks. I haven't shared much beyond summaries. I expect to share a lot in the future.

I prefer not to write about performance in isolation. I want to write about performance, quality of service and efficiency. By performance I usually mean peak or average throughput under realistic conditions. By quality of service I mean the nth (95th, 99th) percentile response time for queries and transactions. By efficiency I mean the amount of hardware consumed: CPU time, disk reads, disk KB written and disk space. I have frequently written about performance in isolation in the past. I promise to do that less frequently in the future.

My other goal is to explain the performance that I measure. This is hard to do. I define benchmarketing as the use of unexplained performance results to claim that one product is better than another. While I am likely to do some benchmarketing for MyRocks, I will also provide proper benchmarks where results are explained and details on quality of service and efficiency are included.

Let me end this with benchmarking and benchmarketing. For benchmarking I have a result from Linkbench on a small server: an Intel 5th-generation Core i3 NUC with 4 HW threads, 8GB of RAM and a Samsung 850 EVO SSD. The result here is typical of results from many tests I have done. MySQL does better than MongoDB, MyRocks does better than InnoDB and MongoRocks does better than WiredTiger. MyRocks and MongoRocks have better QoS based on the p99 update time in milliseconds. The hardware efficiency metrics explain why MyRocks and MongoRocks have more throughput (TPS is transactions/second). M*Rocks does fewer disk reads per transaction (iostat r/t), writes less to disk per transaction (iostat wKB/t) and uses less space on disk (size GB). It uses more CPU time per transaction than uncompressed InnoDB. That is the price of better compression. Why it has better hardware efficiency is a topic for another post and conference talk.

For benchmarketing I have a result from read-only sysbench for an in-memory database. MyRocks matches InnoDB at low and mid concurrency and does better at high concurrency. This is a workload (read-only and in-memory) that favors InnoDB.


4 comments:

  1. Hi Mark,
    Again, great news about the benchmarks, congratulations.
    I was wondering, does MyRocks provide MVCC capabilities similar to how InnoDB does? Also, how does MyRocks handle crash recovery, transaction snapshots, etc.? I am trying to understand at what point MyRocks can be a drop-in replacement for InnoDB. If it's not too much to ask, can we get a blog post covering these areas?
    Thanks

  2. A proper answer requires multiple chapters in a book. A lot more information is on the wiki - https://github.com/facebook/rocksdb/wiki. Start here -> https://github.com/facebook/rocksdb/wiki/RocksDB-Basics

    RocksDB does two things to support consistent reads. First, every KV pair has a sequence number that determines whether it is visible to a given snapshot. Second, compaction doesn't drop old versions of KV pairs that are still visible to at least one open snapshot.
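
    Here is a minimal sketch of that, using the plain RocksDB API rather than MyRocks. The database path and keys are placeholders; a snapshot pins the current sequence number, so reads through it don't see later writes and compaction keeps the old version alive while the snapshot is open.

    ```cpp
    #include <cassert>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::DB* db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/snapshot_demo", &db);
      assert(s.ok());

      s = db->Put(rocksdb::WriteOptions(), "k", "v1");
      assert(s.ok());

      // Pin the current sequence number.
      const rocksdb::Snapshot* snap = db->GetSnapshot();

      // A later write gets a higher sequence number.
      s = db->Put(rocksdb::WriteOptions(), "k", "v2");
      assert(s.ok());

      // Reads through the snapshot see "v1"; compaction must keep that
      // version alive while the snapshot is open.
      rocksdb::ReadOptions ropts;
      ropts.snapshot = snap;
      std::string value;
      s = db->Get(ropts, "k", &value);
      assert(s.ok() && value == "v1");

      // Reads without the snapshot see the latest version, "v2".
      s = db->Get(rocksdb::ReadOptions(), "k", &value);
      assert(s.ok() && value == "v2");

      db->ReleaseSnapshot(snap);
      delete db;
      return 0;
    }
    ```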

    MyRocks provides read committed (RC) and repeatable read (RR). Today MyRocks RR matches Postgres-style RR, not InnoDB-style (no gap locking). So, just like Postgres, we expect most deployments to use RC. For more details see https://github.com/mdcallag/mytools/wiki/Cursor-Isolation

    MyRocks with RR isn't a drop-in replacement for InnoDB with RR because the semantics aren't the same. We might add gap locking to make it similar. MyRocks with RC is closer to a drop-in replacement for InnoDB with RC.

    Plain RocksDB uses a WriteBatch so that MultiPut is atomic. MyRocks continues to use the WriteBatch and does a MultiPut at commit time. But it also uses the RocksDB transaction API, which provides a lock manager that locks changed rows prior to commit and provides both RC and RR cursor isolation - https://github.com/facebook/rocksdb/wiki/Transactions
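
    For illustration, here is a minimal sketch of that transaction API in plain RocksDB, not MyRocks code; the path and keys are placeholders.

    ```cpp
    #include <cassert>

    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* txn_db = nullptr;
      rocksdb::Status s = rocksdb::TransactionDB::Open(
          options, txn_db_options, "/tmp/txn_demo", &txn_db);
      assert(s.ok());

      // Begin a pessimistic transaction.
      rocksdb::WriteOptions write_options;
      rocksdb::Transaction* txn = txn_db->BeginTransaction(write_options);

      // Each Put locks the key via the lock manager; a conflicting writer
      // gets a busy/timed-out status instead of silently overwriting.
      s = txn->Put("a", "1");
      assert(s.ok());
      s = txn->Put("b", "2");
      assert(s.ok());

      // Commit applies both writes atomically (as a WriteBatch) and
      // releases the locks.
      s = txn->Commit();
      assert(s.ok());

      delete txn;
      delete txn_db;
      return 0;
    }
    ```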

    For durability there is a WAL (write-ahead log) and the protocol is: 1) write the WAL, then 2) update the memtable. If sync-on-commit is enabled then we fsync the WAL after writing it. Pure RocksDB has always supported group commit in that case. With MyRocks and the binlog we also support group commit (recently checked in).
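
    A minimal sketch of what sync-on-commit means at the RocksDB layer; this is my illustration, not MyRocks code, and commit_batch is a made-up helper.

    ```cpp
    #include <cassert>

    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    // commit_batch is a hypothetical helper, only for illustration.
    void commit_batch(rocksdb::DB* db, bool sync_on_commit) {
      rocksdb::WriteBatch batch;
      batch.Put("k1", "v1");
      batch.Put("k2", "v2");

      // The batch is appended to the WAL first, then applied to the
      // memtable. With sync=true the WAL is fsync'd before the write is
      // acknowledged; concurrent syncing writers can share one fsync
      // (group commit).
      rocksdb::WriteOptions write_options;
      write_options.sync = sync_on_commit;
      rocksdb::Status s = db->Write(write_options, &batch);
      assert(s.ok());
    }
    ```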

    Crash recovery is much simpler than for a b-tree. First, replay the writes logged in the most recent WAL into the memtable, then flush the memtable. Maybe the RocksDB team will laugh at my description, as I never worked on that code. Actually, I have not worked on most of the RocksDB code, but I spend a lot of time doing perf evaluations for it. We have a strong team.
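
    To make that description concrete, here is a conceptual sketch only; WalRecord, MemTable and FlushToSst are made-up names, not the real RocksDB recovery code.

    ```cpp
    #include <string>
    #include <utility>
    #include <vector>

    // Made-up types for illustration; the real code lives in RocksDB.
    struct WalRecord { std::string key; std::string value; };

    struct MemTable {
      std::vector<std::pair<std::string, std::string>> rows;
      void Add(const std::string& k, const std::string& v) { rows.emplace_back(k, v); }
    };

    void FlushToSst(const MemTable& memtable) {
      // Write the memtable contents to an SST file (omitted here).
      (void)memtable;
    }

    // Conceptual recovery: replay the most recent WAL into an empty
    // memtable in log order, then flush that memtable.
    void Recover(const std::vector<WalRecord>& wal, MemTable* memtable) {
      for (const WalRecord& rec : wal) {
        memtable->Add(rec.key, rec.value);
      }
      FlushToSst(*memtable);
    }
    ```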

  3. RocksDB includes Facebook's "Additional Grant of Patent Rights Version 2": https://github.com/facebook/rocksdb/blob/master/PATENTS

    As I understand it, everybody who uses MariaDB Server or Percona Server in the future will also be affected by it?

    1. Great question that I don't have the skills to answer. I suggest asking MariaDB and Percona directly.
