Wednesday, October 12, 2016

MongoRocks and WiredTiger versus linkbench on a small server

I have spent a lot of time evaluating open-source database engines over the past few years and WiredTiger has been one of my favorites. The engine and the team are excellent. I describe it as a copy-on-write-random (CoW-R) B-Tree, as defined in a previous post. WiredTiger also has a log-structured merge tree, but it isn't officially supported in MongoDB. Fortunately we have MongoRocks if you want an LSM.

This one is long. Future reports will be shorter and reference this. My tl;dr for Linkbench with low concurrency on a small server:
  • I think there is something wrong in how WiredTiger uses zlib, at least in MongoDB 3.2.4
  • mmapV1 did better than I expected.
  • We can improve MongoRocks write efficiency and write throughput. The difference for write efficiency between MongoRocks and other MongoDB engines isn't as large as it is between MyRocks and other MySQL engines.
Update - I have a few followup tasks to do after speaking with WiredTiger and MongoRocks gurus. First, I will repeat tests using MongoDB 3.2.10. Second, I will use zlib and zlib-noraw compression for WiredTiger. Finally, I will run tests with and without the oplog to confirm whether the oplog hurts MongoRocks performance more than WiredTiger.

All about the algorithm

Until recently I had not been sharing my performance evaluations that compare MongoRocks with WiredTiger. In some cases MongoRocks performance is much better than WiredTiger's and I want to explain those cases. There are two common reasons. First, WiredTiger is a new engine and there is more work to be done to improve performance. I see progress and I know more is coming. This takes time.

The second reason for differences is the database algorithm. An LSM and a B-Tree make different tradeoffs between read, write and space efficiency. See the RUM Conjecture for more details. In most cases an LSM should have better space and write efficiency while a B-Tree should have better read efficiency. But better write and space efficiency can enable better read efficiency. First, when less IO capacity is consumed for writing back database changes, more IO capacity is available for the storage reads done on behalf of user queries. Second, when less space is wasted while caching database blocks, the cache hit ratio is higher. I expect the second reason is more of an issue for InnoDB than for WiredTiger because WT does prefix encoding for indexes and should have little or no fragmentation for database pages in cache.
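To make the first point concrete, here is a minimal Python sketch of the IO-budget argument. The device budget is an assumed number for illustration, not a measurement of the 850 EVO used in these tests.

# A minimal sketch: on a device with a fixed IO budget, whatever write-back
# does not consume is left over for the reads behind user queries.
# DEVICE_MB_PER_SEC is an assumed value, not a measured one.

DEVICE_MB_PER_SEC = 200.0

def read_capacity_left(write_back_mb_per_sec, budget=DEVICE_MB_PER_SEC):
    # Capacity that remains for user reads after write-back takes its share.
    return max(0.0, budget - write_back_mb_per_sec)

# An engine that writes back 40 MB/s instead of 120 MB/s leaves twice as much
# of the budget for reads (160 vs 80 MB/s in this example).
print(read_capacity_left(120))  # 80.0
print(read_capacity_left(40))   # 160.0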

Page write-back is a hard feature to get right for a B-Tree. There will be dirty pages at the end of the buffer pool LRU and these pages must be written back as they approach the LRU tail. Anything that needs to read a page into the buffer pool will take a page from the LRU tail, and if that page is still dirty then the requester will stall until the page has been written back. It took a long time to make this performant for InnoDB and work still remains. It will take more time to get this right for WiredTiger. Checkpoint and eviction are the steps by which dirty pages are written back for WT. While I am far from an expert on this, I have filed several performance bugs and feature requests (and many of them have been fixed). One open problem is that checkpoint is still single threaded. This one thread must find dirty pages, compress them and then do buffered writes. When zlib is used that is too much work for one thread. Even with a faster compression algorithm I think more threads are needed, and the cost of faster compression is more space used for the database. SERVER-16736 is open as a feature request for this.
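Here is a rough sketch of why one checkpoint thread is not enough when it must also compress pages. The compression throughput numbers below are assumptions I picked for illustration; they are not measurements from these servers.

# Assumed single-core compression throughput. Illustrative numbers only.
ZLIB_MB_PER_SEC = 50.0     # assumed zlib compression speed per core
SNAPPY_MB_PER_SEC = 400.0  # assumed snappy compression speed per core

def checkpoint_keeps_up(dirty_mb_per_sec, compress_mb_per_sec, threads=1):
    # The checkpoint thread must find dirty pages, compress them and issue
    # buffered writes. If compression dominates, its throughput caps the
    # write-back rate, and falling behind means page-eviction stalls.
    return dirty_mb_per_sec <= compress_mb_per_sec * threads

# Example: a workload that dirties ~80 MB/s of pages overwhelms one zlib
# thread but not one snappy thread, which is consistent with the zlib vs
# snappy gap in the WiredTiger results below.
print(checkpoint_keeps_up(80, ZLIB_MB_PER_SEC))    # False
print(checkpoint_keeps_up(80, SNAPPY_MB_PER_SEC))  # True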

Test setup

I have three small servers at home. They used Ubuntu 14.04 at the time, but have since been upgraded to 16.04. Each is a Core i3 with 2 cores, 4 HW threads and 8G of RAM. The storage is a 120G Samsung 850 EVO m.2 SSD for the database and a 7200 RPM disk for the OS. I like the NUC servers but my next cluster will use a better CPU (Core i5) with more RAM.

The benchmark is Linkbench using LinkbenchX from Percona, which has support for MongoDB. For the WiredTiger and MongoRocks engines this doesn't use storage-engine transactions to protect Linkbench's multi-operation transactions. I look forward to multi-document transactions in a future MongoDB release. I use main from my Linkbench fork rather than from LinkbenchX to avoid the feature that sustains a constant request rate, because it added too much CPU overhead in some tests.
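As an illustration of what running without multi-document transactions means here, the sketch below shows a Linkbench-style add_link as two independent writes via pymongo. The collection and field names are hypothetical, not the actual LinkbenchX schema.

# A sketch, not the LinkbenchX implementation: collection and field names are
# hypothetical. The two updates below are independent operations, so a crash
# between them can leave the link and its count out of sync. Multi-document
# transactions would let them commit or abort together.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["linkdb"]

def add_link(id1, link_type, id2):
    # Write 1: upsert the link document.
    db.link.update_one(
        {"id1": id1, "link_type": link_type, "id2": id2},
        {"$set": {"visibility": 1}},
        upsert=True,
    )
    # Write 2: increment the count document for (id1, link_type).
    db.count.update_one(
        {"id": id1, "link_type": link_type},
        {"$inc": {"count": 1}},
        upsert=True,
    )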

I ran two tests. First, I used an in-memory database workload with maxid1=2M. Second, I used an IO-bound database with maxid1=40M. By IO-bound I mean that the database is larger than 8G but smaller than 120G and the SSD is very busy during the test. Both tests were run with 2 connections for loading and 1 connection (client) for the query tests. The query tests were run for 24 1-hour loops and the result from the 24th hour is shared. I provide results for performance, quality of service (QoS) and efficiency. Note that for the mmapv1 IO-bound test I had to use maxid1=20M rather than 40M to avoid a full storage device.
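For reference, this is roughly how such a run can be structured: a load phase followed by 24 one-hour query loops, driven from a small Python wrapper. The config path is a placeholder, and the loop structure is my reading of the setup above; maxid1, the loader/requester counts and the per-loop duration are set in the Linkbench properties file.

import subprocess

# Placeholder path; maxid1, loader/requester counts and the per-loop run time
# are configured in this properties file rather than on the command line.
CONFIG = "config/LinkConfigMongoDb.properties"

def linkbench(*args):
    subprocess.run(["./bin/linkbench", "-c", CONFIG, *args], check=True)

# Load phase.
linkbench("-l")

# Query phase: 24 loops; the result from the 24th loop is what gets reported.
for _ in range(24):
    linkbench("-r")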

The oplog is enabled, sync-on-commit is disabled and WiredTiger/MongoRocks get 2G of RAM for cache. Tests were run with zlib and snappy compression. I reduced file system readahead from 128 to 16 for the mmapV1 engine tests. For MongoRocks I disabled compression for the smaller levels of the LSM, so for the cached database much more of the data is uncompressed. I limited the oplog to 2000MB. The full mongo.conf is at the end of this post.

I used MongoDB 3.2.4 to compare the MongoRocks, WiredTiger and mmapv1 engines. I will share more results soon for MongoDB 3.3.5 and I think results are similar. When MongoDB 3.4 is published I will repeat my tests and hope to include zstandard.

If you measure storage write rates with iostat then be careful, because iostat includes bytes trimmed as bytes written. If the filesystem is mounted with discard enabled and the database engine frequently deletes files (RocksDB does) then iostat might overstate bytes written. The results I share here have been corrected for that.
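A small sketch of that correction, assuming the trimmed bytes are estimated separately (for example from the sizes of the files the engine deleted); the function and the numbers in the example are mine, for illustration only.

def corrected_wkb_per_op(iostat_kb_written, trimmed_kb, operations):
    # wKB per operation with bytes trimmed removed from the iostat counter.
    # trimmed_kb must be estimated outside of iostat, e.g. from deleted files.
    return (iostat_kb_written - trimmed_kb) / operations

# Example with made-up numbers: iostat reports 30 GB written, 4 GB of that was
# TRIM from deleted files, and 10M rows were inserted -> wKB/i is 2.6, not 3.0.
print(corrected_wkb_per_op(30e6, 4e6, 10e6))  # 2.6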

Cached database load

These are the results for maxid1=2M. The database is cached for all engines except mmapV1.

Legend:
  • ips - average inserts/second
  • wKB/i - average KB written to storage per insert measured by iostat
  • Mcpu/i - CPU usecs/insert, measured by vmstat
  • Size - database size in GB at the end of the load
  • rss - mongod process size (RSS) in GB from ps at the end of the load
  • engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib
Summary:
  • MongoRocks has the worst insert rate. Some of this is because more efficient writes can mean less efficient reads and the LSM does more key comparisons than a B-Tree when navigating the memtable. But I think that most of the reason is management of the oplog where there are optimizations we have yet to do for MongoRocks.
  • MongoRocks writes the most to storage per insert. See the previous bullet point.
  • MongoRocks and WiredTiger use a similar amount of space. Note that during the query test that follows the load the size of WT will be much larger than MongoRocks. As expected, the database is much larger with mmapV1.

ips     wKB/i   Mcpu/i  size    rss     engine
5359    4.81     6807    2.5    0.21    rx.snap
4876    4.82    10432    2.2    0.45    rx.zlib
8198    1.84     3361    2.7    1.82    wt.snap
7949    1.79     4149    2.1    1.98    wt.zlib
7936    1.64     3353   13.0    6.87    mmapV1

Cached database query

These are the results for maxid1=2M for the 24th 1-hour loop. The database is cached for all engines except mmapV1.

Legend:
  • tps - average transactions/second
  • r/t - average storage reads per transaction, measured by iostat
  • rKB/t - average KB read from storage per transaction, measured by iostat
  • wKB/t - average KB written to storage per transaction, measured by iostat
  • Mcpu/t - CPU usecs/transaction, measured by vmstat
  • Size - database size in GB at test end
  • rss - mongod process size (RSS) in GB from ps at test end
  • un, gn, ul, gll - p99 response time in milliseconds for the most popular transactions: un is updateNode, gn is getNode, ul is updateList, gll is getLinkedList. See the Linkbench paper for details.
  • engine - rx.snap/rx.zlib is MongoRocks with snappy or zlib. wt.snap/wt.zlib is WiredTiger with snappy or zlib
Summary:
  • WiredTiger throughput is much worse with zlib than with snappy. I think the problem is that dirty page write back doesn't keep up because of the extra overhead from zlib compression. See above for my feature request for multi-threaded checkpoint. There is also a huge difference in the CPU overhead for WiredTiger with zlib compared to WT with snappy. That pattern does not repeat for MongoRocks. I wish I had looked at that more closely.
  • While WiredTiger and MongoRocks used a similar amount of space after the load, WT uses much more space after the query steps. I am not sure whether this is from live or dead versions of B-Tree pages.
  • Response time is better for MongoRocks than for WiredTiger. It is pretty good for mmapV1.
  • mmapV1 has the best throughput. I have been surprised by mmapV1 on several tests.
  • MongoRocks writes the least amount to storage per transaction.

tps  r/t  rKB/t  wKB/t  Mcpu/t  size   rss   un   gn   ul  gll  engine
1741   0      0   2.72   16203   3.7  2.35  0.3  0.1   1   0.9  rx.snap
1592   0      0   2.66   19306   3.0  2.36  0.4  0.2   1   1    rx.zlib
1763   0      0   5.70   23687   4.9  2.52  0.4  0.1   2   1    wt.snap
 933   0      0   8.94   81250   6.5  2.97  1    0.6   7   5    wt.zlib
1967 0.2   9.70   4.61   12048  20.0  4.87  0.9  0.7   1   1    mmapV1

IO-bound database load

These are the results for the load with maxid1=40M. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD, so its tests ran with half the data.

The summary is the same as it was for the cached database and I think we can make MongoRocks a lot faster.

ips     wKB/i   Mcpu/i  size    rss     engine
4896    7.11     8177   27      2.45    rx.snap
4436    6.67    11979   22      2.29    rx.zlib
7979    1.93     3526   29      2.20    wt.snap
7719    1.89     4330   24      2.30    wt.zlib
7612    1.85     3476   66      6.93    mmapV1, 20m

IO-bound database query

These are the results for maxid1=40M for the 24th 1-hour loop. The database does not fit in cache. I used maxid1=20M for mmapV1 to avoid a full SSD. So tests for it ran with half the data.

Summary:
  • Like the cached test, WiredTiger with zlib is much worse than with snappy. Most metrics are much worse for it. This isn't just the cost of zlib; I wonder if there is a bug in the way WT uses zlib.
  • Throughput continues to be better than I expected for mmapv1, but it has started to do more disk reads per transaction. It uses about 2X the space of the other engines for half the data.
  • MongoRocks provides the best efficiency with performance comparable to other engines. This is the desired result.

tps   r/t   rKB/t  wKB/t  Mcpu/t  size   rss   un    gn    ul  gll  engine
1272  1.22  12.82   4.02   17475  29     2.56   0.8   0.5   2   1   rx.snap
1075  1.03  10.44   4.56   25223  23     2.53   0.9   0.6   2   2   rx.zlib
1037  1.24  17.23  11.60   45335  34     2.69   1     1     3   2   wt.snap
 446  1.21  23.61  18.00  151628  33     3.38  13    11    21  18   wt.zlib
1261  2.43  34.77   5.28   13357  72     2.05   0.9   0.5   3   2   mmapV1, 20m

mongo.conf

This is the full mongo.conf for zlib. It needs to be edited to enable snappy.


processManagement:
  fork: true
systemLog:
  destination: file
  path: /path/to/log
  logAppend: true
storage:
  syncPeriodSecs: 60
  dbPath: /path/to/data
  journal:
    enabled: true
  mmapv1:
    journal:
      commitIntervalMs: 100
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib
    engineConfig:
      journalCompressor: none
      cacheSizeGB: 2
  rocksdb:
    cacheSizeGB: 2
    configString: "compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression:kZlibCompression;compression_opts=-14:1:0;"
operationProfiling:
  slowOpThresholdMs: 2000
replication:
  oplogSizeMB: 2000

11 comments:

  1. Great writeup! I wrote a long comment but I think blogspot swallowed it.

    1. that is unfortunate, I like your comments

    2. I tried retyping my musings from before below!

      Given that you're comparing an LSM to a B-Tree in the in-memory load workload, the fact that WT is beating MongoRocks is pretty surprising. It also seems to me that there's a lot of room for optimization in MongoRocks, and that's really exciting.

      Furthermore, I wonder if the oplog issue is due to an IO problem at all. The operations * wKB on the IO-bound load comes pretty close to the drive's 4k write speed limit (35MB/s), and I don't think the db/oplog is writing things in 4k random chunks.

      Do you think there would be a substantial difference even if you moved the oplog to an incredibly fast IO system, like one of the latest gen PCIe based flash storage? Or is the bottleneck the presence of the oplog itself?

      A question I like to consider is what the most apparent bottleneck is here. Certainly algorithms and code can be made more efficient and do less work per operation, or do that work in a way that is more efficient for the hardware & workload you're putting the system through. From your write workload, there isn't a substantial difference between the 2M and 40M insert workloads. You expect B-Trees etc. to get a tiny bit less efficient as the depth gets bigger, GC gets done, etc. -- so an insert at 40M objects is a bit slower than at 2M, but in general it does not look like disk IO is a bottleneck here. Similarly, based on the database size it does not look like write amplification is a big issue.

      From a performance perspective, more efficient and better algorithms and code are the way things really improve, but the hardware question is just as interesting. What is the best way to get close to doubling QPS on the write or read workload while keeping 99th percentile latency about the same: do we double the number of cores or increase core frequency? Is memory bandwidth a concern? Is disk IO actually an issue?

      I suspect right now the answer is that only doubling the CPU cores/clock rate would come close to doubling the QPS. My stress tests with multiple client insertion on WiredTiger on multi-socket hardware with fast storage have always found bottlenecks on the CPU at around 400k writes/second, even after the db size grows many multiples larger than the ~384GB of RAM, but before 3.2.10 I always ran into the cache issue stalls. I hope I can find some time to reproduce the tests with the latest WT version.

      Great article in general, I am looking forward to your 3.4 and 3.2.10 tests!

    3. The working set stays in memory, even for b-trees, during the linkbench load, which helps explain why there isn't a big difference for performance between the 2M and 40M configurations.

      I need to debug the MongoRocks load performance. I don't know why it does so much more IO. Compared to WT it writes too much. Whether that hurts performance depends more on the storage perf, but regardless, it is writing too much.

      I have tests in progress for MongoDB 3.2.10 (WT & Rocks) on slower TLC and faster MLC NAND. From that I will have a better opinion on whether progress is made for dirty page writeback stalls.

      Are you using WT for more than just benchmark evaluations?

    4. Alexey - I assume we have more low hanging fruit in MyRocks and MongoRocks that will help a lot with performance when fixed. One example is getting 2X more throughput for concurrent long range scans. There will be more - http://smalldatum.blogspot.com/2016/10/make-myrocks-2x-less-slow.html

    5. Mark, we've been running WT in production since 3.0.1 around April 2015 (on 3.2.9 now, due to upgrade to 3.2.10 after more testing). WT and snappy were a pretty big deal for us in terms of cost vs performance. We got started on 2.0 and grew with mongo, from global locks etc. I was very excited to test 2.8 and put 3.0 into production when it came out, and our team was a very early adopter of WT and snappy.

      WT and snappy were a real savior for us because we had been dealing with sharding and other issues from not being able to run our larger collections on a single instance. It also provided a way to continue using file system backups off a Master/Slave setup easily. We have about 20 i2.4xlarge & i2.8xlarge instances (plus a similar number for backups and secondaries) with host / database based distribution of data (ie, different logical databases are on different physical hosts).

      We are of course nowhere near as large in terms of load/ops as Facebook or other larger companies, but we do have around 1000 worker servers processing data through different pipeline stages, and we have been generally happy with how MongoDB and WiredTiger have progressed. WT and snappy alone, by my calculations, saved us at least $500k / year in DB hosting on reserved AWS (or unknown hours of work solving this in other ways), so I can't be anything but thankful for the progress the folks working on Mongo make for all of us.

    6. I am happy to learn that WT works for you. I am a big fan of it even though I don't get to use it in production.

  2. Great writeup.
    We've been using WT in production for a while, and while it's a huge improvement over MMAPv1 for us, we've also encountered many complex bugs along the way.
    Many of the latest WT bug fixes (SERVER-25974, SERVER-26898, ...) are based on our workload :) and while WT has made great progress, it's still somewhat buggy under complex, rich workloads.
    We just recently tried using MongoRocks and have seen some stability improvements compared to WT; however, it's harder to debug once you have issues and it's unsupported by the Mongo team.

    1. Thank you for making WT better. I know a few things: WT has the talent to make their engine robust, and with MongoDB they have the resources, but making a new storage engine robust is a multi-year problem.

      We have that same problem for RocksDB, MyRocks and MongoRocks. And I think we also have the skill & resources to make them robust. If you need help with MongoRocks then Percona is the resource to use.

      From my evaluations, an LSM is likely to always be better (performance and/or efficiency) than a B-Tree for some workloads. Whether that difference matters to enough users is something for us to figure out. It is a big deal for my employer.

  3. I'm excited now that 3.4 is finally out. I wonder if there are any performance changes compared to 3.2.11.

    1. Maybe not for WT but there are a huge number of improvements elsewhere to make sharded replica sets more robust.
