Monday, June 8, 2015

RocksDB & ForestDB via the ForestDB benchmark, part 1

ForestDB was announced with some fanfare last year, including the claim that it was faster than RocksDB. I take comparisons to RocksDB as a compliment, and then get back to work. This year I met the author of the ForestDB algorithm. He is smart, and there are other talented people on the ForestDB effort. It would be great to see a MongoDB engine for ForestDB.

ForestDB

I used the ForestDB benchmark to evaluate RocksDB and ForestDB. Results from the tests will be shared in future posts. Here I describe my experience running the tests. I forked the ForestDB benchmark as I made general and RocksDB-specific improvements to it.

ForestDB is tree-structured for the index and log-structured for the data. Everything is copy-on-write, so there is one database file and no separate write-ahead log. This is kind of like Bitcask, except Bitcask is hash-structured for the index and its index isn't persistent. The write-amplification vs space-amplification tradeoff is configurable via the threshold parameter, which sets the percentage of dead data allowed in the database file before the compaction thread copies the live data into a new file. For a uniform distribution of writes to keys, space-amp is 2 and write-amp is 2 with threshold=50%. Reducing threshold to 25% means that space-amp is 4/3 and write-amp is 4. With leveled compaction in RocksDB we usually get higher write-amplification and lower space-amplification compared to ForestDB.
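Here is a minimal sketch of the model behind those numbers, assuming uniform updates and a compaction that copies all live data into a new file once the dead fraction reaches the threshold. This is my simplification, not ForestDB code.

    #include <cstdio>

    // Rough model: with a dead-data threshold t, the file grows to about
    // 1/(1-t) times the live data before compaction runs (space-amp), and
    // each compaction rewrites all live data, so write-amp is about 1/t.
    int main() {
      for (double t : {0.50, 0.25}) {
        double space_amp = 1.0 / (1.0 - t);
        double write_amp = 1.0 / t;
        printf("threshold=%.0f%%: space-amp=%.2f write-amp=%.2f\n",
               t * 100, space_amp, write_amp);
      }
      return 0;
    }

Running it prints 2.00/2.00 for threshold=50% and 1.33/4.00 for threshold=25%, matching the numbers above.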

The ForestDB database can be sharded into many files per process. I am not sure if transactions can span files. There is only one compaction thread regardless of the number of files and during my tests it was easy for compaction to fall behind. Single-threaded compaction might be stalled by file reads, decompression after file reads and compression before file writes. I think that for every key read during compaction, the index must be checked to confirm the key is live, so this can be another reason for compaction to fall behind. I assume work can be done to make this better.

ForestDB benchmark client

I forked the ForestDB benchmark client. When I get more time I will try to push changes upstream. Here I describe changes I made to it using the git commit log:
  • e30129 - link with tcmalloc to get better performance and change the library link order
  • 1c4ecb - accept benchmark config filename on the command line
  • 3b3a3c - add the population:compaction_wait option to determine whether to wait for IO to settle after the load. Their benchmark results include some discussion of whether the test should wait for IO to settle after doing a load. While I don't think the test should wait for all IO to stop, I think it is important to penalize an engine that incurs too much IO debt during the load phase, because that debt makes the queries that follow much slower. This is still work-in-progress for RocksDB, although setting the hard_rate_limit and soft_rate_limit options makes this much less of an issue.
  • aac5ba - add population:load option to determine whether the population (load) step should be done. This replaces a command line flag.
  • aa467b - adds the function couchstore_optimize_for_load to let an engine optimize for load performance; the RocksDB version uses the vector memtable during the load and the skiplist memtable otherwise. This also adds the function couchstore_set_wal to enable/disable redo logging. Finally, this adds a better configuration in all cases for RocksDB (a sketch that combines these settings appears after this list). I know that it isn't easy to configure RocksDB. The changes include:
    • set max_bytes_for_level_base to 512M. It is 10M by default. This is the size of the L1.
    • set target_file_size_base to 32M. It is 2M by default. This is the size of the files used for levels L1 and larger.
    • enable level_compaction_dynamic_level_bytes. This is a big improvement to the algorithm for leveled compaction. We need a blog post to explain it. 
    • set stats_dump_period_sec to 60. The default is 600 and I want more frequent stats.
    • set block_size to 8kb. The default is 4kb before compression and I don't want to do ~2kb disk reads. Nor do I want a block that holds only four 1kb docs, as that makes the block index too large.
    • set format_version to 2. The default is 0 and the RocksDB block cache can waste memory when this is 0. This might deserve a blog post from the RocksDB team.
    • set max_open_files to 50000. The default is 5000 and I want to cache all database files when possible.
  • 3934e4 - set hard_rate_limit to 3.0 and soft_rate_limit to 2.5. The default is to not use rate limits. When set, these limit how much IO debt a level can incur; user writes are throttled when the debt is too large. Debt in this case means a level has too much data, where the target size of L1 is max_bytes_for_level_base. Setting soft_rate_limit to 2.5 means that writes are delayed when a level reaches 2.5X its target size. Setting hard_rate_limit to 3.0 means that writes are stopped when a level reaches 3.0X its target size. We have work-in-progress to make throttling much better, as the current behavior can lead to intermittent stalls (bad p99) for write-heavy workloads. This also disables compression for levels 0 and 1 in the LSM to reduce the chance of compaction stalls.
  • a6f8d8 - this reduces the duration for which a spin lock is held in the benchmark client. I was getting odd hangs while running valgrind and this change fixed the problem. AFAIK the change is correct too.
  • cfce06 - use a different RNG seed per thread. Prior to this change all threads start with the same seed and can then generate the same sequence of key values. This can be really bad when trying to test an IO-bound database because it inflates the block cache hit rate. The LevelDB benchmark client has a different version of this problem: it uses a different seed per thread, but the seed per thread is constant, so if you restart the benchmark client then thread N generates the same sequence of keys that it generated on the last run. I fixed this in RocksDB (see the --seed option for db_bench).
  • 5a4de1 - add a script to generate benchmark configuration files for different workload patterns. See the next section for a description of the patterns.
  • 2b85fa - disable RocksDB WAL when operation:write_type=async is used. I reverted this part of the diff later. This also sets level0_slowdown_writes_trigger to 12 and level0_stop_writes_trigger to 16. The defaults were larger.
  • eae3fa - adds a helper script to run the pattern of tests. This is described in Benchmark pattern.
  • 871e63, e9909a - ForestDB wasn't ever doing fsync/fdatasync during the load phase. This fixes that. With these changes write-amplification is much larger for ForestDB when periodic_commit is enabled and write_type=sync. These changes provide the following behavior for existing config options:
    • population:periodic_commit - for ForestDB, when set there is a commit per insert batch and when not set there is a commit at the end of the load. This has not changed. For RocksDB there is always one write batch per insert batch, but the WAL is used only when this is set. This does not control whether fsync/fdatasync are used. When periodic_commit is set, a load might be restartable if it fails in the middle. When not set, a failure means the load must start over.
    • operation:write_type - when set to sync then fsync/fdatasync are done per commit. This is a change for ForestDB but not for RocksDB.
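To put the RocksDB changes from aa467b, 3934e4 and 2b85fa in one place, here is a minimal sketch of the configuration as I understand it. This is my reconstruction rather than the benchmark client's code, it uses option names from RocksDB releases of this era (soft_rate_limit and hard_rate_limit were later deprecated), and the use of Snappy for the compressed levels is an assumption.

    #include <rocksdb/filter_policy.h>
    #include <rocksdb/memtablerep.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    // My reconstruction of the settings described above, not the benchmark code.
    rocksdb::Options MakeOptions(bool optimize_for_load) {
      rocksdb::Options options;

      options.max_bytes_for_level_base = 512L << 20;  // size of L1, default is 10M
      options.target_file_size_base = 32L << 20;      // SST size for L1+, default is 2M
      options.level_compaction_dynamic_level_bytes = true;
      options.stats_dump_period_sec = 60;             // default is 600
      options.max_open_files = 50000;                 // default is 5000

      // Throttle user writes when a level has too much data (IO debt).
      options.soft_rate_limit = 2.5;
      options.hard_rate_limit = 3.0;
      options.level0_slowdown_writes_trigger = 12;
      options.level0_stop_writes_trigger = 16;

      // No compression for L0 and L1 to reduce the chance of compaction stalls.
      // Snappy for the remaining levels is my assumption.
      options.compression_per_level.assign(options.num_levels,
                                           rocksdb::kSnappyCompression);
      options.compression_per_level[0] = rocksdb::kNoCompression;
      options.compression_per_level[1] = rocksdb::kNoCompression;

      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_size = 8 * 1024;  // before compression, default is 4kb
      table_options.format_version = 2;     // the default, 0, can waste block cache memory
      table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));

      // couchstore_optimize_for_load: vector memtable for the load, skiplist otherwise.
      if (optimize_for_load) {
        options.memtable_factory.reset(new rocksdb::VectorRepFactory());
      }
      return options;
    }

    // How I map the config options to per-commit behavior, as a sketch: the WAL
    // is used only when population:periodic_commit is set, and sync requests
    // fsync/fdatasync per commit when operation:write_type=sync (sync only
    // applies when the WAL is enabled).
    rocksdb::WriteOptions MakeWriteOptions(bool periodic_commit, bool write_type_sync) {
      rocksdb::WriteOptions wo;
      wo.disableWAL = !periodic_commit;
      wo.sync = periodic_commit && write_type_sync;
      return wo;
    }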
Benchmark pattern

The script rall.sh runs tests using a pattern that is interesting to me. The pattern is listed below. The first step loads N documents and takes a variable amount of time. The steps that follow run for a fixed amount of time. All of the tests used a uniform RNG to generate keys. The script also uses "numactl --interleave=all" because I was using dual-socket servers, in many cases with a lot of RAM. It also collects iostat and vmstat output to help explain performance differences.
  1. Load - this is the populate step. The PK value is hashed from an increasing counter (from 1 to N) so the load isn't in PK order. The test is configured to not wait for IO to settle when the load ends and if the engine has too much IO debt then the tests that follow can suffer.
  2. Overwrite-sync-1 - uses one writer thread to update docs. It does an fsync on commit.
  3. Overwrite-sync-N - like Overwrite-sync-1 but uses N writer threads.
  4. Point-query-1-with-writer - uses 1 reader thread that does point queries and 1 writer thread that updates docs. There is a rate limit for the writer. Read performance with some writes in progress is more realistic than without writes in progress especially for write-optimized database engines like RocksDB. More info on that topic is here.
  5. Point-query-N-with-writer - like Point-query-1-with-writer but uses N reader threads.
  6. Range-query-1-with-writer - like Point-query-1-with-writer but does range queries.
  7. Range-query-N-with-writer - like Range-query-1-with-writer but uses N reader threads.
  8. Point-query-1 - like Point-query-1-with-writer, but without a writer thread.
  9. Point-query-N - like Point-query-N-with-writer, but without a writer thread.
  10. Range-query-1 - like Range-query-1-with-writer, but without a writer thread.
  11. Range-query-N - like Range-query-N-with-writer, but without a writer thread.
  12. Overwrite-async-1 - like Overwrite-sync-1 but does not do fsync-on-commit. This was moved to the end of the test sequence because it frequently made compaction get too far behind in ForestDB.
  13. Overwrite-async-N - like Overwrite-sync-N but does not do fsync-on-commit. Also moved to the end like Overwrite-async-1.
Benchmark configuration

My test scripts use a template for the benchmark configuration that is specialized for each run. The gen_config.sh script has a few test-specific changes especially for the load test. Changes that I made from one of the published configurations include:
  • set db_config:bloom_bits_per_key to 10 to use bloom filters with RocksDB
  • set threads:seed_per_thread to 1 to use different RNG seeds per thread
  • set body_length:compressibility=50
  • use operation:batch_distribution=uniform. Eventually I will try others.
  • use operation:batchsize_distribution=uniform with batchsize 1 (read_batchsize_*_bound) for point queries and a configurable batchsize (iterate_batchsize_*_bound) for range queries. In the Benchmark debugging section I will explain why I do this.
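For reference, this is roughly the fragment that gen_config.sh specializes, written with the section:option names listed above. It is a sketch with only the options mentioned in this post; the real configuration file has many more settings and I am assuming its INI-style layout.

    [db_config]
    bloom_bits_per_key = 10

    [threads]
    seed_per_thread = 1

    [body_length]
    compressibility = 50

    [operation]
    batch_distribution = uniform
    batchsize_distribution = uniform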
Benchmark debugging

I started with an IO-bound configuration and data on disk and ForestDB was getting about 1.5X more QPS than RocksDB for a read-only point-query workload. I had two servers with identical hardware and this result was reproduced on both servers. So it was time to debug. The test used one reader thread, so this result wasn't ruined by using the same RNG seed for all threads.

I started by confirming that the ratio of disk reads per query was similar for RocksDB and ForestDB. Each did about 1.5 queries per disk read, courtesy of cached data. However, the average disk read latency reported by iostat was about 1.5X larger for RocksDB. Why was the disk faster for ForestDB?

Next I used strace to see the system calls that read blocks. I learned that data page reads were 4kb and aligned for ForestDB, while they were usually a bit less than 4kb and not aligned for RocksDB. Both use buffered IO. From iostat the average read size was 4kb for ForestDB and about 8kb for RocksDB. Many of the unaligned, almost-4kb reads crossed an alignment boundary, so each one required an 8kb read from disk. This could be a performance problem on a fast SSD, but there is not much difference between 4kb and 8kb reads from a disk array and I re-confirmed that via tests with fio. I also used blktrace to get more details about the IO pattern and learned that my servers were using the CFQ IO scheduler. I switched to deadline.

At this point I was stuck. Somehow ForestDB was getting faster reads from disk, or the read pattern wasn't as random as I expected. So I read more of the benchmark client and saw that point reads were done in a batch where the key for the first document per batch was selected at random (call it N) and the keys for the rest of the batch were N+1, N+2, N+3, etc. This gives ForestDB an advantage when only one database file is used, because with M files the reads in one batch might use different files. The load and query code share a function that converts an integer to a key, so N here is an integer and the key includes a hash of the integer. This means that the key for N=3 is not adjacent to the key for N=4. However, the data file for ForestDB is log-structured, and immediately after the load the data for N=3 is adjacent to the data for N=4 unless either document was updated after the load.
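Here is a sketch of that read pattern. The names and the hash are hypothetical; the real client's key and batch functions differ in detail.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // The client maps an increasing counter to a key via a hash, so key order
    // is not counter order. Any mixing constant works for this sketch.
    static std::string MakeKey(uint64_t n) {
      uint64_t h = n * 0x9E3779B97F4A7C15ULL;
      char buf[32];
      snprintf(buf, sizeof(buf), "%016llx", (unsigned long long) h);
      return std::string(buf);
    }

    // A point-query batch picks a random start n, then reads n, n+1, n+2, ...
    // The keys are scattered in the index, but because the load inserted the
    // documents in counter order and ForestDB appends document bodies to a
    // log-structured file, the data for n and n+1 is adjacent on disk right
    // after the load. With batchsize > 1 those reads hit adjacent regions of
    // the file and benefit from caching below the filesystem.
    void PointQueryBatch(uint64_t n, int batchsize) {
      for (int i = 0; i < batchsize; ++i) {
        std::string key = MakeKey(n + i);
        printf("get key %s for counter %llu\n", key.c_str(),
               (unsigned long long)(n + i));
        // db->Get(key) would go here.
      }
    }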

At last I understood the cause. ForestDB was benefiting from fetching data that was adjacent in the file. The benefit came from caching either in the HW RAID device or disks. I changed the test configuration to use batchsize=1 for point queries and then ForestDB and RocksDB began to experience the same average read latency from the disk array.
