Sunday, October 11, 2015

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy).

MongoDB is ahead of MySQL in features that make scale-out easier to manage and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-shard joins, per-shard consistent read.


Many times we describe algorithms when users need to understand behavior and then the user gets lost in the details. It is important for developers to understand when transactions can be lost and I will describe that with answers to two questions.
  1. After the changes from a commit are visible to a concurrent session under what conditions can that commit be lost?
  2. After a client has been informed that a commit has succeeded under what conditions can that commit be lost?


I list four combinations of behavior to consider and the features in MySQL and MongoDB that provide them:


MySQL uses one solution (lossless semisync) to protect against both loss of visible and acknowledged commits. With lossless semisync row locks are held on the master until at least one slave acknowledges the commit. This can reduce commit throughput as there is a network round trip between master and replica(s) before commit is finished and row locks are released. There is at one network round trip between commits to the same row. This overhead is reduced by moving the replica close to the master. In the solution described by Yoshinori the binlog archive provides the ack rather than using extra replicas in every data center and because the binlog archive doesn't require a lot of hardware it is easier to move it closer to the master.

In MongoDB total protection comes from two solutions -- majority read concern and write concern. The benefit is that majority write concern doesn't make commit slower assuming a workload has sufficient concurrency. It will add latency to each client doing a commit just like MySQL semisync. A different feature, majority read concern, protects against loss of a visible commit. However, there is the risk that a client that needs read-your-own write semantics will have to wait. At this point it isn't clear to me that MongoDB makes it easy to read your own writes. I wonder if applications that care more about performance will use majority read concern without using majority write concern. That isn't an option with MySQL.

It will take time to figure out the implications of the performance differences. With MySQL delays are added to writers as it takes them longer to get the through semi-sync commit protocol. With MongoDB delays might be added to readers as they wait for the majority read snapshot to be advanced.

Durable on a slave?

It can be important to understand how durable a change is on a replica when the replica acknowledges a transaction to the master. There are several options and I have not read enough recent MySQL or MongoDB documentation to determine whether there are options beyond durable in memory:
  1. durable in memory - the commit is buffered in memory on a replica before it acks. Many years ago when we implemented semi-sync for MySQL this was the only choice. I tend to think that durable in memory is an oxymoron, but I am a pessimist.
  2. durable in a log - the commit is durable in a log file on a replica before it acks. There has been talk that MySQL would 
  3. committed on the replica - the commit is applied on a replica before it acks. That guarantees read-your-writes semantics when that replica is queried soon after committing a change on a master. Alas this is also likely to create performance lag unless the replica uses many threads to apply changes concurrently, just like on the master. It also creates a window where a commit is visible on a replica before the master.

More editorial

MongoDB documentation has tended to be optimistic about the features provided by the software. I think this will be resolved as the community grows. There have been some interesting discoveries. Hopefully the gap between documented and actual behavior will be reduced over time.

The mmap engine releases the per-database or per-instance write lock before syncing the oplog even when durable writes are requested.  This is now described as read uncommitted, but read non-durable might be a better name because reads are still consistent but you can see changes from others before those changes are durable in the oplog. I wrote about this when reading code and the docs have been updated since then but I think their docs need more edits. This is only a problem for the mmap engine and multiple engines in MongoDB means they need to be clear about behavior for mmap versus WiredTiger.

There were too strong claims about the semantics of the majority write concern. It protects against the loss of an acknowledged commit but some docs suggested it protected against the loss of a visible commit. Aphyr, an expert in distributed systems testing, highlighted this problem in his Call Me Maybe series and a bug report. I wrote about part of the problem prior to that but I did not connect the problem with the too-strong claims in the documentation. Many years ago MySQL made a similar mistake when documenting semi-sync replication and fixed their docs after I filed a bug.

Documentation claimed that 2-phase commit was used to keep config servers in sync. That makes it more likely that commit is all-or-nothing for the 3 or 5 config servers hosting the same data. Alas it can lead to read-only mode when a server goes away. I read the code and the two phases were 1) ping all config servers and then if all responded 2) send the change to all config servers. If all servers did not respond with OK then manual intervention was required. This isn't 2 phase commit. Fortunately, something much better will be done for the 3.2 release and the docs have been updated.

Finally, read the excellent series of posts from Tokutek on replica set failover including the overview and posts one, two, three and four. There have been problems that haven't been widely known. Fortunately the 3.2 release of MongoDB should make things better.

Wednesday, October 7, 2015

Problems not worth fixing

I worked on a closed-source DBMS years ago and the development cycle was 1) code for 6 months 2) debug for 18 months. Part 2 was longer than part 1 because the customers have high expectations for quality and the product delivers on that. But it was also longer because some co-workers might not have been code complete after 6 months and were quietly extending part 1 into part 2.

I fixed many bugs during part 2. One that I remember was in the optimizer. Someone added code to detect and remove duplicate expressions (A and A and B --> A and B). Alas the algorithm was O(N*N). Analytics queries can have complex WHERE clauses and all queries were forced to pay the O(N*N) cost whether or not they had any duplicate expressions. 

The overhead for this feature was reasonable after I fixed the code but it raises an interesting question. Are some problems not worth the cost of prevention? Several times a year I see feature requests that make me think of this.

Forgot to add the bug that inspired this post. In bug 30342 there was a change in the MySQL 5.0 optimizer that added predicates which could increase the overhead for query optimization when more predicates means more calls to records_in_range.

Monday, October 5, 2015

MyRocks versus allocators: glibc, tcmalloc, jemalloc

I used a host with Fedora 19 and Linkbench to determine the ability of memory allocators to avoid fragmentation with MyRocks (MySQL+RocksDB). RocksDB can create pressure on an allocator because allocations for the block cache can have a short lifetime. Memory is allocated on page read and released when a page is evicted from the cache. The allocation is the size of the decompressed page and sometimes the size of the compressed page. RocksDB puts more pressure on the allocator than InnoDB because with InnoDB the buffer pool allocation is done once at process start. After the Linkbench load finished I ran the query test for 24 hours and report the values of VSZ and RSS from ps at the end. From the results below jemalloc & tcmalloc do much better at avoiding fragmentation (value of RSS) compared to glibc 2.1.7. I am still adjusting to larger values of VSZ with jemalloc. It isn't a problem, but it takes time to accept.

VSZ       RSS       allocator
22246784  10062100  jemalloc
13600032  10545284  tcmalloc
26136152  20313268  glibc

Test details:
  • Linkbench with maxid=100M, loaders=10, requesters=20
  • MyRocks with 8G block cache for RocksDB
  • libc is glibc 2.17, tcmalloc from gperftools 2.1, jemalloc 3.6.0

Friday, October 2, 2015

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, which is a MySQL storage engine with RocksDB. This is an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that despite the problems I have reported, transparent page compression works. I was just testing the wrong systems.

So I asked core developers from Btrfs whether it was OK to do punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing punch hole per write but it is not ready for production. Someone from MySQL suggested I try ext4 so I setup two servers with Ubuntu 14.04 which is on the list of supported systems. I used XFS on one and ext4 on the other.

XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous I mean 50X or 100X longer. Maybe I am the only one using InnoDB file-per-table and measuring time to drop a table or database. Bad things happen when DROP takes 10+ minutes --InnoDB is unusable by other connections and an InnoDB background thread might kill mysqld because it thinks there is an internal deadlock.

Did I mention this was good news for MyRocks? If you want compression then we get 2X better compression compared to compressed InnoDB. We also made sure that DROP TABLE and DROP INDEX are fast.

I updated bug 78277 for this problem. I am the only person updating that bug. I also found a corruption bug with transparent page compression, bug 78672. Past experience with transparent page compression is described here, here and here.

The end to my testing day was perfect. I rebooted the host to get back memory from XFS metadata allocations and the filesystem came back corrupt. Being a pessimist I was using a scratch filesystem for this, so I didn't lose my Ubuntu install.