Small Datum: Aurora for MySQL is coming

Thursday, November 20, 2014

Aurora for MySQL is coming

I am excited about Aurora for MySQL. While there aren't many details, I have two conclusions from the information that is available. First, many talented people did great work on this. Second, many customers want the features it provides and some of these features are otherwise unavailable unless you are web-scale and can afford a team of automation experts. This is a big deal and good for the MySQL community. I am not just writing this to boost my priority on the Aurora preview signup list.

I don't think it matters whether Aurora has better performance. The big story is much better availability and manageability without having to hire an expert MySQL operations team. The limiting factor might be cost but that depends on whether you come from the land where everything is free or from a commercial DBMS. And even in the land of free, the cost for an operations team is large.

Soapbox

Before describing what I learned about it I have a short editorial.

Is Amazon the reason we need AGPL? It isn't clear to me that they improve upstream MySQL. They do benefit the community by making it easier to run MySQL in the cloud. But I don't see them in the community. Which means I also don't see their developers in the community and who wants to disappear while working on mostly open-source?
Their marketing leads with up to 5X faster than MySQL which is translated by the tech press into 5X faster than MySQL. When I signed up for an Aurora preview the response email from Amazon also used the 5X faster than MySQL claim. I prefer up to 5X faster than MySQL.
In the video from AWS re:Invent 2014 James Hamilton makes some interesting claims.

I am not sure he heard about InnoDB based on this statement -- Just doing a state of the art storage engine would have been worth doing. Take Jim Gray's black book on transactions, implement it, I would be happy. Someone can tell him to be happy.
In big print -- 3X write performance, 5X read performance. In small print -- sysbench. It will take time to determine whether similar performance & availability can be had elsewhere at a similar cost.
In big print -- 400X less lag, 2 seconds vs 5 milliseconds. I doubt this. The slide also stated -- synchronous multi-DC replication. I don't see how that is possible except within a datacenter given network latency. Other content from Amazon claims this is async replication. But then Mr. Hamilton stated that everything is there, transactions are not lost were two data centers to instantly disappear. Again, that requires sync replication. This is confusing and I am surprised his claims don't match the other documentation from Amazon.
It can fix torn pages. InnoDB does this for writes in progress during a crash. I assume Aurora can also do this for pages written a long time ago and that would be a big deal.

Information

What is Aurora? I don't know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB. I encourage people to read the details page, FAQ, slides and pricing guide. The response from Clustrix is also useful. At least one answer on Quora is excellent and this overview of Amazon datacenters will convince you that they network needed to make this work with low latency.

The big deal is that the database is replicated 6X using 3 availability zones (AZs). I assume this means it uses 2 storage clusters per AZ. Documentation from Amazon states this is done using async replication and (committed) writes are available within a few milliseconds or 10s of milliseconds. I assume this is only for Aurora read replicas in the same AZ as the master. Network latency will make that impossible for some remote AZs. In the presentation at AWS re:Invent there are claims that the replication is synchronous. That is confusing.

For now I will assume that Aurora does async replication to the two remote AZs and the non-primary storage cluster in the primary AZ -- which means that 5 of the 6 copies are updated via async replication. If this is true, then an outage at the primary storage cluster can mean that recent commits are lost. It would be great if someone were to make this clear. My preferred solution would be sync replication within the primary AZ to maintain copies in 2 storage clusters, and then async replication to the 4 copies in the 2 remote AZs. We will soon have multiple solutions in the MySQL community that can do sync replication within a datacenter and async replication elsewhere -- Galera, upstream and lossless semisync. But Aurora is much easier to deploy than the alternatives. My standard question is what commit rate can be sustained when all updates are to the same row? Sync replication with cross-country network round trips makes that slow.

The presentation also claimed this was mostly drop-in compatible. I am interested in compatibility with InnoDB.

What are the semantics for cursor isolation? PostgreSQL, InnoDB and Oracle all do snapshot isolation with different semantics for writes. PostgreSQL has excellent documentation to describe the behavior.
Does Aurora support clustered indexes?
What is the max size of an index key?
How are large columns supported? Is data always inline?
How is multi-versioning implemented? InnoDB usually does updates in place so there isn't much work for purge to do except for deletes and updates to secondary index columns.
Does this use pessimistic or optimistic concurrency control?
Does this support partitioning?
What is the block size for reads?
Does this use compression?

Features

The brief description of features is very interesting. I will summarize that here. The features sound great and I expect them to get a lot of customers.

Unplanned failover to an Aurora read replica takes a few minutes. Unplanned failover to a new instance can take up to 15 minutes. This happens for customers who aren't spending money on Aurora read replicas. This is another feature that will make Aurora very popular. While it will be nice to make failover faster, the big deal is that they provide this. The usual MySQL deployment required some do-it-yourself effort to get something similar.
Storage uses SSD. I assume this is based on EBS. They do background scrubbing to detect and correct corrupt pages. Storage grows in 10G increments. You don't have to provision for a fixed amount of GB or TB, they will grow as needed up to 64T (or 64T/table). I am not sure I would want 64T in one instance, but it can make it easy to archive data in place. Also automatic growth to such large database sizes will make it much easier for deployments to avoid sharding especially when an instance has many TB of cold data.
There are interesting features for point-in-time recovery, incremental backup and snapshots. This is integrated with S3. Incremental backups make it affordable to archive data in place as you don't do full backups for data that doesn't change. But I don't understand all of their backup options.
Database is replicated 2 times within 3 AZs so there are 6 copies. Up to 2 copies can be lost and writes are still possible. Up to 3 copies can be lost and reads are still possible. I assume that by copies can be lost they mean storage clusters can be lost. Automatic recovery here is another big deal.
The buffer pool survives mysqld process restart. I wonder if that is only true for planned restart. Regardless, this is a very useful feature for IO-bound workloads when it is important to have a warm cache. Percona used to have a patch for this with InnoDB.
Replication is supported in two ways -- via Aurora (at storage level) and MySQL (binlog). My bet is that Aurora replication will be most popular but some people will use MySQL replication to replicas with locally attached storage to save money. They claim much less lag with Aurora and I agree there will be less but I am not sure there will be 400X less. However, they avoid some overhead on commit by not using the binlog and they avoid a lot of complexity by not requiring MySQL replication on the master or slave. I assume that Aurora replication ships deltas rather than page images to be efficient on the network.

Cost

Cost comparisons will be interesting. I am very uncertain about my estimates here as I don't have much experience with prices on AWS or for normal sized datacenter deployments limited to a few servers. I estimate a 3 year cost of $568,814 for a high-end Aurora deployment: largest servers, 1 master, 1 replica, backup and storage. It will be interesting to compare this to non-AWS hardware because you need to account for features that Aurora provides and for the extra availability of the Aurora storage solution. I used 3TB and 10k IOPs from storage because that can be provided easily by a high-end PCIe flash device, but that also assumes a very busy server.

Reserved db.r3.8xlarge (32 vCPUs, 244GiB RAM) for 2 servers is $115,682 over 3 years
50TB of backup at 10 cents/GB/month is $53,100 over 3 years
3TB of storage and 10,000 IOPs per server is $400,032 for 2 servers over 3 years. But I am not sure if the storage cost includes IO done to all AZs to maintain the 6 database copies. In that case, the storage cost might be higher.

20 comments:

Tim CallaghanNovember 20, 2014 at 11:09 AM
Does it support foreign keys? I thought I saw that it wasn't.
ReplyDelete
Replies
Mark CallaghanNovember 20, 2014 at 3:29 PM
What does it mean for Aurora benchmarks to have 100% hit rate for result set cache?
https://media.amazonwebservices.com/blog/2014/os_rds_full_page_mon_2.png
ReplyDelete
Replies
Harish MNovember 21, 2014 at 1:44 AM
Reading the FAQs and watching the video carefully, I get the sense that it's synchronous replication at the storage layer. The key idea with the product is scale-out, network-attached SSD storage (optimized for the DB workload). Now the underlying storage layer is replicated 6x (2x in 3 different AZs). I'm going to guess they use distributed consensus for all writes to the storage layer (you need 4 out of 6 for majority quorum and so you can tolerate 2 failures and still be up for writes).

On top of this storage layer, they probably have multiple EC2 instances running their custom MySQL daemons – one of them is the leader which is accepting new writes and a bunch of read-only slaves (they're calling them Aurora replicas to differentiate them from regular MySQL binlog-based replicas). If you think about it, the main job of the slaves is to keep invalidating dirty pages from their bufpool (if any). They don't actually have to go through the mechanics of committing a txn because the txn has already been committed to the storage node in this replica's AZ. This would explain why they can do low millisecond replication-lag for the Aurora replicas compared to typical 'several seconds' replication-lag for the normal MySQL replicas.

Doing a failover in this model should simply involve electing a new leader among the Aurora slaves and let that slave catch up to all updates from the storage layer and start accepting new writes. And so you get zero data loss and a really quick failover with a warm bufpool.
ReplyDelete
Replies
UnknownNovember 21, 2014 at 1:16 PM
The pricing estimate in your post isn't quite right. 3YR RI pricing for an R3.8xlarge is 20K upfront + 0.96/hour. That's 20K + 0.96 * 8760 hr/yr * 3 yrs = 42,228.80 per box. For your scenario of two boxes, 84,457.60.

We don't charge for backup up to instance storage size, and beyond that, charge at prevailing S3 rates, which starts at 0.03/GB-month. I'm not sure it is that easy to generate 50TB of backup for a 3TB database that does incremental backup (log-structured). You can if you do a lot of user snapshots and rewrite your dataset monthly. Seems extreme.

Storage is shared across primary and replica. Remember that you're paying for the storage and IOs you actually use, rather than what is peak provisioned. For most customers, that is 5x lower on storage for used vs provisioned and 10-20x on avg IOs vs peaks IOPS.

Anurag@AWS
ReplyDelete
Replies
UnknownNovember 21, 2014 at 1:45 PM
I think some of the confusion we've caused is in the difference in replication to Replicas vs replication of storage. Harrish's note is largely right and I've added on where necessary.

Beyond that, the storage subsystem is new/custom. not based on EBS. Replication is on a 10GB chunk basis, so failures have a reduced blast radius relative to replication failure of an entire cluster. Buffer pool survives unplanned restart.

Feel free to reach out to me directly if you have further questions (awgupta@amazon.com). Anurag@AWS
ReplyDelete
Replies

Add comment