High Availability with MongoDB for Fun and Profit

07 May 2012

I gave a talk on Friday at MongoSF on how to run MongoDB for high availability. The slides are posted on slideshare, but without speaker’s notes they’re probably not all that useful. Sorry about that. But to try to make it up to you, I’m going to go through the talk’s contents in the remainder of this post. As well, the video from the talk is now (highly) available on the 10gen website.

For a company like Stripe, availability is incredibly important. If we are unable to accept an API request, that causes downtime not just for us but also for our customers. Since the database is the limiting factor on availability for many applications (if it’s down, you can’t serve any requests), we looked for a database that maximizes availability.

In an ideal world, your highly-available database would be a collection of indistinguishable nodes. You could add or remove nodes from the pool without any service interruption, and nodes could fail or become partitioned from one another without downtime. Unfortunately, theoretical results such as the CAP theorem show that such a database can’t exist, so any real distributed database has to make tradeoffs.

MongoDB has one of the cleanest availability models I’ve seen in a commercially-available database. MongoDB nodes are clustered into replica sets. Each node in a replica set contains a full copy of the data. At any given time, there is at most one primary, which is the only node capable of accepting writes. If that node fails or is asked to step down, an election occurs and a new primary is selected. Any node can accept reads, though reads from secondaries are not guaranteed to be consistent. There is no single point of failure, and no manual administration is needed to promote a secondary. Determination of the current cluster configuration is handled by a smart client driver; the application writer only needs to provide the driver with a set of seed nodes to establish an initial connection.
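
To make that concrete, here’s roughly what the connection setup looks like with a recent version of PyMongo; the hostnames, replica set name, and collection are placeholders, not our actual configuration:

```python
from pymongo import MongoClient

# Hand the driver a few seed nodes; it discovers the rest of the replica set
# and keeps track of which member is currently primary.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017"
    "/?replicaSet=rs0"
)

db = client.mydb
db.payments.insert_one({"amount": 1000, "currency": "usd"})  # routed to the primary
```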

The one case that you as an application writer need to worry about is cluster reconfiguration. When the set begins reconfiguring, the servers close all open connections, and any Mongo operations will fail until the reconfiguration has completed. Your driver will then bubble this up as an error to the application level. (In general, the driver doesn’t have enough information to know what to do in this case, so it bails. If, for example, an error is returned on a write, the client can’t tell whether that write was accepted or not.) So to stay robust across cluster reconfigurations, you need to write code to handle these exceptions.

At Stripe, we make all of our writes idempotent. We then wrap all Mongo operations in retry logic – if the operation throws an exception, we simply retry it until it succeeds (with an appropriate backoff and timeout). Note that your concurrency model needs to be considered when evaluating whether a write is truly idempotent – make sure a conflicting write can’t sneak in between tries.
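
Here’s a minimal sketch of what such a retry wrapper can look like with PyMongo; the names and backoff policy are illustrative, not our production code:

```python
import time

from pymongo.errors import AutoReconnect, ConnectionFailure

def with_retries(operation, timeout=60, backoff=0.5):
    """Retry an idempotent Mongo operation until it succeeds or we give up.

    `operation` must be safe to run more than once: if the first attempt's
    write actually landed before the connection dropped, running it again
    must not change the outcome.
    """
    deadline = time.time() + timeout
    while True:
        try:
            return operation()
        except (AutoReconnect, ConnectionFailure):
            if time.time() >= deadline:
                raise
            time.sleep(backoff)
            backoff = min(backoff * 2, 5)  # exponential backoff, capped

# Example: an upsert keyed on _id, so a retry can't create a duplicate.
# (db, event_id, and event_doc are assumed to be defined elsewhere.)
with_retries(lambda: db.events.replace_one(
    {"_id": event_id}, event_doc, upsert=True))
```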

With this retry code, our application can then handle arbitrary failures or routine reconfigurations of our Mongo cluster. I come from a MySQL world where I couldn’t even reboot my master server for routine maintenance. Being able to withstand the catastrophic failure of our database primary without any service interruption was basically a dream come true.

That’s really all the code you need to write for high availability with MongoDB. However, you also need proper administration and deployment techniques to ensure you don’t take yourself down. You’re most vulnerable during replica set reconfigurations, so you should minimize how often you perform one. While replica set reconfigurations should be safe, it’s possible to accidentally end up in a state where no node is eligible to become primary. As well, if there’s been a lot of flapping, we’ve seen primary election take a minute or more, causing pending requests to time out. Always test your administration procedures in your staging environment to make sure they’re safe. One gotcha we’ve hit is that a recently-booted Mongo node isn’t eligible to become primary unless all nodes in the cluster are up. In our staging environment, we had moved fairly slowly while testing a MongoDB version upgrade. In production, we moved much faster, and when we tried to fail over to a recently-upgraded node, suddenly we had no eligible primary node.
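
One habit that helps is inspecting the set before you act. Here’s a rough sketch of sanity-checking member state before a planned failover; the field names come from the replSetGetStatus output:

```python
# Every member should be healthy and either PRIMARY or SECONDARY before
# we deliberately fail over; anything still in STARTUP2 or RECOVERING
# isn't a viable election candidate.
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    ok = (member.get("health", 1) == 1
          and member["stateStr"] in ("PRIMARY", "SECONDARY"))
    print(member["name"], member["stateStr"], "ok" if ok else "NOT READY")
```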

One operation we’ll commonly perform is a rebuild of all machines in our database cluster. We used to rebuild one node at a time, adding it to the pool, waiting for a full sync, and then removing the corresponding old node. This required a decent number of replica set reconfigurations, and it was pretty miserable to perform. These days, we instead just add all the new nodes to the pool, allow them to sync, fail over the primary, and then remove the old ones.
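
For illustration, here’s roughly what adding all the new nodes in one reconfiguration might look like from PyMongo, run against the primary; the hosts and _id values are placeholders, and in practice you may also want to set priority or votes on the new members:

```python
# Fetch the current replica set config, append the new members, and submit
# a single reconfiguration. The new nodes begin their initial sync once added.
config = client.local.system.replset.find_one()
next_id = max(m["_id"] for m in config["members"]) + 1
for i, host in enumerate(["10.0.0.11:27017", "10.0.0.12:27017", "10.0.0.13:27017"]):
    config["members"].append({"_id": next_id + i, "host": host})
config["version"] += 1
client.admin.command("replSetReconfig", config)
```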

If you build your new node via an initial sync rather than restoring from a snapshot, you should make sure to force that node to sync from a secondary. Otherwise, you’ll end up with your primary reading through all of its data, evicting hot data from cache, which can cause severe performance issues. In the current version of MongoDB, however, there’s no supported way to force a sync from a particular server (this feature was recently added to the development version). If you read the source, though, you’ll see that the new node selects the node with the lowest ping time to sync from. So we just drop in an iptables rule blackholing traffic from the new node to the primary, wait for it to select its sync target, and then remove the iptables rule. This technique is obviously a giant hack, and I’m looking forward to just using replSetSyncFrom instead :).
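
Once replSetSyncFrom is available in a release, the same effect should be a single command run against the new node; a sketch with placeholder addresses:

```python
from pymongo import MongoClient

# Tell the newly-added node to perform its initial sync from a specific
# secondary instead of whichever member it would otherwise pick.
new_node = MongoClient("10.0.0.11", 27017)
new_node.admin.command("replSetSyncFrom", "10.0.0.2:27017")
```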

One nonobvious point when configuring your replica sets is how to address the machines. Mongo can use either IP addresses or hostnames. (Fun fact: hostnames are dynamically resolved and connections are frequently reestablished, so if you munge /etc/hosts, you can easily repoint a particular hostname.) When rebuilding a cluster, we like to reuse hostnames, e.g. creating a new server db1 to replace the old db1. But the old and new nodes have to exist in the replica set at the same time. So we used to add the new node with a temporary hostname, fail over the primary, remove the old nodes, munge /etc/hosts to point to the new nodes, and then reconfigure with the correct hostnames. But as much fun as it is to munge /etc/hosts in the middle of database administration, it’s also very easy to make a mistake. These days, we just configure our replica sets with IP addresses; the machines can then have whatever hostnames they want.

Perhaps the most common reason people go down is routine maintenance. With MongoDB, it’s not too hard to never go down for maintenance. Since Mongo doesn’t enforce any particular schema, a common technique is to tag objects with a schema version number. You can then lazily migrate objects as you read them, or with a batch process that migrates them in the background (you just need to make sure your app understands all active schema versions). One note: I recommend against multi-updates for large collections (i.e. don’t run a batch db.update(…)).
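
Here’s a sketch of that lazy-migration pattern; the field name, version numbers, and migration steps are made up for illustration:

```python
CURRENT_SCHEMA_VERSION = 3

def load_customer(db, customer_id):
    """Read a document and lazily upgrade it to the current schema version."""
    doc = db.customers.find_one({"_id": customer_id})
    if doc is None:
        return None

    version = doc.get("schema_version", 1)
    if version < 2:
        # v1 -> v2: split a combined name field (hypothetical migration)
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
    if version < 3:
        # v2 -> v3: store balances in cents instead of dollars (hypothetical)
        doc["balance"] = int(doc.get("balance", 0) * 100)

    if version != CURRENT_SCHEMA_VERSION:
        doc["schema_version"] = CURRENT_SCHEMA_VERSION
        db.customers.replace_one({"_id": customer_id}, doc)
    return doc
```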

Why not, you ask? You see, Mongo has a global write lock. Long-running operations, such as a multi-update, yield the lock many times a second. However, if you are running at sufficient scale, they still hold the lock long enough that contention becomes a problem and queries start stacking up. As well, a multi-update will cause all the migrated records to be read, evicting hot data and further slowing down queries. So instead, at Stripe we perform all migrations at the application level. We iterate over all relevant objects, performing a single-record update for each. This lets us slow down the pace and ensure that production traffic isn’t impacted.
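
And a sketch of the corresponding application-level migration loop; the collection, filter, pause, and migrate_doc helper are all illustrative:

```python
import time

def migrate_collection(db, pause=0.05):
    """Walk every document that still has an old schema version, migrating
    one record at a time and pausing between writes so production traffic
    isn't starved. CURRENT_SCHEMA_VERSION is defined as above."""
    query = {"$or": [
        {"schema_version": {"$lt": CURRENT_SCHEMA_VERSION}},
        {"schema_version": {"$exists": False}},  # documents written before versioning
    ]}
    cursor = db.customers.find(query, no_cursor_timeout=True)
    try:
        for doc in cursor:
            migrated = migrate_doc(doc)  # hypothetical per-document migration
            migrated["schema_version"] = CURRENT_SCHEMA_VERSION
            db.customers.replace_one({"_id": doc["_id"]}, migrated)
            time.sleep(pause)  # throttle: one small write, then breathe
    finally:
        cursor.close()
```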

Note that Mongo supports background index builds. Use these. You do have to be careful because, as long-running operations, they can still have production impact. However, we’ve found that they tend to be gentler than multi-updates, and we haven’t really seen issues from them.
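
For example, with PyMongo (the collection and field names are placeholders):

```python
# Build the index in the background so the operation doesn't block
# reads and writes while it runs.
db.payments.create_index(
    [("customer_id", 1), ("created", -1)],
    background=True,
)
```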

In general, you should always be very thorough with your MongoDB usage and deployment. It turns out distributed systems are complicated. Don’t assume that your application can actually handle a particular failure case (e.g. what happens if my current primary suddenly becomes wedged but is still accepting TCP connections) – make sure you’ve actually tested it.

As well, you should be cognizant that MongoDB is under rapid development, which is great because it means new features are being added all the time. However, it also means that API changes sometimes happen, and regressions are sometimes introduced. Always wait at least a few weeks before upgrading to the latest version of MongoDB. And before you do, go read the issues opened against that version, and ideally all of the issues resolved between your version and the newest one. It’s not very much fun, but it’s the best way to know what changes are in store for you.

I’ll close with a final point. Mongo sometimes gets a bad rap from people who use it incorrectly. E.g. people don’t put their driver into safe mode and then are surprised when they start losing writes. Make sure you’ve read the documentation, understand what the w and j parameters are, and have tuned them appropriately for your application. As well, pay attention to your system configuration (such as your ulimit on open file descriptors) to make sure you don’t run into surprises down the road.
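
As a sketch with a recent PyMongo (the values here are illustrative; choose w and j based on your own durability and latency requirements):

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0")

# Wait for acknowledgement from a majority of the set (w) and for the write
# to hit the journal (j) before insert_one returns.
payments = client.mydb.get_collection(
    "payments",
    write_concern=WriteConcern(w="majority", j=True),
)
payments.insert_one({"amount": 1000, "currency": "usd"})
```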

MongoDB provides a rich set of primitives for high availability. Through a little bit of code and a lot of proper administration technique, you can achieve essentially 100% uptime on your database, even in the presence of catastrophic machine failure. I’d be curious to hear how people have achieved this with other database systems – let me know if you have techniques you’d like to share.