April 14, 2007

Alex Payne is a Rails developer working on Twitter, and in an interview he raised a couple of criticisms of Rails scaling based on what they’re now seeing (I’m going to quote extensively to avoid the appearance of cherry picking):

By various metrics Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues — issues that any growing site eventually contends with — far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there’s no facility in Rails to talk to more than one database at a time. The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it’s not just cost, it’s time, and time is that much more precious when people can[’t] reach your site.

The specific issues that they’re running into seem to be database specific: they’re pounding their database with queries and that strain is beginning to show. I was hoping for more information about their setup from one of the Twitter developers. However, DHH has stated that they’ve been in touch with him regarding their scaling issues, and he provided this additional information from those conversations:

At that time they were fielding spikes of up to 11,000 requests per second across some 16 cores with very little caching thrown into the mix to mitigate. No wonder their site had been feeling slow.

It sounded like they had a good plan at the time, though. Roll in a rack of new servers, look into doing substantial caching, and move beyond a single database server. The normal road to sanity employed by most any web application experiencing rapid growth.

…Alex mentions that scaling the application by adding more Mongrels and servers eventually puts a greater strain on the database. That’s absolutely correct, but also the intended consequence, not an unexpected side-effect. There should have been no surprises there.

I don’t care what framework you’re using. Once you’ve saturated your database with queries, there are only a few things that you can do to address it:

  1. add (more) extensive caching in your application so that you reduce the number of queries hitting the database. With a Rails app this is the most immediate solution, given how easy caching is to do in Rails. DHH’s comment that their app had only minimal caching indicates that there’s room for a good (and nearly immediate) performance increase here;
  2. move your database to a beefier box;
  3. add clustering and/or replication to your database. This is the one grievance of Alex’s that has any merit, and since the interview was published a couple of days ago, headway has been made so that multiple (read/write and read-only) connections can be established and used by the Rails framework.
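
To make the first option concrete, here is a minimal sketch of a read-through cache in plain Ruby. It is not Rails’ caching API and not anything Twitter runs; `FakeDatabase` and `ReadThroughCache` are hypothetical names made up for illustration, and the in-memory hash stands in for whatever cache store you would really use. The only point is to show how repeated reads stop reaching the database:

```ruby
# Hypothetical stand-in for a database: it counts the queries that reach it.
class FakeDatabase
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  def find_user(id)
    @query_count += 1
    @rows.fetch(id)
  end
end

# Hypothetical read-through cache: answer from memory when possible,
# otherwise query the database once and remember the result.
class ReadThroughCache
  def initialize(database)
    @database = database
    @store = {}
  end

  def find_user(id)
    @store.fetch(id) { @store[id] = @database.find_user(id) }
  end

  # Call this whenever the underlying row changes so stale data is dropped.
  def expire(id)
    @store.delete(id)
  end
end

db = FakeDatabase.new({ 1 => "alex", 2 => "david" })
cache = ReadThroughCache.new(db)

1000.times { cache.find_user(1) }
puts "queries that reached the database: #{db.query_count}"
```

A thousand reads of the same row cost the database a single query here. The hard part in a real application is knowing when to call something like `expire`, which is why caching is not quite a quick fix.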

Alex follows with:

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.

I can’t get my head wrapped around the assertion that, “Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.” Why would you do this? When I initially read this I thought that he was talking about moving portions of the app out to another language so that they could specify multiple database connections. But that doesn’t follow from the assertion that you need to “strip out all the costly neat stuff that Rails does for you.” So he is talking about application performance, in spite of the fact that he’s demonstrated no reason that horizontal scaling of the application servers won’t continue to work for them (in fact, in his answer to the next question he talks about how easy it is to have Joyent roll out new slices to handle exactly this). I’m not alone in reading it this way; David comments that, “If your bottleneck has moved to the database, you probably won’t see big results by replacing pretty constructs with ugly ones. In other words, if a database query is taking 0.5 seconds, improving a loop from 0.05 to 0.01 seconds is not worth bothering with at this point.”
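
David’s arithmetic is easy to check. The sketch below uses only the figures from his example (a 0.5 second query, a loop improved from 0.05 to 0.01 seconds) and assumes, for simplicity, that a request is nothing more than the query plus the loop. The helper names are mine, not anything from Rails:

```ruby
# Figures from David's example; the "request = query + loop" model is an assumption.
QUERY_SECONDS = 0.5

def request_seconds(loop_seconds)
  QUERY_SECONDS + loop_seconds
end

def percent_saved(before, after)
  (before - after) / before * 100.0
end

before = request_seconds(0.05) # pretty but slow loop
after  = request_seconds(0.01) # ugly but fast loop

puts format("before: %.2fs, after: %.2fs, saved: %.1f%%",
            before, after, percent_saved(before, after))
```

Making the loop five times faster trims the request from about 0.55 to about 0.51 seconds, a saving of roughly 7%. The query still accounts for nearly all of the time.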

Can someone from Twitter please expand on why this would be necessary? Given that Alex has said the performance issues they’re seeing are at the database, it makes no sense.

It’s also worth mentioning that there shouldn’t be doubt in anybody’s mind at this point that Ruby itself is slow. It’s great that people are hard at work on faster implementations of the language, but right now, it’s tough. If you’re looking to deploy a big web application and you’re language-agnostic, realize that the same operation in Ruby will take less time in Python. All of us working on Twitter are big Ruby fans, but I think it’s worth being frank that this isn’t one of those relativistic language issues. Ruby is slow.

Good grief. While Ruby is comparatively slow (I’ve not seen anyone arguing that it’s not) and while there are multiple efforts to get this addressed, it also has nothing to do with the issue that Twitter is seeing. A faster Ruby would mean that they would need fewer application servers, not that they’d see any decrease in the number of queries to the database.
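
A back-of-the-envelope model shows why. The 11,000 requests per second is the peak DHH quoted above; everything else in this sketch is an assumption picked purely for illustration: 3 queries per request, 20 ms of Ruby time per request (halved to 10 ms for a hypothetical faster Ruby), and application processes that each handle one request at a time:

```ruby
# Hypothetical capacity model; only the 11,000 req/s peak comes from the post.
def app_processes_needed(requests_per_second, app_ms_per_request)
  # One process serving one request at a time handles
  # 1000 / app_ms_per_request requests per second.
  (requests_per_second * app_ms_per_request / 1000.0).ceil
end

def database_queries_per_second(requests_per_second, queries_per_request)
  requests_per_second * queries_per_request
end

peak_rps = 11_000          # peak quoted by DHH
queries_per_request = 3    # assumption

[20, 10].each do |app_ms|  # assumed Ruby time per request, then a 2x faster Ruby
  processes = app_processes_needed(peak_rps, app_ms)
  db_qps = database_queries_per_second(peak_rps, queries_per_request)
  puts "#{app_ms} ms in Ruby: #{processes} app processes, #{db_qps} queries/sec"
end
```

Under these assumptions, halving the time spent in Ruby halves the number of application processes (220 down to 110), while the database still sees the same 33,000 queries per second.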

Sorry, but in my opinion these are (all but one) crap complaints raised in a pretty shitty interview. Both Ryan and Rafe have weighed in with observations of their own that are worth reading.

Updates: if this type of stuff blows your hair back, then I highly recommend Cal Henderson’s, “Building Scalable Web Sites.”

Tim Bray has weighed in:

In the big picture, Twitter did exactly the right thing. They had a good idea and they buckled down and focused on delivering something as cool as possible as fast as possible, and it’s really hard, in early 2007, to beat Rails for that. When all of a sudden there were a few tens of thousands of people using it, then they went to work on the scaling.

Tim did a much better job at making lemonade out of lemons than I did. I was too focused on trying to figure out which specific, legitimate issues they’d encountered, because these things do blow my hair back.

It’s 10.30 PM on Saturday night and I’ve just received a very polite and informative email from Alex. This post is getting a little long, so tomorrow I’ll write up a new post with the information that Alex has provided. That post is up now.