March 11, 2010

Why Digg Digs Cassandra

Digg, the San Francisco-based social media company, is dropping MySQL and instead betting its future on Cassandra, an open-source data store. It’s just the latest sign of the growing popularity of the software, which was developed (and open sourced) by Facebook to power its inbox search. While Facebook has since backed off Cassandra, Digg plans to open source all its work on Cassandra and champion the software’s development and adoption.

In a post on the Digg blog, John Quinn, Digg’s VP of engineering, writes:

Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me who’s been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move.

What’s Wrong with MySQL?

Our primary motivation for moving away from MySQL is the increasing difficulty of building a high performance, write intensive, application on a data set that is growing quickly, with no end in sight. This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead.

Digg is just the latest high-profile convert to the NoSQL world. Instead of using databases such as MySQL, many of the companies that deal in near-real-time information are opting for a new kind of data store, most of them open source, such as Cassandra and CouchDB.

Cassandra is roughly the open-source equivalent of Google’s Bigtable. Facebook built it to solve the problem of inbox search; the company needed something that was fast, reliable and able to handle read and write requests at the same time. Messaging in an environment as heavily used as Facebook requires a system that can not only store data but also return results for search queries at blazing-fast speeds.

Stu Hood, the technical lead for the search team in the Email & Apps division of Rackspace, recently said:

I think that distributed databases solve a problem that a lot of companies with large datasets have had to solve independently in the past…Cassandra has an approach that hybridizes the Bigtable and Dynamo models, where a lot of its competitors chose to take one path or the other. Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure (possible because of the eventually consistent approach). When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single “row” to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values.
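
To make the “wide row” idea concrete, here is a toy in-memory sketch in Python. It is not Cassandra’s API or storage engine; the `WideRowStore` class, its method names and the sample keys are all hypothetical, made up for illustration. It assumes only what Hood describes: a row is a large collection of column/value pairs kept in column order, so a range of columns inside one row can be read efficiently. Range queries across top-level keys are not modeled.

```python
import bisect


class WideRowStore:
    """Toy model of a wide-row store (hypothetical; not Cassandra's API)."""

    def __init__(self):
        # row key -> (sorted list of column names, dict of column name -> value)
        self._rows = {}

    def insert(self, row_key, column, value):
        """Add or overwrite one column in a row, keeping column names sorted."""
        names, values = self._rows.setdefault(row_key, ([], {}))
        if column not in values:
            bisect.insort(names, column)
        values[column] = value

    def slice(self, row_key, start, end):
        """Return (column, value) pairs with start <= column <= end, in order."""
        names, values = self._rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_right(names, end)
        return [(name, values[name]) for name in names[lo:hi]]


# One row per user; each column is a date-stamped event (sample data only).
store = WideRowStore()
store.insert("user:42", "2010-03-09", "story-a")
store.insert("user:42", "2010-03-11", "story-c")
store.insert("user:42", "2010-03-10", "story-b")
store.insert("user:42", "2010-02-27", "story-z")

# Range query inside a single row: everything from March 10 to March 31.
recent = store.slice("user:42", "2010-03-10", "2010-03-31")
print(recent)
```

Because the column names stay sorted, the slice is found with two binary searches rather than a scan of the whole row, which is the property that makes very wide rows practical to query.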

In a post last year, contributing writer Gary Orenstein pointed out that thanks to these attributes, Cassandra has potential applications beyond inbox search that include “recommendation engines, targeted advertising, and content search, particularly when you combine many concurrent inputs and output requests to the same data set.”

Digg is a prototypical application. The company tells me that it gets: