Digg switches to Cassandra to replace MySQL

Mar 13, 2010 11:37 GMT

After announcing in September 2009 that they would switch from MySQL to a BigTable-style NoSQL system, Digg's engineers rewrote most of the site's code to use Cassandra as the main data store. Cassandra was chosen over other open-source NoSQL solutions, including CouchDB and MongoDB.

John Quinn, VP of Engineering at Digg, said on the company's blog, “We were inspired by Google and Amazon's broad use of their non-relational BigTable and Dynamo systems. We evaluated all the usual open source NoSQL suspects. After considerable debate, we decided to go with Cassandra.”

After six months spent rewriting most of Digg's source code, running detailed tests and adapting the entire server architecture, the company's tech department has started rolling out Cassandra-powered features on the main site.

During the testing period, a tool called Transcribe was developed to migrate content from MySQL to Cassandra. It is expected to be released as open source once Digg finishes its migration.

MySQL will not be phased out completely during this process. It will still be used for Digg's small-scale projects because of its flexibility and its suitability for rapid prototyping.

Besides deploying Cassandra on its own platform, Digg has assigned a full-time committer to the Cassandra project at the Apache Software Foundation, who will merge some of the changes and new features tested on Digg's website into the project's core.

These changes include massive performance improvements, increased comparator speed, better compaction threading, reduced logging overhead, row-level caching, multi-get capability, native atomic counters using ZooKeeper, improved rack-awareness, slow query logging, improved bulk import functionality and new Scribe support for improved logging.

Initially developed at Facebook by a former Amazon engineer and Facebook employee, the database became a top-level Apache project in February 2010 and is now used by other large companies across the web, such as Twitter, Cisco, Rackspace, Reddit and IBM.