Facebook is a big user of Hadoop, but its needs go beyond what the basic software provides

Nov 9, 2012 13:50 GMT

Facebook is so big that it runs largely on custom software and hardware. It's not the first company to do so, and it has a history of open-sourcing its infrastructure software. The latest such tool is Corona, a new way of managing Hadoop jobs.

Hadoop is yet another open-source technology that Facebook has contributed to, though most of the initial work came from Yahoo, another big user of Hadoop.

The Apache Software Foundation now manages the project, and several commercial companies develop and sell Hadoop-based software and tools to manage it.

Hadoop is an open-source implementation of Google's MapReduce programming model. It is designed to let large clusters of computers run many jobs simultaneously over huge amounts of data, and to do it fast.
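To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It is not Hadoop code and uses none of Hadoop's APIs; the function names are illustrative assumptions. It only shows the three steps the model relies on: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce calls would be spread across many machines, which is where scheduling comes in.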

It's the type of workload large websites deal with most often. While MapReduce was a breakthrough, it's also nearly a decade old. Hadoop is continually evolving, but so are the needs and the scale of the companies that use it, especially when you're Facebook.

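Corona aims to make Hadoop job scheduling more efficient by separating cluster resource management from job tracking, two duties that a single job tracker handles in standard Hadoop MapReduce. In Corona, a cluster manager keeps track of the machines and their free resources, while each job gets its own job tracker.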
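The split can be pictured with a small Python sketch. This is a toy model under stated assumptions, not Corona's actual code or API (Corona itself is written in Java); the class names `ClusterManager` and `JobTracker` and all their methods are hypothetical. The point is only that one object tracks free slots while each job has its own tracker asking for them.

```python
class ClusterManager:
    """Tracks free slots per node only; knows nothing about job progress."""

    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)

    def request(self, count):
        """Grant up to `count` slots, returning the node name for each grant."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted

    def release(self, node):
        """Return one slot on `node` to the free pool."""
        self.free[node] += 1


class JobTracker:
    """One instance per job: owns that job's tasks and asks for slots."""

    def __init__(self, name, tasks, cluster):
        self.name = name
        self.pending = list(tasks)
        self.cluster = cluster
        self.running = {}

    def schedule(self):
        """Request a slot for each pending task and start what was granted."""
        for node in self.cluster.request(len(self.pending)):
            task = self.pending.pop(0)
            self.running[task] = node
        return dict(self.running)

    def finish(self, task):
        """Mark a task done and hand its slot back immediately."""
        self.cluster.release(self.running.pop(task))
```

With two trackers sharing one manager, a slot freed by the first job can be picked up by the second on its next `schedule()` call, without either tracker knowing about the other's tasks.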

The technical details mostly interest the few people who run these systems. The end results, though, benefit every Facebook user and, since Corona is open source, potentially many other companies and people.

For example, with Corona, map and reduce slots were refilled 17 percent sooner, meaning that they spent less time idle. This also meant better use of resources: in a simulation, Corona reached 95 percent cluster utilization, while a standard Hadoop MapReduce scheduling system maxed out at 70 percent.

Other metrics improved as well. Overall, Facebook is fairly satisfied with Corona, though it is still working on improving and expanding it.