IBM fully adopts Apache Spark for its cloud infrastructure

Jun 15, 2015 13:19 GMT

Sometimes open source projects pay off big time, and IBM's decision to invest in Apache Spark gives the people behind the project a way to secure a future built on their work on the Spark engine.

IBM is effectively adopting Apache Spark, something the project's open source license lets it do without asking the Apache Software Foundation for permission, and is putting more than 3,500 IBM researchers at the disposal of the Spark team.

This large number of hands, ready to improve the engine, fix bugs, test it in production environments, and submit new features, puts IBM in a commanding position within an open source project that seems the perfect counterpart to Hadoop, another Apache project for working with big data workloads.

While Hadoop is focused on providing a stable way of processing extremely large data sets, Apache Spark is focused on speed, allowing developers to go through large data collections in near real time.

The software is essential for the "Internet of Things" that's looming on the horizon, since it can churn through big blocks of information without missing a beat.

The IBM investment amounts to a few million dollars a year

Besides donating staff to help keep the software updated and bug-free, IBM also plans to put Spark to the test in its data analysis and e-commerce software, while also integrating it with the Watson AI, and even offering Spark as a service through its Bluemix cloud offering.

A technology center will also be built in San Francisco, where IBM plans to train students, scientists, and engineers to work with Spark and help establish it as the standard for working with real-time data, just as Hadoop is regarded as the standard for working with very large data sets.

Since Spark is well suited to machine learning applications, IBM will also be donating its SystemML machine learning technology, which will be gradually integrated into the Spark open source ecosystem.

The Spark success story

If you're not familiar with Spark, the project was started in 2009 by Romanian-Canadian scientist Matei Zaharia at the University of California, Berkeley, and after a few very well-received iterations, it was put in the care of the Apache Software Foundation.

Here, more Hadoop-friendly features were added, and the project evolved into a speed demon, currently running up to 100 times faster than Hadoop MapReduce when working in memory, and up to 10 times faster when working on disk.

Another major factor in IBM's recent announcement is that the company was one of the four founding members of AMPLab (Algorithms, Machines and People Lab) at the University of California, Berkeley, which gave it an inside track on many Spark features and its full capabilities long before anyone else.

Spark's future looks very bright right now, because the other open source technologies IBM has decided to invest in and promote in the past include everyday dev tools like Linux, Eclipse, and OpenJDK.