Open Data Platform – Why does it matter

"Can we put it in Hadoop?" is a question I get asked more and more often as customers start to recognize "big data" as something that might be missing from their data strategy. They often think of Hadoop as something like an on-line archive or a staging area, which they can use to augment their existing DWH environment.

There is, however, another pattern beginning to emerge – vendors are starting to develop commercial software "on top" of various Hadoop projects, much as they develop products that require an underlying RDBMS. And herein lies the problem. There is only a handful of RDBMS vendors – Oracle, Microsoft, IBM – and although difficult, it is not impossible to test and certify your software against each of the major databases. In the case of Hadoop this strategy simply doesn't work, because Hadoop is not a single product but an umbrella term covering many different projects. Different distributions use different sets of components, package them in different ways, and deploy them to different locations. According to the Apache Software Foundation, there are currently close to 30 vendors providing their own distributions and commercial support.

The Open Data Platform (ODP) is an industry initiative that aims to resolve this interoperability issue by standardizing a set of core Hadoop components – the so-called Open Data Platform Core. The platform is not a product but rather a set of guidelines on which Hadoop components to include in a distribution, how to package them together, and what management framework to provide for monitoring and administering the different services. ODP also includes a set of open source tests which, to my understanding, the individual distributions have to pass in order to be compliant with the initiative.
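To make the idea of a compliance test more concrete, here is a minimal, hypothetical sketch in Java of the kind of check such a suite might contain – verifying that a distribution ships a Hadoop core at or above some expected baseline. The baseline version and the check itself are my own illustration; they are not taken from the actual ODP test suite.

```java
import org.apache.hadoop.util.VersionInfo;

/**
 * Hypothetical sketch of a compliance-style check: verify that the
 * distribution ships a Hadoop core at (or above) an expected baseline.
 * The baseline below is an assumption for illustration only; the real
 * ODP test suite is far more extensive than a version probe.
 */
public class CoreVersionCheck {

    // Assumed baseline; not an actual ODP requirement.
    private static final String EXPECTED_BASELINE = "2.6";

    public static void main(String[] args) {
        String version = VersionInfo.getVersion();   // e.g. "2.6.0"
        System.out.println("Hadoop version : " + version);
        System.out.println("Built from rev : " + VersionInfo.getRevision());

        // Naive lexical comparison – fine for a sketch, not for production.
        if (version.compareTo(EXPECTED_BASELINE) < 0) {
            System.err.println("Core is older than baseline " + EXPECTED_BASELINE);
            System.exit(1);
        }
        System.out.println("Core meets baseline " + EXPECTED_BASELINE);
    }
}
```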

It is also my understanding that once a vendor develops a value-add solution that runs on one ODP-compliant distribution, that solution should theoretically work on all other ODP-compliant distributions (see the sketch below). Customers can then choose whether they want to run it on top of Hortonworks, IBM Open Platform, Pivotal HD, and so on.

Value-add on top of an Open Data Platform-compliant distribution
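To illustrate why this portability is plausible, here is a minimal sketch of value-add code written purely against the standard Hadoop FileSystem API. The class name and the default path are made up for the example; nothing in it is distribution-specific.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of "portable" value-add code: it talks only to the
 * standard Hadoop FileSystem API, so in principle the same jar runs
 * unchanged on any ODP-compliant cluster.
 */
public class PortableListing {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // so pointing the code at a different cluster is a
        // configuration change, not a code change.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path(args.length > 0 ? args[0] : "/");
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.printf("%s\t%d bytes%n",
                    status.getPath(), status.getLen());
        }
    }
}
```

The point is the dependency surface: as long as the code touches only the standardized core APIs, swapping the underlying distribution becomes a deployment decision rather than a porting exercise.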

For the time being the Open Data Platform initiative is supported by big players such as IBM, Hortonworks, EMC, VMware, Teradata, SAS, Pivotal, and others. There is also strong opposition to the initiative coming from vendors that ship Hadoop distributions mixed with proprietary components. For example, MapR replaces the HDFS component with its own closed-source file system, MapR-FS. Cloudera doesn't rely on Apache Ambari (the open source project for provisioning, managing, and monitoring Apache Hadoop clusters) but instead ships the proprietary Cloudera Manager. Both MapR and Cloudera naturally oppose a standard distribution that includes open source replacements for their bread-and-butter components, and both have been quite vocal about it. According to Roman Shaposhnik (ex-Cloudera), the creator of Apache Bigtop, he had to find a new home for his project for similar reasons.

It is my personal opinion that we need some kind of standardization for the myriad of Hadoop distributions in order to maintain interoperability (think POSIX for Hadoop), and for the time being I do support ODP. Having said that, ODP is not a proper standard (yet), and I would very much like to see it evolve into an IEEE-backed reference architecture.

I am planning to write a quick tutorial in the next couple of days on deploying IBM Open Platform (which IBM distributes for free) so that anyone interested can have a go and play with an ODP-based distribution.

Disclaimer: This weblog does not represent the thoughts, intentions, plans or strategies of my employer (IBM). All information featured here is solely my opinion.