It's been over a month since IBM released version 4.2 of its Hadoop distribution (BigInsights), so I decided to do a quick write-up on the changes and new features in this release.
Traditionally, the BigInsights offering has been arranged into two layers: the ODPi-compatible foundation, IBM Open Platform, and the value-add components on top (Analyst, Data Scientist, and Enterprise Management). The Open Platform foundation is 100% open source and available for free, while the value-add bundles provide proprietary components, which require a licence from IBM.
Version 4.2 greatly simplifies BigInsights' packaging by merging the value-adds into a single bundle. The Analyst, Data Scientist, and Enterprise Management bundles are no longer available, and they have been replaced by a single package called IBM BigInsights, which sits on top of the Open Platform.
As with version 4.1, the Open Platform foundation is 100% open source and completely free of charge, while the BigInsights bundle requires a licence.
Version 4.2 brings multiple component updates (see Table 1); for example, it upgrades Spark to 1.6.1. It also adds some new components to the distribution, most notably Ranger, Phoenix, and Titan. This version also brings official support for SystemML 0.10.0.
Table 1: Enhancements to IBM Open Platform 4.2
|Component|4.1 version|4.2 version|
Big SQL is the flagship product of the BigInsights package and this version brings some key enhancements to this component.
- Spark job invocation from Big SQL – the SYSHADOOP.EXECSPARK function can trigger the execution of a Spark job and use the results in a SQL query. It accepts three arguments – language (e.g. Java, Scala), fully qualified class name, and optional arguments. According to the official documentation, you can nest the call in SQL queries, so you can easily do something like:
    SELECT *
    FROM TABLE(SYSHADOOP.EXECSPARK(
             language => 'scala',
             class    => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
             uri      => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
             card     => 100000)) AS doc,
         products
    WHERE doc.country IS NOT NULL
      AND doc.language = products.language;
- Support for UPDATE/DELETE on HBase tables – traditionally Big SQL did not provide UPDATE/DELETE for operations against an HBase store. These are now fully supported, and together with nested sub-queries, windowing, OLAP aggregation, complex joins, ROLLUP, and grouping functions, provide complete SQL support over data stored in HBase
- Performance improvements – 4.2 introduces many performance features, such as automatic ANALYZE, statistics sampling and extrapolation (instead of full scans), improved partition pruning, concurrency improvements and many more
- Disaster recovery improvements – Big SQL metadata can now be backed up online. Backups and restores can be automated to maintain a DR copy at a remote site. This feature, however, is limited to Big SQL only and does not cover all of the data stored in HDFS. Maintaining a DR copy of the entire Hadoop cluster is something one can handle via Big Replicate, though
- Deeper integration with BLU Acceleration – 4.2 brings integration with IBM BLU Acceleration. This set of technologies substantially speeds up analytic workloads by leveraging techniques like in-memory processing of columnar data, data compression (via Huffman encoding), data skipping, and CPU acceleration (parallel vector processing). Using Big SQL, one can now create BLU tables (on the head node) and join them with data stored in HDFS.
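To make the "data skipping" technique from the last item a bit more concrete, here is a minimal Python sketch of the general idea: keep a small min/max synopsis per block of column values and skip any block that cannot satisfy the predicate. This is only an illustration of the concept, not how BLU Acceleration is actually implemented; the `Block` class, the helper functions, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus the min/max synopsis kept for it."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block when its synopsis proves that no value can match.
        if block.max_value <= threshold:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


# Hypothetical column of order amounts, loaded in roughly ascending order.
amounts = [5, 8, 12, 7, 40, 42, 39, 45, 90, 95, 91, 99]
blocks = build_blocks(amounts, block_size=4)
matches, blocks_read = scan_greater_than(blocks, threshold=80)
print(matches, blocks_read)
```

With this sample data only one of the three blocks has to be read, because the synopsis of the other two already rules them out. The benefit depends on how well the data is clustered: if large values were spread evenly across all blocks, nothing could be skipped.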
Text Analytics Enhancements
- The "Run on Cluster" action is no longer limited to MapReduce and can now execute extractors via Spark. This lets the extractors run much faster on big document sets by taking advantage of Spark features like lazy evaluation and in-memory processing
- Text Analytics also introduces an embedded Annotation Query Language (AQL) editor, which allows you to edit the extractor AQL manually. The editor also shows you the resources used by the extractor, and presents its output from running on the loaded document set in a Results panel
- There are other minor improvements as well, such as project import/export
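For readers unfamiliar with the lazy evaluation mentioned in the first item of this list, the following plain-Python sketch mimics the idea: transformations are only recorded, and no document is touched until a result is actually requested. It is an analogy built on generators, not Spark or Text Analytics code; the `LazyPipeline` class, the `tokenize` helper, and the sample documents are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source: Iterable):
        self._source = source
        self._steps: List[Callable[[Iterator], Iterator]] = []

    def map(self, fn: Callable) -> "LazyPipeline":
        # Only remember the step; nothing is computed here.
        self._steps.append(lambda it: (fn(x) for x in it))
        return self

    def filter(self, pred: Callable) -> "LazyPipeline":
        self._steps.append(lambda it: (x for x in it if pred(x)))
        return self

    def collect(self) -> list:
        # The "action": chain the recorded steps and pull the data through.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls: List[str] = []


def tokenize(doc: str) -> List[str]:
    calls.append(doc)  # track when the work really happens
    return doc.lower().split()


docs = ["IBM released BigInsights", "Spark runs extractors", "Hadoop cluster"]
pipeline = LazyPipeline(docs).map(tokenize).filter(lambda toks: "spark" in toks)
calls_before = len(calls)    # no document has been tokenized yet
result = pipeline.collect()  # the work happens here
print(calls_before, len(calls), result)
```

Because each element streams through all recorded steps in one pass, no intermediate collection is materialized between the map and the filter, which is part of what makes this style attractive for large document sets.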
Big Replicate is a new offering in the IBM big data family, although it is not part of the IBM BigInsights bundle and it requires a separate licence.
Big Replicate provides active-transactional replication technology, which can sync data between multiple Hadoop clusters, enabling Disaster Recovery environments and migrations to and from IBM BigInsights and other Hadoop distributions.
IBM provides a free, non-production version of BigInsights called Quick Start Edition, which is available as Docker images and is limited to five nodes (two management and three data). It is available for download at http://www.ibm.com/analytics/us/en/technology/hadoop/hadoop-trials.html