The SQL on Hadoop landscape: An overview (Part II)

In Part I we looked at Apache Hive, Cloudera Impala, and IBM Big SQL. We now continue with Pivotal HAWQ, Apache Drill, Spark SQL, Presto, and Hive on Tez.

HAWQ

HAdoop With Query (HAWQ) is Pivotal's entry into the SQL on Hadoop market. HAWQ is an extension of the Pivotal Greenplum database that stores data directly in Hadoop – similar in approach to how IBM built Big SQL on the DB2 engine.

HAWQ uses a massively parallel processing (MPP) architecture and runs its own services, called Segments, to access data in HDFS directly. Alongside Big SQL, it is one of only two SQL on Hadoop solutions to provide ANSI SQL compliance.

HAWQ Architecture

HAWQ also provides out-of-the-box high availability capabilities – you can configure a standby server, which is kept up to date by replicating the write-ahead log from the primary master. You can also build a so-called "dual cluster", which keeps a synchronized copy of the complete database [1].

Another interesting capability of HAWQ is its integration with MADlib, another incubating Apache project, which provides machine learning capabilities directly in SQL [2].
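As a rough sketch of what the MADlib integration looks like in practice, the query below trains a linear regression model entirely in SQL. The table and column names are hypothetical; `madlib.linregr_train` is part of the Apache MADlib library.

```sql
-- Train a linear regression model on a hypothetical table of house sales.
SELECT madlib.linregr_train(
    'houses',                    -- source table (hypothetical)
    'houses_model',              -- output table for the fitted model
    'price',                     -- dependent variable
    'ARRAY[1, sqft, bedrooms]'   -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit.
SELECT coef, r2 FROM houses_model;
```

The point is that the whole machine learning workflow – training, inspecting, and later scoring – stays inside the database, expressed as ordinary SQL over tables.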

In September 2015, Pivotal donated the HAWQ core to the Apache Software Foundation and HAWQ is now an officially incubating project [3].

HAWQ strengths

  • ANSI SQL compliance
  • Embedded high availability
  • Open source, freely available

Limitations of HAWQ

  • No Hive metastore (HCatalog) integration
  • Still incubating, not a mature Apache project
  • Scaling out HAWQ appears to be a complex process involving table redistribution operations. The HAWQ documentation makes it clear that performance degradation is to be expected until the process completes [4]

Apache Drill

Apache Drill is a relatively new player in the SQL on Hadoop space. Its general availability (version 1.0) was announced in May 2015, but it's been incubating since September 2012 [5]. Drill is the open source equivalent of Google Dremel – an interactive ad hoc analysis system for read-only data sets, with MapR being the primary contributor and promoter of Drill.

Drill uses the Hive metastore to query structured data and provides an interesting feature for automatically discovering the schema of self-describing data (e.g. JSON and Parquet files).
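To illustrate the schema discovery feature, Drill can query a raw JSON file directly, with no `CREATE TABLE` statement or metastore registration. The file path and field names below are hypothetical:

```sql
-- Drill infers the schema of the JSON file on the fly,
-- including nested structures such as t.address.city.
SELECT t.name, t.address.city
FROM dfs.`/data/customers.json` t
WHERE t.age > 30;
```

The `dfs` prefix refers to Drill's file system storage plugin; swapping the plugin name lets the same query pattern target HDFS, S3, or a local disk.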

Drill Architecture

Drill does not rely on MapReduce, Tez, or Spark – it uses its own query execution services called Drillbits. Zookeeper is used to locate Drillbits and for cluster membership management.

Drill strengths

  • ANSI SQL:2003 syntax and user-defined function (UDF) support
  • Access via JDBC/ODBC
  • There is an easy to use interface to visually explore data (Drill Explorer) [6]
  • Hive metastore integration
  • Open source

Limitations of Drill

  • Not fully ANSI SQL compliant yet
  • The only supported language for UDFs is Java
  • DML support is somewhat limited

Presto

Presto is another open source project that provides SQL on top of Hadoop. Although Apache-licensed, Presto is not an Apache Software Foundation project and its roadmap is mostly controlled by Teradata. The key contributors to Presto are Facebook, Teradata, and Netflix.

Presto Architecture

Presto uses an MPP DBMS architecture similar to Big SQL, but its engine is written entirely in Java. Presto is also known to scale extremely well, with users such as Facebook running it against a dataset of 300 PB. It also employs a flexible storage abstraction, allowing it to process data inside or outside of Hadoop (HDFS, HBase, Cassandra, MySQL, JSON, PostgreSQL, and others [7]).

Presto supports mechanisms for querying and combining data from external sources, providing functionality similar to the Big SQL federation mechanism.
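As a sketch of this federation capability, a single Presto query can join tables from different connectors by prefixing each table with its catalog name. The catalogs, schemas, and tables below are hypothetical:

```sql
-- Join Hive-resident clickstream data with user records living in MySQL.
-- Note: lacking a cost-based optimizer, Presto expects the larger table
-- on the left-hand side of the join.
SELECT u.country, count(*) AS clicks
FROM hive.web.clicks c
JOIN mysql.crm.users u ON c.user_id = u.id
GROUP BY u.country;
```

The `catalog.schema.table` addressing is what makes the storage abstraction flexible – the query engine is the same regardless of where each table physically lives.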

Presto strengths

  • Flexible storage abstraction
  • Open source
  • Commercial support is available if needed
  • Doesn't rely on MapReduce

Limitations of Presto

  • Not part of any major Hadoop distribution (e.g. Cloudera, Hortonworks, BigInsights etc.)
  • Cloudera claim that Impala is 5.3 to 7.5 times faster than Presto [8]
  • Not fully ANSI SQL compliant (e.g. no support for non-equijoins or scalar subqueries)
  • Large joins might fail - all data must fit in memory and Presto doesn't support spilling to disk
  • No cost based optimizer – users must manually specify join order
  • No complex security system – no grant/revoke privileges for objects
  • No standard ODBC driver, but on the roadmap for 2016 [9]

Spark SQL

Technically not a standalone project but a module that extends Spark, Spark SQL provides the DataFrame abstraction on top of Spark and can act as a SQL query engine. The project was announced in March 2014 and positioned as the successor to Shark [10].

Spark SQL supports all existing Hive data formats, user-defined functions, and integrates with the Hive metastore.
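For example, Spark SQL can expose a raw Parquet file as a temporary view and query it alongside tables registered in the Hive metastore. The path and table names below are hypothetical:

```sql
-- Register a Parquet file as a view without touching the Hive metastore...
CREATE TEMPORARY VIEW events
USING parquet
OPTIONS (path '/data/events.parquet');

-- ...and join it against an existing Hive table in the same query.
SELECT e.event_type, count(*) AS occurrences
FROM events e
JOIN users u ON e.user_id = u.id
GROUP BY e.event_type;
```

The same SQL can also be embedded in a Spark program via `spark.sql(...)`, which is the primary use case the project was designed for.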

Spark SQL is not considered a general purpose SQL engine – it is primarily designed to let Spark developers incorporate SQL statements in Spark programs [11]. If the primary use case is fetching data for machine learning tasks, Spark SQL is probably the best choice out there. If the primary use case is offloading warehouse data into a Hadoop archive, Spark SQL might not be the best choice, as it lacks essential features (e.g. security, auditing, and concurrent scalability).

Spark SQL strengths

  • Open source
  • Supports wide range of formats (Parquet, Avro, JSON, ORC)
  • Integrates with the Hive metastore
  • Provides compatibility with other Spark libraries (MLlib, GraphX)

Limitations of Spark SQL

  • Spark SQL runs on top of Spark and gets its performance from in-memory caching. As data sets grow, you also have to expand the available cluster memory
  • Next to no security and lack of other common features (e.g. fine-grained access control, auditing etc.)

Hive on Tez

One could say that Hive on Tez is an effort somewhat similar to Spark SQL. Spark is a multi-purpose in-memory engine, while Tez is a framework that other projects like Hive and Pig can embed [12]. Both Spark and Tez provide in-memory execution and base their logic flow on directed acyclic graphs (DAGs). Tez, however, has a strong dependency on YARN, while Spark (and Spark SQL respectively) doesn't depend on it and can also work in standalone mode.
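Because Tez plugs into Hive as an execution engine, switching a Hive session from MapReduce to Tez is a single configuration setting. A minimal sketch (the `sales` table is hypothetical):

```sql
-- Switch the current Hive session from MapReduce to Tez.
SET hive.execution.engine=tez;

-- Subsequent queries are planned as a single Tez DAG
-- instead of a chain of separate MapReduce jobs.
SELECT year, count(*) AS orders
FROM sales
GROUP BY year;
```

Avoiding the intermediate HDFS writes between chained MapReduce jobs is where most of the speedup discussed below comes from.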

A benchmark conducted by Hortonworks shows Hive on Tez being over 100 times faster than Hive on MapReduce [13]. The benchmark is based on the standard TPC-DS tests; however, it appears not to have covered the entire query set – the test was done with only 50 of the 99 TPC-DS queries, which raises questions about its objectivity.

Cloudera have also released their own benchmarks showing Impala being faster than both Hive on Tez and Spark SQL [14].

There is also a competing Hive on Spark project [15], which may make Hive on Tez obsolete, given the broader adoption of Spark in the community (Spark has over a hundred contributors, including Databricks, IBM, Intel, Yahoo, Cloudera and many others, while contributions to Tez come mostly from Hortonworks).

Hive on Tez strengths

  • Much faster than Hive on MapReduce
  • Open source
  • Commercial support available from Hortonworks
  • Hortonworks claim more than 90 customers running Hive on Tez in production [16]

Limitations of Hive on Tez

  • Not designed for OLTP
  • No ANSI-standard SQL support

Summary

There is another way of classifying SQL on Hadoop engines: by how well they integrate with and fit into the Hadoop stack. Using this criterion, we can group the solutions into three categories – pure query engines, RDBMS on Hadoop, and remote query submission solutions.

SQL Engine Classification by integration with Hadoop

Using this classification, the clear winners are Cloudera Impala, IBM Big SQL, and Hortonworks Hive (on Tez). They all integrate seamlessly with the Hive metastore and provide great compatibility with other Hadoop services and third party applications (via standard JDBC/ODBC drivers).

There is, of course, the question of ANSI SQL compatibility, where Big SQL and HAWQ lead.
Still, selecting the right engine requires evaluating multiple variables – how important is ANSI SQL compatibility? Is Hive metastore integration needed? Does the engine come with the Hadoop distribution of choice? Do we have the right skills to use it? Is it free? Is it open source? Is commercial support available? How does it integrate with third party tools? Does it provide SQL federation?

The bottom line is that we definitely can't apply the "one tool for every job" approach, but we should also keep in mind that SQL on Hadoop is a very dynamic market – most of the projects featured in this post did not exist five years ago. There are also constant shifts of opinion from vendors. For example, Cloudera justified their decision to develop Impala by saying that "Hive is the wrong architecture for real-time distributed SQL" [17]. Then, barely 20 months down the road, they launched a Hive on Spark project [18]. Whatever happens from here on, some projects will probably be abandoned and some will gain important improvements that resolve key limitations.

I am planning a subsequent review of the SQL on Hadoop landscape in perhaps another 24 months, to discuss how the market has evolved by then.

References

[1] Overview of HAWQ High Availability, Pivotal HAWQ 1.3.1 Documentation, http://hawq.docs.pivotal.io/docs-hawq/docs-hawq-shared/admin_guide/highavail/topics/g-overview-of-high-availability-in-greenplum-database.html
[2] MADlib Moves to ASF, http://madlib.incubator.apache.org/
[3] Introducing The Newly Redesigned Apache HAWQ, Lei Chang, https://blog.pivotal.io/big-data-pivotal/products/introducing-the-newly-redesigned-apache-hawq
[4] Expanding the HAWQ System, HAWQ Administration, http://pivotalhd-210.docs.pivotal.io/doc/2100/webhelp/hawq-topics/ExpandingtheHAWQSystem.html
[5] Industry's First Schema-free SQL Engine - Apache Drill 1.0 is Now Generally Available, MAPR Blogs, https://www.mapr.com/blog/industrys-first-schema-free-sql-engine-apache-drill-10-now-generally-available
[6] Using Drill Explorer, Apache Drill Documentation, https://drill.apache.org/docs/using-drill-explorer/
[7] Connectors, Presto Documentation, https://prestodb.io/docs/current/connector.html
[8] New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap, Cloudera Engineering Blog, http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/
[9] Teradata will support Presto, DBMS2, http://www.dbms2.com/2015/06/08/teradata-will-support-presto/
[10] Shark, Spark SQL, Hive on Spark, and the future of SQL on Spark, Reynold Xin, Databricks Blog, https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
[11] Apache Drill: Frequently Asked Questions, https://drill.apache.org/faq/
[12] Cloud is designed to fail... so fail, Jim Fagan, Tech Exchange, 22 December 2015, http://www.zdnet.com/article/salesforce-acquiring-quote-to-cash-app-startup-steelbrick/
[13] Benchmarking Apache Hive 13 for Enterprise Hadoop, Carter Shanklin, 2 June 2014, http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/
[14] New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap, Justin Erickson, Marcel Kornacker, Dileep Kumar, and David Rorke, 22 September 2014, http://blog.cloudera.com/blog/2014/09/new-benchmarks-for-sql-on-hadoop-impala-1-4-widens-the-performance-gap/
[15] Hive on Spark: Getting Started, Szehon Ho, https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
[16] Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostafa Mokhtar, Strata + Hadoop World, 2015, http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
[17] Impala v Hive, Mike Olson, December 2013, http://vision.cloudera.com/impala-v-hive
[18] Hive on Spark, https://issues.apache.org/jira/browse/HIVE-7292