The SQL on Hadoop landscape: An overview (Part I)

We can attribute the first attempt at getting SQL functionality on top of Hadoop to Apache Hive. Initially developed at Facebook, this open source project’s main goal according to one of its creators was to “bring the power of Hadoop to more and more people and not just developers” [1].

The next player in this space arrived in 2012, when Cloudera released Impala, and this opened the floodgates. When thinking about a SQL engine on top of Hadoop we can now pick from Hive, Impala, Drill, Spark SQL, HAWQ, Big SQL, Presto, Shark and others. A substantial number of vendors, traditionally working in the data warehousing space, now offer their own implementations to bring SQL to Hadoop. But why do we need to turn Hadoop into a SQL-queryable system? Wasn't it supposed to be an alternative to the classical SQL-centric relational databases? In this article I'd like to talk a bit about the motivation for having a SQL interface to Hadoop and also give a brief overview of the key players' offerings.

The SQL on Hadoop motivation

There are in fact several key drivers behind the SQL on Hadoop necessity:

  • Data Warehouse Modernization – We are far beyond the RDBMS vs Hadoop debate, and all the big RDBMS vendors consider Hadoop as something that doesn't replace but enhances the traditional data warehouse. IBM speak about leveraging big data technologies to enhance the capabilities of a traditional data warehouse [2]. Microsoft promote the idea of having a seamlessly integrated, side-by-side Hadoop installation for scaling out non-relational data [3]. Oracle go even further by suggesting a unified framework for their relational database, Hadoop, and Oracle NoSQL [4].

    There are three main scenarios (in order of increasing complexity) for data warehouse modernization:

    • Online archive – offloading infrequently accessed data from the data warehouse into Hadoop is a very common use case. It helps reduce costs, as it decreases the number of RDBMS licences in the environment, and it is a more efficient way of storing cold data, since the per-GB price in Hadoop is much lower [5]. Another benefit is speeding up the data warehouse, as it has less data to deal with (faster table scans, etc.).
    • Staging area – another very common use case is to use a Hadoop cluster for temporary storage where data can be pre-processed, filtered, and transformed before loading into the data warehouse.
    • Complex analytics – the route the more advanced Hadoop users take. They employ data scientists and work with the data predominantly in Hadoop, running complex workloads like machine learning and often combining core Hadoop with Spark and Spark-based projects such as Spark SQL and SparkR.
  • Existing skills – The strong presence of relational databases in private companies, government entities, and even on mobile devices (e.g. the android.database.sqlite package) has had a tremendous impact on the landscape of programming languages. A recent analysis [6] of programming job listings ranked SQL as the most in-demand skill on the IT job market in 2015.
    Most In-Demand Programming Jobs according to Coding Dojo [6]

    This makes SQL not only the lingua franca for databases but also a skill that is widely available across IT departments. Reusing SQL to access an entirely new data platform (Hadoop) has value not only for developers but also for business analysts and other users who have picked up SQL through specific business tools (e.g. business intelligence and analytics).

  • Existing tools – Finding a company that is a complete greenfield, free from relational databases, is quite unusual. As with existing skills, most companies have already spent a certain amount of money on tools that extract and present data from relational databases. Being able to hook these tools up to Hadoop via a standard interface (think JDBC/ODBC, as in the sketch after this list) can make a difference in leveraging the existing investments.
  • Reducing complexity – Let's face it: writing MapReduce jobs is not easy, and writing Java is more difficult than writing SQL. Java is good from a maintainability perspective, as it is quite verbose, but using it for ad hoc queries is quite inconvenient. Writing MapReduce jobs requires a certain amount of boilerplate code, and this easily puts off non-programmers.
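
To make the JDBC/ODBC point concrete, here is a minimal sketch of reusing existing SQL skills and driver-based tooling against Hadoop from Python. The DSN name ("HiveDSN") and the pageviews table are hypothetical; it assumes an ODBC driver for HiveServer2 has already been configured on the client.

```python
# A minimal sketch: any ODBC-capable tool or script can talk to Hadoop
# through a standard driver. "HiveDSN" and the pageviews table are
# hypothetical and assume a configured HiveServer2 ODBC driver.
import pyodbc

conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cursor = conn.cursor()

# Plain SQL, no MapReduce boilerplate: the engine plans the job for us.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM pageviews
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)

conn.close()
```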

Given the favourable forecasts analysts give for the big data market (e.g. IDC believe that the big data market will grow at 26.5% annually, reaching $41.5 billion by 2018 [7]), it is no wonder that the Hadoop distribution vendors, together with the traditional RDBMS providers, are willing to play in this space.

What options are out there

Apache Hive

As we already mentioned, Apache Hive was the first entry in the "SQL on Hadoop" space. It uses a language called HiveQL to provide an SQL-like interface to Hadoop. Strictly speaking, HiveQL is not SQL-92 compliant. It lacks widely used features like transactions and materialized views, and although ACID support was added in version 0.14, the Hive documentation clearly states that "Hive is not designed for OLTP and does not offer real-time queries and row level updates" [8].

Hive works by storing metadata, such as table definitions and mappings, in an embedded (Derby) or remote (e.g. MySQL) relational database. This metadata repository is known as the Hive Metastore. The information in the Metastore provides schema-on-read functionality and enables Hive to perform semantic checks on submitted queries. Using the information in the Metastore, the Hive server transforms HiveQL queries into MapReduce jobs.
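
As a hedged illustration of the Metastore and schema on read, the sketch below registers an external table over files already sitting in HDFS. It assumes HiveServer2 is reachable on localhost:10000 and uses the PyHive client; the table name and HDFS path are hypothetical.

```python
# A hedged sketch of Hive's schema-on-read model using PyHive.
# Assumes HiveServer2 on localhost:10000; the clickstream table and
# the /data/clickstream HDFS path are hypothetical.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# This DDL only writes metadata (schema + location) to the Metastore.
# The CSV files under /data/clickstream are untouched; they are parsed
# against this schema at query time -- schema on read.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
        ts BIGINT,
        user_id STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/clickstream'
""")

# Only now, at read time, is the data checked against the schema; the
# query itself is compiled into MapReduce jobs by the Hive server.
cursor.execute("SELECT COUNT(*) FROM clickstream")
print(cursor.fetchone())
```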

Apache Hive Architecture

Using MapReduce to process queries is what Hive is most criticised for. The conversion to MapReduce jobs leads to high query latency because of the start-up overhead, which makes Hive better suited to analysing relatively static data where query execution time is not critical. One way to remedy the latency issue is the Hive on Spark initiative, an effort to make Hive run on top of Apache Spark rather than MapReduce. This will bring performance benefits, especially for queries involving multiple reducer stages [9]. As of this writing, Hive on Spark is still under active development.
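
Notably, the execution engine is a per-session Hive setting, so once Hive on Spark is available in a given build it can be toggled without changing any queries. A minimal sketch, assuming a Hive installation configured with the Spark engine and the hypothetical clickstream table from above:

```python
# A hedged sketch: switching Hive's execution engine per session.
# hive.execution.engine is a standard Hive setting (default 'mr');
# the 'spark' value assumes a build with Hive on Spark installed.
from pyhive import hive

cursor = hive.connect(host="localhost", port=10000).cursor()
cursor.execute("SET hive.execution.engine=spark")
cursor.execute("SELECT COUNT(*) FROM clickstream")  # now planned as Spark jobs
print(cursor.fetchone())
```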

Hive strengths

  • It is easier to write HiveQL than MapReduce jobs (generally true for all SQL on Hadoop solutions)
  • Low maintenance and simple to learn
  • Hive is the de facto standard – every Hadoop distribution includes Hive

Limitations of Hive

  • Not designed for OLTP
  • No real-time queries and row level updates
  • High query latency (minutes) [8]
  • Limited subquery support (e.g. feeding the output of one query as the input of another) [10]

Impala

In October 2012 Cloudera announced its open source SQL query engine for Apache Hadoop, called Impala. In contrast to Hive, which is written in Java, Impala uses a massively parallel processing (MPP) engine written in C++. Not having to use MapReduce under the covers makes Impala better suited to real-time, ad hoc queries. Impala is also well integrated with Hive – having a Hive Metastore is a prerequisite for using it – and it uses the same syntax (HiveQL) and ODBC driver as Hive [11]. There are, however, certain differences between Impala and Hive, and there are HiveQL features that are not available in Impala [12].
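
Since Impala speaks the HiveServer2 protocol, existing clients largely carry over. A minimal sketch using the impyla client, assuming an impalad daemon on a hypothetical host with the default port 21050 and the clickstream table from the earlier Hive example:

```python
# A minimal sketch of querying Impala with the same HiveQL used above.
# "impala-host" is hypothetical; 21050 is the default HiveServer2-
# compatible port exposed by the impalad daemon.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cursor = conn.cursor()

# Identical syntax to Hive, but executed by Impala's C++ MPP engine
# rather than being compiled into MapReduce jobs.
cursor.execute("SELECT url, COUNT(*) AS hits FROM clickstream GROUP BY url")
print(cursor.fetchall())
```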

Impala is sometimes criticised for its in-memory join implementation: when the joined tables can't fit into memory, Impala simply cancels the query [13].
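
One mitigation is to cap per-query memory explicitly. A hedged sketch using Impala's MEM_LIMIT query option (the 2g value and the users table are purely illustrative, and issuing SET as a SQL statement assumes a recent Impala version):

```python
# A hedged sketch of capping query memory with Impala's MEM_LIMIT option.
from impala.dbapi import connect

cursor = connect(host="impala-host", port=21050).cursor()
cursor.execute("SET MEM_LIMIT=2g")  # illustrative cap; accepts byte suffixes

# The hypothetical users table joins against clickstream. If the hash join
# exceeds the cap, pre-2.0.0 Impala cancels the query, while 2.0.0+ can
# spill the join to disk instead (see the strengths list below).
cursor.execute("""
    SELECT c.user_id, COUNT(*) AS views
    FROM clickstream c JOIN users u ON c.user_id = u.user_id
    GROUP BY c.user_id
""")
print(cursor.fetchmany(5))
```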

Impala strengths

  • Lower query latency than Hive as it bypasses MapReduce
  • Integrated with the Hive metastore so it can query tables managed by Hive (HDFS or HBase)
  • Written in C++. Provides a C++ API for user defined functions (UDF)
  • "Spilling to disk" support introduced in version 2.0.0 to limit the chance of out-of-memory errors
  • Impala is open source

Limitations of Impala

  • Relatively new product, still outperformed by traditional players (e.g. Big SQL), especially for complex queries [14]
  • Implements a subset of SQL-92, not fully ANSI compliant
  • Out-of-memory errors still happen (the "minimum buffers" issue)
  • Cloudera have been accused of "bending the truth" and not being honest in their Impala benchmarks [15]

Big SQL

Big SQL is IBM's proprietary SQL engine on top of Hadoop, provided as a value-add on top of their ODP-standard Hadoop distribution, BigInsights.

Big SQL is based on IBM DB2 and benefits from decades of IBM R&D investment in relational databases. It provides a fully ANSI-compliant SQL implementation and doesn't suffer from the limitations that trouble most of its competitors. For example, most other implementations have limited support for subqueries (e.g. not allowing them in SELECT lists or HAVING clauses, or with certain quantifiers such as SOME, ANY, or ALL). Big SQL, on the other hand, has no comparable restrictions and provides full support for subqueries. It also comes with over 200 built-in functions, including a wide range of OLAP aggregate functions. Another benefit is support for user defined functions (UDFs), which can be written in C, Java, or SQL.

There is also a wide range of advanced features traditionally found in enterprise RDBMSs, like fine-grained access control (FGA), data masking, row-based access controls, and federated queries. The latter is very useful when aggregating data across Hadoop and relational databases, as it allows a single query to pull and process data from both RDBMS and Hadoop sources, as the sketch below illustrates.
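
A hedged sketch of what this looks like from a client: the DSN, the Hadoop table sales_hdfs, and the nickname db2_customers (a federated alias for a table in a remote DB2 database) are all hypothetical.

```python
# A hedged sketch against Big SQL over ODBC. "BigSQL", sales_hdfs, and
# the federated nickname db2_customers are all hypothetical names.
import pyodbc

cursor = pyodbc.connect("DSN=BigSQL", autocommit=True).cursor()

# A scalar subquery in the SELECT list -- restricted or unsupported in
# several competing engines -- combined with a federated read that pulls
# reference data from the remote RDBMS in the same statement.
cursor.execute("""
    SELECT c.name,
           (SELECT SUM(s.amount)
            FROM sales_hdfs s
            WHERE s.customer_id = c.id) AS total_spend
    FROM db2_customers c
""")
print(cursor.fetchmany(5))
```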

IBM Big SQL Architecture

There is a similarity between the architectures of Big SQL and Impala: both solutions use their own services alongside Hadoop and don't rely on MapReduce. Big SQL has its own workers, written in C++, that access HDFS data (e.g. delimited files, sequence files, Parquet, Avro, etc.) directly. For all other data types a Java I/O engine is used, which also allows access to custom file formats and SerDes.

Big SQL uses a shared-nothing architecture and pushes processing out to the data nodes to maximise data locality. It also performs intelligent data partition elimination based on SQL predicates (sketched below). Access to Big SQL is via ODBC/JDBC drivers.
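
A short sketch of partition elimination, assuming Big SQL's CREATE HADOOP TABLE DDL for Hive-compatible tables; the table and column names are hypothetical.

```python
# A hedged sketch of predicate-based partition elimination in Big SQL.
import pyodbc

cursor = pyodbc.connect("DSN=BigSQL", autocommit=True).cursor()
cursor.execute("""
    CREATE HADOOP TABLE IF NOT EXISTS sales_hdfs (
        customer_id INT,
        amount DECIMAL(10, 2)
    )
    PARTITIONED BY (sale_date VARCHAR(10))
""")

# The predicate on the partitioning column lets the planner skip every
# partition except sale_date='2015-11-01', so the workers only read the
# files under that partition's directory.
cursor.execute(
    "SELECT SUM(amount) FROM sales_hdfs WHERE sale_date = '2015-11-01'"
)
print(cursor.fetchone())
```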

Big SQL strengths

  • Integrated with the Hive metastore so it can query tables managed by Hive (HDFS or HBase)
  • No lock-in: data stays in Hadoop, not Big SQL
  • Written in C++
  • Audited TPC benchmarks show Big SQL to be faster than Hive and Impala [14]
  • Strong "spill to disk" support, which prevents out-of-memory issues
  • Fully ANSI SQL compliant
  • Embedded high-availability (it supports standby Schedulers and Master nodes)

Limitations of Big SQL

  • Proprietary, closed source
  • Not a free software (requires a licence to use)

Continue to Part II ...

References

[1] Qubole Founders Open Up About the Transformation of Hadoop, June 18, 2014, Nate Philip
[2] Big Data at the Speed of Business
[3] The Microsoft modern data warehouse, Microsoft white paper
[4] Unified Query for Big Data Management Systems, Oracle white paper, January 2015
[5] Hadoop Hits the Big Time, Thomas H. Davenport, The Wall Street Journal, June 2014
[6] The 8 Most In-Demand Programming Languages of 2015, Coding Dojo, August 2015
[7] Big Data & Analytics – An IDC Four Pillar Research Area, Big Data Research
[8] What Hive is NOT, Hive Tutorial, The Apache Hive wiki, October 2015
[9] Hive on Spark, HIVE-7292, Created Jun 2014
[10] Hive Language Manual, Apache Hive wiki, October 2015
[11] Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real, Cloudera Engineering Blog, October 2012
[12] HiveQL Features not Available in Impala, SQL Differences Between Impala and Hive, Cloudera Product Documentation
[13] What happens when the data set exceeds available memory, Cloudera Impala Frequently Asked Questions, Cloudera Product Documentation
[14] Big SQL 3.0: Hadoop-DS benchmark–Performance isn’t everything..., IBM Hadoop Dev, December 2014
[15] A fair benchmark for Impala 2.0, Dataminded blog, October 2014