Using Fluid Query to offload data to Hadoop

Offloading relational data to HDFS is a common Hadoop use case known as "online archive". This approach lets companies meet a common requirement: keeping the size of the data warehouse under control while still being able to query historical data.

An interesting offering in this space is IBM Fluid Query, a free add-on to PureData System for Analytics, IBM's data warehousing platform built on Netezza technology.

Moving data between Netezza and Hadoop using Fluid Query

Fluid Query provides connectivity between PureData System for Analytics (PDA) and Spark or Hadoop. It offers both querying and data movement capabilities, can reach out to common Hadoop distributions like Hortonworks and BigInsights, works with the Spark SQL connector, and fully supports IBM Big SQL.

In this tutorial I am going to show you how to set up a virtual environment for testing Fluid Query, using a VMware Virtual Machine for IBM's Hadoop distribution (BigInsights) and the IBM Netezza Software Emulator.

Prerequisites

The only prerequisite is to have VMware Player installed and enough memory to run the BigInsights VM (12 GB) and the Netezza Software Emulator (at least 4 GB).

Please note that some of the software required for this tutorial is non-free and requires a valid Fix Central account to download. If you don't have an account that allows you to download Netezza Analytics and Fluid Query, there is really no point in reading further.

Setting up a Netezza Software Emulator

To download the Netezza Software Emulator, go to IBM's software download site. You will have to log in with your IBM id; if you don't have one, you'll be asked to create it (a fairly straightforward process).

Netezza Software Download

Once logged in, select the IBM Netezza Software Emulator for Developers V7.2.1 for Linux and download it. Pay no attention to the title; the emulator works just fine on both Linux and Windows. Scroll all the way down and click "Continue".

IBM Netezza Developer Network - Method Selection

You’ll have to set your contact preferences and accept the licence agreement by clicking "I agree".

Select your preferred download method (Download Director or http) and download the Software Emulator OVA file.

Start VMware Player, select File -> Open and select the OVA file. Set the VM name and the preferred destination and click "Import".

Import Virtual Machine Dialog

Once the import operation is completed, right click the VM and select "Settings" from the context menu. Go to the "Processor" section and enable the "Intel VT-x/EPT and AMD-V/RVI" option.

Virtual Machine Settings Screen

Power on the VM and make sure it starts correctly. You can log in using either the root/netezza account or the nz/nz account.

Netezza Emulator Console

Note the hostname and the IP address (192.168.121.130 in my case) and leave the VM as it is for the time being.
Time to deal with the Hadoop VM.

Setting up a BigInsights VM

You can build your own BigInsights installation from scratch (install IOP and then add Big SQL on top) or, if you want a shortcut, simply download the BigInsights VMware image provided by IBM.

Go to http://www.ibm.com/analytics/us/en/technology/hadoop, scroll down and click "Try it".

IBM Analytics - Hadoop website

Click the arrow next to "For VM image" in the IBM BigInsights Quick Start section and follow the link "Download here".
You'll be prompted for your IBM id and upon successful authentication you will be presented with a list of downloadable images.

BigInsights Quick Start Edition Download Website

Select your preferred method (Download Director or http), put a tick next to "IBM BigInsights Quick Start Edition for Non-Production Environment VMware image" and click "Download now".

Unzip the downloaded archive and follow the same process as with the Netezza Emulator to import the OVF file into VMware Player.

VMware Player Import Dialog

Power on the newly created BigInsights VM.

BigInsights Licence Agreement Screen

Press "I understand" to accept the licence agreement.

BigInsights Startup Console

Note the hostname (rvm), IP address, and account details for the BigInsights installation.

Open a browser and point it to the IP address of your BigInsights VM (in my case – 192.168.121.131). Use the default Ambari port - 8080.

Ambari Login Screen

Log in as admin/admin and confirm that all BigInsights services are up and running (you might have to wait for a while; keep an eye on the number of running operations at the top of the screen next to the cluster name - BI4_QSE).

BigInsights Ambari Dashboard

If some of the services haven't started by default (red status next to their names), you can select and start them manually.
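If you prefer the command line, Ambari also exposes a REST API that reports service state. A quick check along these lines (an illustration only – adjust the IP address and cluster name to match your setup) lists each service and whether it is STARTED:

curl -s -u admin:admin 'http://192.168.121.131:8080/api/v1/clusters/BI4_QSE/services?fields=ServiceInfo/state'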

Open BigInsights Home by going to https://192.168.121.131:8443/gateway/default/BigInsightsWeb/index.html (replace 192.168.121.131 with the IP of your BigInsights VM). If you are using the BigInsights VMware image you should log in as user biadmin/biadmin.

BigInsights Home Web Page

Click "Launch" in the Data Server Manager for Big SQL.

Data Server Manager for Big SQL Web Page

You should see a default connection (BIGSQL) in the Database connections section. Select it and click the edit (pencil) button above.

BIGSQL connection settings

Note the Big SQL port number (32051).

Optionally, test the connection using the bigsql/bigsql account.

BIGSQL user/password dialog

Confirm that your Big SQL server is responding

Successful connection confirmation

Network configuration

Nothing fancy here – just make sure the two VMs can ping and resolve each other by name. First, add an entry for netezza to the BigInsights VM's /etc/hosts file:

[root@rvm ~]# cat /etc/hosts
# File is generated from /home/virtuser/setHostname.sh
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
192.168.121.131 rvm.svl.ibm.com rvm
192.168.121.130 netezza
[root@rvm ~]#

Also add an entry for rvm to the /etc/hosts file in your Netezza Software Emulator:

[root@netezza ~]# cat /etc/hosts
127.0.0.1 netezza localhost.localdomain localhost

192.168.121.131 rvm.svl.ibm.com rvm
[root@netezza ~]#

Then make sure the two can ping each other.

Ping between Netezza and BigInsights

Optionally, make sure you can reach the VMs over SSH. I personally find PuTTY more convenient than the VMware Player console.
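For example, you can run quick checks like these from the two consoles (output omitted):

[root@rvm ~]# ping -c 2 netezza
[root@netezza ~]# ping -c 2 rvm
[root@rvm ~]# ssh nz@netezza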

Setting up Netezza Analytics Software

Netezza Analytics (informally known as INZA) is a prerequisite for using Fluid Query, so we have to download and install it in our Netezza Emulator.

This Fix Central link will take you to the right location.

Fix Central - Netezza Analytics

Select the 3.2.1.0-IM-Netezza-ANALYTICS pack and click "Continue".

HTTPS download - file selection

Click on nz-analytics-v3.2.1.zip and, once downloaded, copy it via SCP to the /home/nz folder in the Netezza Emulator.

Log into the Emulator as user nz and unzip the archive.

[nz@netezza ~]$ ll
total 960764
-rw-r--r-- 1 nz nz 983814272 Nov 18 08:20 nz-analytics-v3.2.1.zip
[nz@netezza ~]$ unzip nz-analytics-v3.2.1.zip
Archive:  nz-analytics-v3.2.1.zip
 extracting: inza-3.2.1.zip
  inflating: inzaPackageInstaller.sh
  inflating: verifyNpsVersion
[nz@netezza ~]$

Start the inzaPackageInstaller.sh script to install the Analytics packages:

[nz@netezza ~]$ ./inzaPackageInstaller.sh

Do you want to install the Netezza Analytics packages? (y/n): y
Do you want to install the documentation packages? (y/n): n

Confirm that the following packages should be installed.
 1) Netezza Analytics package: YES
 2) Netezza Analytics Documentation package: NO
Do you want to install the above selections?
   Enter "y" to continue, "x" to exit
   or any other key to be prompted to modify your selection: y

Installing Netezza Analytics packages...
Available compressed installation file or files:
  [0] /export/home/nz/inza-3.2.1.zip
  [1] A compressed file in a different directory or with a non-standard name.

Enter your selection: 0
Archive:  /export/home/nz/inza-3.2.1.zip
 extracting: L-BDRR-9NKJYB.zip
    linking: currentLicense          -> L-BDRR-9NKJYB.zip
  inflating: nzcmrepo/adapter_utilities-3.2.1.46381.nzc
  inflating: nzcmrepo/admin_utilities-3.2.1.46381.nzc
  inflating: nzcmrepo/ae_sysroot-3.2.1.46381.nzc
...
Installing NZCM v.3.2.1...
Your .bashrc file was backed up as .bashrc_bak
Your .bashrc file was modified to include a path to the nzcm cartridge manager.
Done

NZCM successfully installed.

You can ignore the following message until the installation is completed.
        "You need to restart your Bash session or run 'source ~/.bashrc'."


Do you want to run the Netezza Analytics cartridge installer now? (y/n): y
Do you want to do an Express (e) installation or a Custom (c) installation (e/c): e
Confirm that the following components should be installed.
 1) MapReduce: YES
 2) Matrix: YES
 3) IBM Netezza In-database Analytics: YES
 4) Spatial: YES (ESRI)
Do you want to install the selected components?
   Enter "y" to continue, "x" to exit
   or any other key to be prompted to change your selection: y
Installing INZA base components...


...
Registration of nzspatial_esri completed on 'INZA'.

Log file: /nz/var/log/nzcm.20151118.09_29_14.34562.log
Would you like to re-enable all databases that are enabled for IBM Netezza Analytics? (y/n): y
...
Would you like to update all spatial databases that already have spatial registered with the current version? (y/n): y

Installed nzspatial version       3.2.1.46381
Installed nzspatial_esri version  3.2.1.46381

Database INZA
Registered nzspatial_esri version 3.2.1.46381
No update needed in INZA

WARNING: Running with default 'YES' option.


Synchronization done: 0 registered cartridges removed, 0 registered cartridges added.

The IBM Netezza Analytics package installation is complete.
[nz@netezza ~]$

Let's apply the changes made to ~/.bashrc. We can then verify that INZA has been successfully installed by running nzcm.

[nz@netezza ~]$ source ~/.bashrc
[nz@netezza ~]$ nzcm --inza
Installed inza version: 3.2.1.46381
[nz@netezza ~]$

Installing Fluid Query on Netezza

Our next task is to install Fluid Query. Go back to Fix Central to download it.

Fix Central - Fluid Query Download Page

Select the Fluid Query 1.6 fix pack and click "Continue".

Fluid Query Download - file selection page

Click on nz-fluidquery-v1.6.tar.gz to start the download. When finished, transfer the file to the /home/nz folder in the Netezza Emulator.

Unpack the archive.

[nz@netezza ~]$ tar xvfz nz-fluidquery-v1.6.tar.gz
fluid-query-import-export-v1.6.0.0.tar
fluidquery_install.sh
fluid-query-sql-v1.6.0.0.tar
license/
...
license/LI_el
[nz@netezza ~]$

Start the installation by running the fluidquery_install.sh script.

[nz@netezza ~]$ ./fluidquery_install.sh
-------------------------------------------------------------------------------
IBM Fluid Query Installer
(C) Copyright IBM Corp. 2015  All rights reserved.
Version 1.6.0.0 [Build 151110-258]
-------------------------------------------------------------------------------
Installing connector...
Checking if IBM Netezza Analytics is installed...
IBM Netezza Analytics is already installed. Ok.
Checking if IBM Netezza Analytics cartridge is installed...
IBM Netezza Analytics cartridge is already installed. Ok.
Checking for previous installation of IBM Fluid Query...
Previous installation of IBM Fluid Query does not exists in /nz/export/ae/products/fluidquery...
Extracting package to /nz/export/ae/products/fluidquery...
Copying license to /opt/nz/licenses/ifq...
Copying uninstaller to /nz/export/ae/products/fluidquery...
Installation successful! Package installed in /nz/export/ae/products/fluidquery
For details see the log file at /nz/var/log/fluidquery_install/fluidquery_install.2015-11-18.10:36:05.log
[nz@netezza ~]$

Get the Big SQL JDBC driver from the BigInsights installation and place it in the Fluid Query libraries folder:

[nz@netezza ~]$ scp root@rvm:/usr/ibmpacks/bigsql/4.1/db2/java/db2jcc.jar /nz/export/ae/products/fluidquery/libs/ibm/bigsql/db2jcc.jar
root@rvm's password:
db2jcc.jar                                    100% 3139KB   3.1MB/s   00:00
[nz@netezza ~]$ ll /nz/export/ae/products/fluidquery/libs/ibm/bigsql
total 3140
-r--r--r-- 1 nz nz 3214190 Nov 19 15:02 db2jcc.jar
[nz@netezza ~]$

Create and configure the test database in Netezza

We will now create a database that we will use to test different Fluid Query functionalities. We will name the database BANKDB as we will be using a demo data set from the UCI Machine Learning Repository, containing data on bank marketing campaigns.

Log into the Netezza Emulator as user nz and look at the existing databases using nzsql.

[nz@netezza ~]$ nzsql
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

SYSTEM.ADMIN(ADMIN)=> \l
  List of databases
 DATABASE |  OWNER
----------+----------
 INZA     | INZAUSER
 NZA      | INZAUSER
 NZM      | INZAUSER
 NZMSG    | INZAUSER
 NZR      | INZAUSER
 NZRC     | INZAUSER
 NZVERIFY | INZAUSER
 SYSTEM   | ADMIN
 TEST     | ADMIN
(9 rows)

SYSTEM.ADMIN(ADMIN)=>

Create the BANKDB database and exit nzsql.

SYSTEM.ADMIN(ADMIN)=> CREATE DATABASE BANKDB;
CREATE DATABASE
SYSTEM.ADMIN(ADMIN)=> \q
[nz@netezza ~]$

Enable the database to use INZA.

[nz@netezza ~]$ /nz/export/ae/utilities/bin/create_inza_db.sh BANKDB
CREATE GROUP
CREATE GROUP
CREATE GROUP
ALTER GROUP
ALTER GROUP
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
 INITIALIZE
------------
 t
(1 row)

                     INITIALIZE
----------------------------------------------------
 The metadata objects are successfully initialized.
(1 row)

nzspatial_esri registered in INZA
Setting up nzspatial_esri
Using INZA database version 3.2.1
                   ST_INITIALIZE
----------------------------------------------------
 The metadata objects are successfully initialized.
(1 row)

GRANT
GRANT
GRANT
[nz@netezza ~]$

Now that we have a database in place, it is time to create the Fluid Query configuration. Use the fqConfigure.sh script to set the host, port, user, and service. All of these details are available in the BIGSQL connection settings that you tested earlier.

[nz@netezza ~]$ /nz/export/ae/products/fluidquery/fqConfigure.sh --host rvm --port 32051 --username bigsql --config default --provider ibm --service BIGSQL
Using user specified non-default BIGSQL port number: 32051
Please enter your client password:
Connection configuration success.
[nz@netezza ~]$

The script creates a properties file, which you can edit if needed.

[nz@netezza ~]$ cat /nz/export/ae/products/fluidquery/default.properties
#Fluid Query connection configuration
#Tue Nov 24 10:35:38 EST 2015
krb5-conf=
client-principal=
password=6,-110,14,-114,-93,-36,-106,-63,82,-125,-65,10,83,-121,89,71
ssl-truststore-password=
connectionXmlPath=
driver-path=/nz/export/ae/products/fluidquery/libs/ibm/bigsql/
varchar-size=1000
username=bigsql
service-principal=
ssl-truststore=
dataTypesXmlPath=
port=32051
ssl=false
database=
keytab=
host=rvm
fqdm-conf=
debug=false
udtfname=
auth-type=USER_PASS
str-to-nvarchar=false
provider=IBM
service=BIGSQL
[nz@netezza ~]$

Let's register the BANKDB database with Fluid Query. Set the NZ_DATABASE environment variable to point to BANKDB and run the fqRegister.sh script.

[nz@netezza ~]$ export NZ_DATABASE=BANKDB
[nz@netezza ~]$ /nz/export/ae/products/fluidquery/fqRegister.sh
Optional parameter --udtf not provided. Using default "FqRead" as name
Functions and credentials are successfully registered in database "BANKDB".
[nz@netezza ~]$

Let's confirm that the FqRead function is now available in BANKDB.

[nz@netezza ~]$ nzsql -d BANKDB
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

BANKDB.ADMIN(ADMIN)=> show function FqRead;
 SCHEMA |   RESULT   | FUNCTION | BUILTIN |                                            ARGUMENTS
--------+------------+----------+---------+--------------------------------------------------------------------------------------------------
 ADMIN  | TABLE(ANY) | FQREAD   | f       | (CHARACTER VARYING(ANY), CHARACTER VARYING(ANY))
 ADMIN  | TABLE(ANY) | FQREAD   | f       | (CHARACTER VARYING(ANY), CHARACTER VARYING(ANY), CHARACTER VARYING(ANY))
 ADMIN  | TABLE(ANY) | FQREAD   | f       | (CHARACTER VARYING(ANY), CHARACTER VARYING(ANY), CHARACTER VARYING(ANY), CHARACTER VARYING(ANY))
 ADMIN  | TABLE(ANY) | FQREAD   | f       | (CHARACTER VARYING(ANY), CHARACTER VARYING(ANY), CHARACTER VARYING(ANY), INTEGER)
(4 rows)

BANKDB.ADMIN(ADMIN)=>

Loading the test data

Use wget to download the data set from the UCI Machine Learning Repository, then unzip the archive.

[nz@netezza ~]$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
…
2015-12-08 03:20:33 (291 KB/s) - “bank.zip” saved [579043/579043]

[nz@netezza ~]$ unzip bank.zip
Archive:  bank.zip
  inflating: bank-full.csv
  inflating: bank-names.txt
  inflating: bank.csv
[nz@netezza ~]$

Take a quick look at what's inside – each row represents an observation, the values are semicolon-separated, and the string values are enclosed in double quotes. The first row contains the attribute names.

[nz@netezza ~]$ head bank.csv
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"
35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"
30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"no"
59;"blue-collar";"married";"secondary";"no";0;"yes";"no";"unknown";5;"may";226;1;-1;0;"unknown";"no"
35;"management";"single";"tertiary";"no";747;"no";"no";"cellular";23;"feb";141;2;176;3;"failure";"no"
36;"self-employed";"married";"tertiary";"no";307;"yes";"no";"cellular";14;"may";341;1;330;2;"other";"no"
39;"technician";"married";"secondary";"no";147;"yes";"no";"cellular";6;"may";151;2;-1;0;"unknown";"no"
41;"entrepreneur";"married";"tertiary";"no";221;"yes";"no";"unknown";14;"may";57;2;-1;0;"unknown";"no"
[nz@netezza ~]$
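As a quick sanity check before loading, you can count the lines in the file; with one header row you should see one more line than the number of records (4,522 lines for the 4,521 observations in bank.csv):

[nz@netezza ~]$ wc -l bank.csv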

We will use the following DDL to create a table in BANKDB for storing the test data. We'll name the table BANKCAMPAIGN:

CREATE TABLE BANKCAMPAIGN (
  AGE INTEGER,
  JOB VARCHAR(20),
  MARITAL VARCHAR(20),
  EDUCATION VARCHAR(20),
  DEF VARCHAR(20),
  BALANCE INTEGER,
  HOUSING VARCHAR(20),
  LOAN VARCHAR(20),
  CONTACT VARCHAR(20),
  CDAY INTEGER,
  CMONTH VARCHAR(30),
  CDURATION INTEGER,
  CAMPAIGN INTEGER,
  PDAYS INTEGER,
  CPREVIOUS INTEGER, 
  POUTCOME VARCHAR(20),
  YES VARCHAR(3)) 
DISTRIBUTE ON RANDOM;

Go back into nzsql and run the above statement.

[nz@netezza ~]$ nzsql -d BANKDB
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

BANKDB.ADMIN(ADMIN)=> CREATE TABLE BANKCAMPAIGN (
BANKDB.ADMIN(ADMIN)(>   AGE INTEGER,
BANKDB.ADMIN(ADMIN)(>   JOB VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   MARITAL VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   EDUCATION VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   DEF VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   BALANCE INTEGER,
BANKDB.ADMIN(ADMIN)(>   HOUSING VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   LOAN VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   CONTACT VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   CDAY INTEGER,
BANKDB.ADMIN(ADMIN)(>   CMONTH VARCHAR(30),
BANKDB.ADMIN(ADMIN)(>   CDURATION INTEGER,
BANKDB.ADMIN(ADMIN)(>   CAMPAIGN INTEGER,	
BANKDB.ADMIN(ADMIN)(>   PDAYS INTEGER,
BANKDB.ADMIN(ADMIN)(>   CPREVIOUS INTEGER,
BANKDB.ADMIN(ADMIN)(>   POUTCOME VARCHAR(20),
BANKDB.ADMIN(ADMIN)(>   YES VARCHAR(3))
BANKDB.ADMIN(ADMIN)-> DISTRIBUTE ON RANDOM;
CREATE TABLE
BANKDB.ADMIN(ADMIN)=>

Let's verify the table has been created.

BANKDB.ADMIN(ADMIN)=> \d
                  List of relations
 Schema |         Name          |   Type   |  Owner
--------+-----------------------+----------+----------
 ADMIN  | BANKCAMPAIGN          | TABLE    | ADMIN
 ADMIN  | GEOMETRY_COLUMNS      | TABLE    | INZAUSER
 ADMIN  | NZA_META_COLPROPS     | TABLE    | INZAUSER
 ADMIN  | NZA_META_COMPONENTS   | TABLE    | INZAUSER
 ADMIN  | NZA_META_IDSEQUENCE   | SEQUENCE | INZAUSER
 ADMIN  | NZA_META_LOCKVIEW     | VIEW     | INZAUSER
 ADMIN  | NZA_META_MODELS       | TABLE    | INZAUSER
 ADMIN  | NZA_META_PARAMS       | TABLE    | INZAUSER
 ADMIN  | NZ_MAT_BRDCST_COMMAND | TABLE    | INZAUSER
 ADMIN  | NZ_MAT_METADATA       | TABLE    | INZAUSER
 ADMIN  | SPATIAL_REF_SYS       | TABLE    | INZAUSER
 ADMIN  | V_NZA_COLPROPS        | VIEW     | INZAUSER
 ADMIN  | V_NZA_COMPONENTS      | VIEW     | INZAUSER
 ADMIN  | V_NZA_MODELS          | VIEW     | INZAUSER
 ADMIN  | V_NZA_PARAMS          | VIEW     | INZAUSER
(15 rows)

BANKDB.ADMIN(ADMIN)=>

Quit nzsql and use nzload to parse and ingest the bank.csv file. Our target table is BANKCAMPAIGN in the BANKDB database. We will skip the header row, set the semicolon delimiter, and enable the quotedValue flag since the string values are surrounded by double quotes.

[nz@netezza ~]$ nzload -skipRows 1 -maxErrors 10 -t BANKCAMPAIGN -df bank.csv -db BANKDB -format text -delim ";" -quotedValue DOUBLE
Load session of table 'BANKCAMPAIGN' completed successfully
[nz@netezza ~]$

Go back into nzsql and confirm the number of rows in BANKCAMPAIGN.

[nz@netezza ~]$ nzsql -d BANKDB
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

BANKDB.ADMIN(ADMIN)=> SELECT COUNT(*) FROM BANKCAMPAIGN;
 COUNT
-------
  4521
(1 row)

BANKDB.ADMIN(ADMIN)=>

Installing Fluid Query in BigInsights

Installing Fluid Query in the BigInsights VM is a fairly straightforward procedure – we just have to copy the installation package across, extract it, and run the fluidquery_install.sh script.

Log into the BigInsights VM as root and create a target directory for Fluid Query.

[root@rvm ~]# mkdir /fluidqueryLocal
[root@rvm ~]# chown biadmin:biadmin /fluidqueryLocal
[root@rvm ~]#

Put the nz-fluidquery-v1.6.tar.gz file that you've downloaded from Fix Central in the BigInsights VM (I am using the /root/fq directory), then extract and install it.

[root@rvm fq]# ls -la
total 94260
drwxr-xr-x.  2 root root     4096 Nov 27 00:05 .
dr-xr-x---. 30 root root     4096 Nov 27 00:05 ..
-rw-r--r--.  1 root root 96513570 Nov 27 00:05 nz-fluidquery-v1.6.tar.gz
[root@rvm fq]# gunzip nz-fluidquery-v1.6.tar.gz
[root@rvm fq]# tar -xf nz-fluidquery-v1.6.tar
[root@rvm fq]# tar -xf fluid-query-sql-v1.6.0.0.tar
[root@rvm fq]# tar -xf fluid-query-import-export-v1.6.0.0.tar
[root@rvm fq]#
[root@rvm fq]# su - hdfs
[hdfs@rvm ~]$ hadoop fs -mkdir /fluidquery
[hdfs@rvm ~]$ hadoop fs -chown bigsql:hadoop /fluidquery
[hdfs@rvm ~]$
[root@rvm fq]#
[root@rvm fq]# ./fluidquery_install.sh --datamove_path /fluidqueryLocal --hdfs_user bigsql
-------------------------------------------------------------------------------
IBM Fluid Query Installer
(C) Copyright IBM Corp. 2015  All rights reserved.
Version 1.6.0.0 [Build 151110-258]
-------------------------------------------------------------------------------
Installing data movement...
Data movement path: /fluidqueryLocal
Checking for previous installation of IBM Fluid Query data movement...
Previous installation of data movement does not exist in /fluidqueryLocal...
Creating HDFS directory /fluidquery...
Copying files to /fluidqueryLocal...
Copying license to /fluidqueryLocal...
Copying uninstaller to /fluidqueryLocal...
Copying jars to HDFS...
BigInsight exists. Copying jars to BigSQL nodes...
install IFQ in BigSQL v3
 -> install IFQ on node: rvm.svl.ibm.com
 -> install IFQ on node: rvm.svl.ibm.com
IBM Fluid Query data movement installed succesfully in /fluidqueryLocal!
For details see the log file at /fluidqueryLocal/var/log/fluidquery_install/fluidquery_install.2015-11-27.02:45:09.log
[root@rvm fq]#

Setting up and running an import job

We will now look at how we can use Fluid Query to move data from Netezza to Hadoop.

Say that you want to offload some old data from your Netezza appliance. For example, let's take all records from BANKCAMPAIGN for clients contacted in January. Let's log into the Emulator and see how many records fit this condition.

[nz@netezza ~]$ nzsql -d BANKDB
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit


BANKDB.ADMIN(ADMIN)=> SELECT COUNT(*) FROM BANKCAMPAIGN WHERE CMONTH = 'jan';
 COUNT
-------
   148
(1 row)

BANKDB.ADMIN(ADMIN)=>

Let's go back to the BigInsights VM and create an import configuration. Make a backup of the existing fq-import-conf.xml.

[root@rvm fq]# cp /fluidqueryLocal/fq-import-conf.xml /fluidqueryLocal/fq-import-conf.xml.bak
[root@rvm fq]#

Then change the values of the following attributes inside fq-import-conf.xml.

Attribute           Value            Description
nz.fq.hive.schema   bigsql           Target schema under which all imported tables will be created; if it doesn't exist it will be created automatically
nz.fq.nps.db        BANKDB           The Netezza database name; include double quotation marks around delimited database names
nz.fq.tables        BANKCAMPAIGN     Comma-separated list of Netezza tables; include double quotation marks around delimited table names
nz.fq.nps.server    netezza          The wall IP address or fully qualified hostname of the Netezza server
nz.fq.nps.password  password         The password for the Netezza database user account
nz.fq.nps.where     CMONTH = 'jan'   The SQL WHERE clause used to select the data to transfer
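Since the import job reads this file through Hadoop's standard -conf mechanism, fq-import-conf.xml follows the usual Hadoop XML configuration layout. As an illustration (an excerpt only – keep the rest of the shipped file intact), the edited properties should end up looking roughly like this:

<property>
  <name>nz.fq.hive.schema</name>
  <value>bigsql</value>
</property>
<property>
  <name>nz.fq.nps.db</name>
  <value>BANKDB</value>
</property>
<property>
  <name>nz.fq.tables</name>
  <value>BANKCAMPAIGN</value>
</property>
<property>
  <name>nz.fq.nps.server</name>
  <value>netezza</value>
</property>
<property>
  <name>nz.fq.nps.password</name>
  <value>password</value>
</property>
<property>
  <name>nz.fq.nps.where</name>
  <value>CMONTH = 'jan'</value>
</property>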

Next, prepare a target HDFS directory for storing the imported data.

[root@rvm fq]# su - hdfs
[hdfs@rvm ~]$ hadoop fs -mkdir /nzbackup
[hdfs@rvm ~]$ hadoop fs -mkdir /nzbackup/backup1
[hdfs@rvm ~]$ hadoop fs -chown bigsql:hadoop /nzbackup/backup1
[hdfs@rvm ~]$

You can now create a script that sets the required environment variables and runs the Fluid Query import using the configuration file above. Put the script in /fluidqueryLocal and name it simply import.sh (or something more meaningful). This is what my import.sh looks like.

#!/bin/bash
# Put the Hive libraries, the Hive configuration, and the Netezza JDBC driver on the Hadoop classpath
export HADOOP_CLASSPATH="/usr/iop/4.1.0.0/hive/lib/*:/usr/iop/4.1.0.0/hive/conf/:/fluidqueryLocal/nzjdbc3.jar"
# Launch the Fluid Query import job using the configuration file prepared earlier
hadoop jar /fluidqueryLocal/nzcodec.jar -conf fq-import-conf.xml -libjars /fluidqueryLocal/nzcodec.jar,/fluidqueryLocal/nzjdbc3.jar

Make sure the script is executable (chmod +x) and run it as user bigsql.

[root@rvm ~]# su - bigsql
[bigsql@rvm ~]$ cd /fluidqueryLocal/
[bigsql@rvm fluidqueryLocal]$ ./import.sh
WARNING: Use "yarn jar" to launch YARN applications.
FluidQuery version: 1.6.0.0 [Build 151110-258]
2015-11-27 03:24:35,869 0    [main] INFO  com.ibm.nz.fq.common.LogConfigurator  - Missing log4j.configuration file - using default configuration
…
 15/11/27 03:24:38 INFO parse.ParseDriver: Parsing command: DESCRIBE `BIGSQL.bankcampaign`
15/11/27 03:24:38 INFO parse.ParseDriver: Parse Completed
15/11/27 03:24:38 INFO log.PerfLogger: 
15/11/27 03:24:38 INFO log.PerfLogger: 
15/11/27 03:24:38 ERROR metadata.Hive: Table BIGSQL not found: default.BIGSQL table not found
15/11/27 03:24:38 ERROR metadata.Hive: Table BIGSQL not found: default.BIGSQL table not found
FAILED: SemanticException [Error 10001]: Table not found BIGSQL.bankcampaign
15/11/27 03:24:38 ERROR ql.Driver: FAILED: SemanticException [Error 10001]: Table not found BIGSQL.bankcampaign
org.apache.hadoop.hive.ql.parse.SemanticException: Table not found BIGSQL.bankcampaign
…
2015-11-27 03:24:38,859 2990 [main] TRACE com.ibm.nz.fq.transfer.plain.PlainDataTransferService  - Archive: ArchiveTable [name=bankcampaign, whereClause=CMONTH = 'jan', tableName=BANKCAMPAIGN, schemaName=, status=null, sourceDbDdl=*AGE*…, hiveDdl=CREATE EXTERNAL TABLE `BIGSQL.bankcampaign` ( `AGE` INT,`JOB` STRING,`MARITAL` STRING,`EDUCATION` STRING,`DEF` STRING,`BALANCE` INT,`HOUSING` STRING,`LOAN` STRING,`CONTACT` STRING,`CDAY` INT,`CMONTH` STRING,`CDURATION` INT,`CAMPAIGN` INT,`PDAYS` INT,`CPREVIOUS` INT,`POUTCOME` STRING,`YES` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\174'  ESCAPED BY '\\' STORED AS TEXTFILE LOCATION '/nzbackup/backup1/bankcampaign/table' TBLPROPERTIES("serialization.null.format"="NULL"), path=/nzbackup/backup1/bankcampaign, archive=ArchiveData [hiveSchema=BIGSQL, hiveTable=bankcampaign, where=CMONTH = 'jan', npsTable=BANKCAMPAIGN, nd=NpsData [npsServer=netezza, npsPort=5480, npsDatabase=BANKDB, npsUser=admin, npsSsl=false, npsSslCaCertificate=null], fieldTerminatedBy=|, nullValue=NULL]]
…
2015-11-27 03:24:39,009 3140 [main] INFO  com.ibm.nz.fq.transfer.plain.PlainDataTransferService  - Running new map-reduce task to import the data
15/11/27 03:24:39 INFO impl.TimelineClientImpl: Timeline service address: http://rvm.svl.ibm.com:8188/ws/v1/timeline/
15/11/27 03:24:39 INFO client.RMProxy: Connecting to ResourceManager at rvm.svl.ibm.com/192.168.121.131:8050
…
2015-11-27 03:24:55,316 19447 [main] DEBUG com.ibm.nz.fq.ArchiveManager  - Archive operation performed successfully
 Import finished successfully for table: BANKCAMPAIGN
Transfered 11.813 KB in 18.0s. (2.307 MB/h)
2015-11-27 03:24:55,317 19448 [main] INFO  com.ibm.nz.fq.NzTransfer  - Import finished. Transferred 1 out of 1
2015-11-27 03:24:55,317 19448 [main] INFO  com.ibm.nz.fq.NzTransfer  - Tables transferred: BANKCAMPAIGN
2015-11-27 03:24:55,317 19448 [main] DEBUG com.ibm.nz.fq.NzTransfer  - PERF: Fdm done | Total execution time: 18944 ms
Done
2015-11-27 03:24:55,317 19448 [main] INFO  com.ibm.nz.fq.NzTransfer  - Exiting with code: 0
[bigsql@rvm fluidqueryLocal]$

Note that Fluid Query detects that the target database and table do not exist and creates them on the fly.
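If you also want to verify the result at the storage level, you can list the files the import wrote to HDFS – the target location is visible in the log output above (output omitted):

[bigsql@rvm fluidqueryLocal]$ hadoop fs -ls /nzbackup/backup1/bankcampaign/table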

Let's verify that the data is now accessible in Hive.

[root@rvm ~]# su - hive
[hive@rvm ~]$ hive
SLF4J: Class path contains multiple SLF4J bindings.
…
Logging initialized using configuration in file:/etc/hive/4.1.0.0/0/hive-log4j.properties
hive> describe bigsql.bankcampaign;
OK
age                     int
job                     string
marital                 string
education               string
def                     string
balance                 int
housing                 string
loan                    string
contact                 string
cday                    int
cmonth                  string
cduration               int
campaign                int
pdays                   int
cprevious               int
poutcome                string
yes                     string
Time taken: 1.392 seconds, Fetched: 17 row(s)
hive> select count(*) from bigsql.bankcampaign;
Query ID = hive_20151127032543_5385156c-2ce6-4977-ba87-579a846d8822
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
…
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.42 sec   HDFS Read: 20584 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 420 msec
OK
148
Time taken: 17.252 seconds, Fetched: 1 row(s)
hive>

We now have to synchronize Big SQL with HCatalog in order to access the table from Big SQL. Log into Big SQL using jsqsh and check which tables are available to user bigsql.

[bigsql@rvm ~]$ jsqsh bigsql
Password:
Welcome to JSqsh 4.3
Type "\help" for help topics. Using JLine.
[rvm.svl.ibm.com][bigsql] 1>
[rvm.svl.ibm.com][bigsql] 1> \show tables;
+-----------+-------------+------------+------------+---------+----------+------------+-----------+-------------------+----------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM | TYPE_NAME | SELF_REF_COL_NAME | REF_GENERATION |
+-----------+-------------+------------+------------+---------+----------+------------+-----------+-------------------+----------------+
+-----------+-------------+------------+------------+---------+----------+------------+-----------+-------------------+----------------+
[rvm.svl.ibm.com][bigsql] 1>

Let's call the HCAT_SYNC_OBJECTS procedure, which imports the definition of the Hive objects into the local database catalogues, and try again.

[rvm.svl.ibm.com][bigsql] 1> CALL SYSHADOOP.HCAT_SYNC_OBJECTS('bigsql', '.*', 'a', 'REPLACE', 'CONTINUE');
+-----------+--------------+-----------+------+--------+---------+
| OBJSCHEMA | OBJNAME      | OBJATTRIB | TYPE | STATUS | DETAILS |
+-----------+--------------+-----------+------+--------+---------+
| BIGSQL    | BANKCAMPAIGN | [NULL]    | T    | OK     | [NULL]  |
+-----------+--------------+-----------+------+--------+---------+
1 row in results(first row: 5.12s; total: 5.12s)
[rvm.svl.ibm.com][bigsql] 1> \show tables;
+-----------+-------------+--------------+------------+---------+----------+------------+-----------+-------------------+----------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME   | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM | TYPE_NAME | SELF_REF_COL_NAME | REF_GENERATION |
+-----------+-------------+--------------+------------+---------+----------+------------+-----------+-------------------+----------------+
| [NULL]    | BIGSQL      | BANKCAMPAIGN | TABLE      | [NULL]  | [NULL]   | [NULL]     | [NULL]    | [NULL]            | [NULL]         |
+-----------+-------------+--------------+------------+---------+----------+------------+-----------+-------------------+----------------+
 [rvm.svl.ibm.com][bigsql] 1> select count(*) from bankcampaign;
+-----+
|   1 |
+-----+
| 148 |
+-----+
1 row in results(first row: 3.6s; total: 3.6s)
[rvm.svl.ibm.com][bigsql] 1> select * from bankcampaign limit 10;
+-----+---------------+----------+-----------+-----+---------+---------+------+----------+------+--------+-----------+----------+-------+-----------+----------+-----+
| AGE | JOB           | MARITAL  | EDUCATION | DEF | BALANCE | HOUSING | LOAN | CONTACT  | CDAY | CMONTH | CDURATION | CAMPAIGN | PDAYS | CPREVIOUS | POUTCOME | YES |
+-----+---------------+----------+-----------+-----+---------+---------+------+----------+------+--------+-----------+----------+-------+-----------+----------+-----+
|  33 | blue-collar   | single   | secondary | no  |      22 | no      | no   | cellular |   30 | jan    |        76 |        2 |   207 |         1 | failure  | no  |
|  46 | admin.        | married  | secondary | no  |     179 | yes     | no   | cellular |   29 | jan    |       194 |        4 |   436 |         1 | other    | no  |
|  55 | technician    | married  | secondary | no  |     273 | yes     | no   | cellular |   29 | jan    |        84 |        3 |   183 |         3 | failure  | no  |
|  45 | blue-collar   | divorced | secondary | no  |    1269 | yes     | no   | cellular |   29 | jan    |       549 |        2 |    -1 |         0 | unknown  | no  |
|  30 | student       | single   | secondary | no  |    3096 | no      | no   | cellular |   26 | jan    |       123 |        1 |    -1 |         0 | unknown  | no  |
|  36 | management    | married  | tertiary  | no  |       0 | yes     | no   | cellular |   28 | jan    |        60 |        1 |   250 |         1 | other    | no  |
|  45 | blue-collar   | single   | primary   | no  |     999 | no      | no   | cellular |   29 | jan    |       182 |        2 |    -1 |         0 | unknown  | no  |
|  26 | self-employed | single   | tertiary  | no  |     211 | no      | no   | cellular |   29 | jan    |       168 |        3 |    -1 |         0 | unknown  | no  |
|  59 | unemployed    | married  | primary   | no  |       0 | no      | no   | cellular |   30 | jan    |      3025 |        2 |    -1 |         0 | unknown  | no  |
|  28 | entrepreneur  | married  | unknown   | no  |    1034 | no      | no   | cellular |   29 | jan    |       113 |        1 |    -1 |         0 | unknown  | no  |
+-----+---------------+----------+-----------+-----+---------+---------+------+----------+------+--------+-----------+----------+-------+-----------+----------+-----+
10 rows in results(first row: 0.16s; total: 0.17s)
[rvm.svl.ibm.com][bigsql] 1>

We can confirm that the number of records in BigInsights matches the number of records for January in Netezza (148).

Running federated queries

First, let's see if we can reach the data in BigInsights from Netezza.
Connect using nzsql and use the fqread function to pull the first ten rows via Big SQL.

[nz@netezza ~]$ nzsql -d BANKDB
Welcome to nzsql, the IBM Netezza SQL interactive terminal.

Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

BANKDB.ADMIN(ADMIN)=> SELECT * FROM TABLE WITH FINAL (fqread('BIGSQL','BANKCAMPAIGN')) LIMIT 10;
 AGE |      JOB      | MARITAL  | EDUCATION | DEF | BALANCE | HOUSING | LOAN | CONTACT  | CDAY | CMONTH | CDURATION | CAMPAIGN | PDAYS | CPREVIOUS | POUTCOME | YES
-----+---------------+----------+-----------+-----+---------+---------+------+----------+------+--------+-----------+----------+-------+-----------+----------+-----
  33 | blue-collar   | single   | secondary | no  |      22 | no      | no   | cellular |   30 | jan    |        76 |        2 |   207 |         1 | failure  | no
  46 | admin.        | married  | secondary | no  |     179 | yes     | no   | cellular |   29 | jan    |       194 |        4 |   436 |         1 | other    | no
  55 | technician    | married  | secondary | no  |     273 | yes     | no   | cellular |   29 | jan    |        84 |        3 |   183 |         3 | failure  | no
  45 | blue-collar   | divorced | secondary | no  |    1269 | yes     | no   | cellular |   29 | jan    |       549 |        2 |    -1 |         0 | unknown  | no
  30 | student       | single   | secondary | no  |    3096 | no      | no   | cellular |   26 | jan    |       123 |        1 |    -1 |         0 | unknown  | no
  36 | management    | married  | tertiary  | no  |       0 | yes     | no   | cellular |   28 | jan    |        60 |        1 |   250 |         1 | other    | no
  45 | blue-collar   | single   | primary   | no  |     999 | no      | no   | cellular |   29 | jan    |       182 |        2 |    -1 |         0 | unknown  | no
  26 | self-employed | single   | tertiary  | no  |     211 | no      | no   | cellular |   29 | jan    |       168 |        3 |    -1 |         0 | unknown  | no
  59 | unemployed    | married  | primary   | no  |       0 | no      | no   | cellular |   30 | jan    |      3025 |        2 |    -1 |         0 | unknown  | no
  28 | entrepreneur  | married  | unknown   | no  |    1034 | no      | no   | cellular |   29 | jan    |       113 |        1 |    -1 |         0 | unknown  | no
(10 rows)

BANKDB.ADMIN(ADMIN)=>

So far so good. Now let's create a view on top of the fqread function so we can access the data sitting in Hadoop more easily.

Confirm that we can pull all the records (148) from BigInsights.

BANKDB.ADMIN(ADMIN)=> CREATE OR REPLACE VIEW BANKCAMPAIGN_JAN AS SELECT * FROM TABLE WITH FINAL (fqread('BIGSQL', 'BANKCAMPAIGN'));
CREATE VIEW
BANKDB.ADMIN(ADMIN)=> SELECT COUNT(*) FROM BANKCAMPAIGN_JAN;
 COUNT
-------
   148
(1 row)

BANKDB.ADMIN(ADMIN)=>

We can now delete the January records from the Netezza database.

BANKDB.ADMIN(ADMIN)=> DELETE FROM BANKCAMPAIGN WHERE CMONTH='jan';
DELETE 148
BANKDB.ADMIN(ADMIN)=>

We now have the old (January) records sitting in Hadoop and all the other records sitting in Netezza. A query with the SQL UNION ALL operator returns all of the data in a single federated result set.

BANKDB.ADMIN(ADMIN)=> SELECT COUNT(*) FROM (SELECT * FROM BANKCAMPAIGN UNION ALL SELECT * FROM BANKCAMPAIGN_JAN) CAMPAIGN;
 COUNT
-------
  4521
(1 row)

BANKDB.ADMIN(ADMIN)=>

We can also pull individual records or just the first 10 rows from the result set.

BANKDB.ADMIN(ADMIN)=> SELECT * FROM (SELECT * FROM BANKCAMPAIGN UNION ALL SELECT * FROM BANKCAMPAIGN_JAN) CAMPAIGN LIMIT 10;
 AGE |      JOB      | MARITAL  | EDUCATION | DEF | BALANCE | HOUSING | LOAN | CONTACT  | CDAY | CMONTH | CDURATION | CAMPAIGN | PDAYS | CPREVIOUS | POUTCOME | YES
-----+---------------+----------+-----------+-----+---------+---------+------+----------+------+--------+-----------+----------+-------+-----------+----------+-----
  30 | student       | single   | secondary | no  |    3096 | no      | no   | cellular |   26 | jan    |       123 |        1 |    -1 |         0 | unknown  | no
  59 | unemployed    | married  | primary   | no  |       0 | no      | no   | cellular |   30 | jan    |      3025 |        2 |    -1 |         0 | unknown  | no
  45 | blue-collar   | divorced | secondary | no  |    1269 | yes     | no   | cellular |   29 | jan    |       549 |        2 |    -1 |         0 | unknown  | no
  26 | self-employed | single   | tertiary  | no  |     211 | no      | no   | cellular |   29 | jan    |       168 |        3 |    -1 |         0 | unknown  | no
  28 | entrepreneur  | married  | unknown   | no  |    1034 | no      | no   | cellular |   29 | jan    |       113 |        1 |    -1 |         0 | unknown  | no
  35 | management    | single   | tertiary  | no  |    1350 | yes     | no   | cellular |   16 | apr    |       185 |        1 |   330 |         1 | failure  | no
  25 | blue-collar   | single   | primary   | no  |    -221 | yes     | no   | unknown  |   23 | may    |       250 |        1 |    -1 |         0 | unknown  | no
  41 | management    | married  | tertiary  | no  |    5883 | no      | no   | cellular |   20 | nov    |       182 |        2 |    -1 |         0 | unknown  | no
  35 | management    | single   | tertiary  | no  |     747 | no      | no   | cellular |   23 | feb    |       141 |        2 |   176 |         3 | failure  | no
  33 | blue-collar   | single   | secondary | no  |      22 | no      | no   | cellular |   30 | jan    |        76 |        2 |   207 |         1 | failure  | no
(10 rows)

BANKDB.ADMIN(ADMIN)=>

Note that these are not the same first ten rows we got when querying only the Netezza table – the combined result set interleaves rows from both Netezza and Hadoop, and without an ORDER BY no particular ordering is guaranteed.
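As a final illustration (output omitted), the federated view can take part in any ordinary SQL, for example counting campaign contacts per month across both Netezza and Hadoop:

BANKDB.ADMIN(ADMIN)=> SELECT CMONTH, COUNT(*) FROM (SELECT * FROM BANKCAMPAIGN UNION ALL SELECT * FROM BANKCAMPAIGN_JAN) CAMPAIGN GROUP BY CMONTH ORDER BY 2 DESC;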

This concludes our tutorial on using Fluid Query to build an online archive for Netezza in Hadoop.