Importing data from Oracle RDBMS into Hadoop using Apache Sqoop 2

Apache Sqoop is a bulk data transfer tool that links traditional relational databases (such as Oracle Database) with Apache Hadoop (HDFS, Hive, HBase). Sqoop can transfer data in both directions, and it also integrates with Oozie, allowing us to schedule automated import and export operations.

There are major differences between Sqoop and Sqoop 2. This tutorial focuses on the latter, as it is clearly the future of the Apache Sqoop project. There are various feature differences, but the important thing to remember is that Sqoop runs on the client side while Sqoop 2's architecture is server-based.

Sqoop vs Sqoop 2 Architecture

In this tutorial we will use Sqoop 2 to load a table residing in an Oracle Database schema into HDFS. To follow the tutorial you will need to have the following prerequisites in place:

  • An Oracle Database 11gR2 installation with the sample schemas installed (this is the setup I am using). There is a nice 11gR2 installation tutorial you can follow; just don't forget to include the sample schemas during the Configuration Options step.
  • An Apache Hadoop installation (I am using version 2.7.0). You can follow this installation guide if you are happy with the 32-bit Hadoop library or you can build 2.7.0 from source to enable 64-bit support.

Prerequisite check
Make sure your database is up and running, set a password, and unlock the HR account. Confirm that you can connect as user HR and that there is data in the EMPLOYEES table – this is the table we will be loading into HDFS.

[oracle@oradb ~]$ sqlplus / as sysdba

SQL*Plus: Release Production on Mon Jul 3 06:30:10 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter user hr identified by hr account unlock;

User altered.

SQL> conn hr/hr;
SQL> select count(*) from employees;



Confirm that your Hadoop/Sqoop machine can reach the database server over the network. The host running my Oracle Database is called oradb.

[haduser@hadoop ~]$ ping oradb
PING oradb ( 56(84) bytes of data.
64 bytes from oradb ( icmp_seq=1 ttl=64 time=0.584 ms
64 bytes from oradb ( icmp_seq=2 ttl=64 time=0.358 ms
64 bytes from oradb ( icmp_seq=3 ttl=64 time=0.316 ms
--- oradb ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.316/0.419/0.584/0.118 ms
[haduser@hadoop ~]$

Make sure HDFS, YARN, and the History Server are up, or start them if they aren't. My Hadoop distribution is located in /opt/hadoop-2.7.0 and I'll refer to this path as my Hadoop home throughout the tutorial.

It is also important to change your core-site.xml (/opt/hadoop-2.7.0/etc/hadoop/core-site.xml) and add the following property in the <configuration></configuration> section.


This will help you avoid hitting unpleasant Connection reset errors caused by an issue with the Oracle JDBC driver (the details are on StackOverflow if you are interested). Start your HDFS services, or, if they are already up, bounce them after the core-site.xml change.
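The property itself did not survive on this page, so the fragment below is an illustration only, not necessarily what the original article listed. A commonly cited workaround for the ojdbc Connection reset problem is to give the task JVMs a non-blocking entropy source via java.security.egd; here it is carried in the (real, though deprecated) mapred.child.java.opts property:

```xml
<!-- Illustrative stand-in: the original property was lost from this page.
     Points the map/reduce child JVMs at /dev/urandom so the Oracle JDBC
     driver does not block (and reset) while waiting for entropy. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Djava.security.egd=file:/dev/../dev/urandom</value>
</property>
```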

[haduser@hadoop ~]$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-namenode-hadoop.out
localhost: starting datanode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-datanode-hadoop.out
Starting secondary namenodes []
starting secondarynamenode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-secondarynamenode-hadoop.out
[haduser@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.7.0/logs/yarn-haduser-resourcemanager-hadoop.out
localhost: starting nodemanager, logging to /opt/hadoop-2.7.0/logs/yarn-haduser-nodemanager-hadoop.out
[haduser@hadoop ~]$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/hadoop-2.7.0/logs/mapred-haduser-historyserver-hadoop.out
[haduser@hadoop ~]$

Prepare a directory to store the imported HR data. Note that my Hadoop owner is called haduser so I am using /user/haduser as my base directory.

[haduser@hadoop ~]$ hadoop fs -ls /user/haduser
[haduser@hadoop ~]$ hadoop fs -mkdir /user/haduser/hr
[haduser@hadoop ~]$ hadoop fs -ls
Found 1 items
drwxr-xr-x   - haduser supergroup          0 2015-08-04 06:30 hr
[haduser@hadoop ~]$

Sqoop installation
Start by downloading, extracting, and putting Sqoop in /usr/lib/sqoop.

[haduser@hadoop ~]$ wget
--2015-08-04 06:32:30--
Resolving (
Connecting to (||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68177818 (65M) [application/x-gzip]
Saving to: ‘sqoop-1.99.6-bin-hadoop200.tar.gz’

100%[=====================================================================================================================================================================>] 68,177,818   644KB/s   in 93s

2015-08-04 06:34:03 (716 KB/s) - ‘sqoop-1.99.6-bin-hadoop200.tar.gz’ saved [68177818/68177818]

[haduser@hadoop ~]$ tar -xf sqoop-1.99.6-bin-hadoop200.tar.gz
[haduser@hadoop ~]$ sudo mv sqoop-1.99.6-bin-hadoop200 /usr/lib/sqoop
[haduser@hadoop ~]$

Set the SQOOP_HOME environment variable and add $SQOOP_HOME/bin to your path.

[haduser@hadoop ~]$ cat << 'EOF' >> ~/.bash_profile
> # Sqoop
> export SQOOP_HOME=/usr/lib/sqoop
> export PATH=$PATH:$SQOOP_HOME/bin
> EOF
[haduser@hadoop ~]$ source ~/.bash_profile
[haduser@hadoop ~]$

Edit the $SQOOP_HOME/server/conf/ file and change the value of the common.loader property to include the Hadoop JAR files from your Hadoop deployment.


Modify $SQOOP_HOME/server/conf/ and set the correct Hadoop configuration directory:

# Hadoop configuration directory

Find the line


and replace the @BASEDIR@ placeholder with the correct Sqoop base directory:


It is also a good idea to replace every occurrence of @LOGDIR@ with a proper location for the system-generated log files.

[haduser@hadoop ~]$ sed -i -- 's/\@LOGDIR\@/\/usr\/lib\/sqoop\/server\/logs/g' $SQOOP_HOME/server/conf/
[haduser@hadoop ~]$
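If you want to see exactly what that substitution does before touching the real configuration, you can rehearse it on a scratch file. The property name below is only an illustrative stand-in, not copied from the actual Sqoop file:

```shell
# Rehearse the @LOGDIR@ substitution on a throwaway file first.
# (The property name is an example stand-in, not the real Sqoop one.)
tmpfile=$(mktemp)
echo 'sqoop.log.dir=@LOGDIR@' > "$tmpfile"
sed -i -- 's/\@LOGDIR\@/\/usr\/lib\/sqoop\/server\/logs/g' "$tmpfile"
cat "$tmpfile"    # sqoop.log.dir=/usr/lib/sqoop/server/logs
rm -f "$tmpfile"
```

Once the output looks right, the same sed command can be pointed at the real file under $SQOOP_HOME/server/conf/.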

Verify your Sqoop installation by running sqoop2-tool verify.

[haduser@hadoop ~]$ sqoop2-tool verify
Sqoop home directory: /usr/lib/sqoop
Setting SQOOP_HTTP_PORT:     12000
Setting SQOOP_ADMIN_PORT:     12001
Adding to CATALINA_OPTS:    -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Sqoop tool executor:
        Version: 1.99.6
        Revision: 07244c3915975f26f03d9e1edf09ab7d06619bb8
        Compiled on Wed Apr 29 10:40:43 CST 2015 by root
Running tool: class
log4j: Finished configuring.
Exception in thread "PurgeThread" org.apache.sqoop.common.SqoopException: JDBCREPO_0009:Failed to finalize transaction
        at org.apache.sqoop.repository.JdbcRepositoryTransaction.close(
        at org.apache.sqoop.repository.JdbcRepository.doWithConnection(
        at org.apache.sqoop.repository.JdbcRepository.doWithConnection(
        at org.apache.sqoop.repository.JdbcRepository.purgeSubmissions(
        at org.apache.sqoop.driver.JobManager$
Caused by: java.sql.SQLNonTransientConnectionException: No current connection.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.noCurrentConnection(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.checkIfClosed(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.setupContextStack(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.commit(Unknown Source)
        at org.apache.commons.dbcp.DelegatingConnection.commit(
        at org.apache.commons.dbcp.DelegatingConnection.commit(
        at org.apache.commons.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.commit(
        at org.apache.sqoop.repository.JdbcRepositoryTransaction.close(
        ... 4 more
Caused by: java.sql.SQLException: No current connection.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
        ... 15 more
Verification was successful.
Tool class has finished correctly.
[haduser@hadoop ~]$

Pay no attention to the JDBCREPO_0009 exception. If you get the “Verification was successful” message at the end, your installation should be fine.

Oracle JDBC drivers
Time to get the Oracle Database JDBC drivers.

Go to the Oracle JDBC driver download page on OTN, accept the OTN License Agreement, and download the ojdbc6.jar file. You might have to create an account to access OTN, but it is free and only requires a valid e-mail address.

Once you've got the driver simply put it in /usr/lib/sqoop/lib.

[haduser@hadoop ~]$ mkdir -p /usr/lib/sqoop/lib
[haduser@hadoop ~]$ mv ojdbc6.jar /usr/lib/sqoop/lib/
[haduser@hadoop ~]$

Setting up a link to the Oracle schema
Start the Sqoop server and connect to it via sqoop2-shell. You'll have to point sqoop2-shell at your local Sqoop instance using the set server command.

[haduser@hadoop ~]$ sqoop2-server start
Sqoop home directory: /usr/lib/sqoop
Setting SQOOP_HTTP_PORT:     12000
Setting SQOOP_ADMIN_PORT:     12001
Adding to CATALINA_OPTS:    -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Using CATALINA_BASE:   /usr/lib/sqoop/server
Using CATALINA_HOME:   /usr/lib/sqoop/server
Using CATALINA_TMPDIR: /usr/lib/sqoop/server/temp
Using JRE_HOME:        /opt/jdk1.8.0_45
Using CLASSPATH:       /usr/lib/sqoop/server/bin/bootstrap.jar

[haduser@hadoop ~]$ sqoop2-shell
Sqoop home directory: /usr/lib/sqoop
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully

We can run a show version command to confirm that we are connected to the server and it is responding to requests.

sqoop:000> show version --all
client version:
  Sqoop 1.99.6 source revision 07244c3915975f26f03d9e1edf09ab7d06619bb8
  Compiled by root on Wed Apr 29 10:40:43 CST 2015
server version:
  Sqoop 1.99.6 source revision 07244c3915975f26f03d9e1edf09ab7d06619bb8
  Compiled by root on Wed Apr 29 10:40:43 CST 2015
API versions:

Let's look at the available connectors.

sqoop:000> show connector
| Id |          Name          | Version |                        Class                         | Supported Directions |
| 1  | generic-jdbc-connector | 1.99.6  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
| 2  | kite-connector         | 1.99.6  | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 3  | hdfs-connector         | 1.99.6  | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | kafka-connector        | 1.99.6  | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |

We will use the generic JDBC connector (Id 1) to create a link through the Oracle JDBC driver. Replace the values with the listener address and the service name (or SID) of your Oracle instance.

sqoop:000> create link -c 1
Creating link for connector with id 1
Please fill following values to create new link object
Name: Oracle HR

Link configuration

JDBC Driver Class: oracle.jdbc.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@//
Username: hr
Password: **
JDBC Connection Properties:
There are currently 0 values in the map:
New link was successfully created with validation status OK and persistent id 1
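For reference, the Oracle thin driver accepts two connection string forms. The host, port, and service/SID shown below (oradb, 1521, orcl) are example values, not taken from this tutorial:

```
jdbc:oracle:thin:@//<host>:<port>/<service_name>    e.g. jdbc:oracle:thin:@//oradb:1521/orcl
jdbc:oracle:thin:@<host>:<port>:<SID>               e.g. jdbc:oracle:thin:@oradb:1521:orcl
```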

Next we need to create a link for our HDFS destination (connector Id 3). Don't forget to replace localhost:9000 and /opt/hadoop-2.7.0/etc/hadoop/ with the relevant URI and conf directory for your HDFS installation.

sqoop:000> create link -c 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: HDFS Link

Link configuration

HDFS URI: hdfs://localhost:9000/
Hadoop conf directory: /opt/hadoop-2.7.0/etc/hadoop/
New link was successfully created with validation status OK and persistent id 2

We can verify that both our source and destination links are in place and take note of their Ids.

sqoop:000> show link
| Id |   Name    | Connector Id |     Connector Name     | Enabled |
| 1  | Oracle HR | 1            | generic-jdbc-connector | true    |
| 2  | HDFS Link | 3            | hdfs-connector         | true    |

Creating and running the Sqoop job
We can now create a Sqoop job that will extract data from the Oracle database (link Id 1) and place the data into HDFS (link Id 2). We will configure the job to extract all records from the HR.EMPLOYEES table (no custom SQL statement) and store them under the /user/haduser/hr HDFS directory as uncompressed text files.

sqoop:000> create job -f 1 -t 2
Creating job for links with from id 1 and to id 2
Please fill following values to create new job object

From database configuration

Schema name: HR
Table name: EMPLOYEES
Table SQL statement:
Table column names:
Partition column name:
Null value allowed for the partition column:
Boundary query:

Incremental read

Check column:
Last value:

To HDFS configuration

Override null value:
Null value:
Output format:
Choose: 0
Compression format:
  0 : NONE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: /user/haduser/hr
Append mode:

Throttling resources

New job was successfully created with validation status OK  and persistent id 1

We can look at the job we've just created by running a show job command.

sqoop:000> show job
| Id |        Name         | From Connector | To Connector | Enabled |
| 1  | Oracle HR EMPLOYEES | 1              | 3            | true    |

Start the job by running a start job command and providing the correct Id (1).

sqoop:000> start job -j 1 -s
Submission details
Job ID: 1
Server URL: http://localhost:12000/sqoop/
Created by: haduser
Creation date: 2015-08-07 06:49:31 BST
Lastly updated by: haduser
External ID: job_1438926535423_0001
2015-08-07 06:49:31 BST: BOOTING  - Progress is not available
2015-08-07 06:49:45 BST: RUNNING  - 0.00 %
2015-08-07 06:49:55 BST: RUNNING  - 15.00 %
2015-08-07 06:50:05 BST: RUNNING  - 40.00 %
2015-08-07 06:50:19 BST: SUCCEEDED
                FILE_LARGE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                HDFS_READ_OPS: 10
                HDFS_BYTES_READ: 1349
                HDFS_LARGE_READ_OPS: 0
                FILE_READ_OPS: 0
                FILE_BYTES_WRITTEN: 2701930
                FILE_BYTES_READ: 0
                HDFS_WRITE_OPS: 10
                HDFS_BYTES_WRITTEN: 10881
                BYTES_WRITTEN: 0
                BYTES_READ: 0
                TOTAL_LAUNCHED_MAPS: 18
                MB_MILLIS_MAPS: 87859200
                VCORES_MILLIS_MAPS: 85800
                NUM_FAILED_MAPS: 8
                SLOTS_MILLIS_MAPS: 85800
                OTHER_LOCAL_MAPS: 18
                MILLIS_MAPS: 85800
                ROWS_READ: 107
                ROWS_WRITTEN: 107
                SPILLED_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                VIRTUAL_MEMORY_BYTES: 40715288576
                MAP_INPUT_RECORDS: 0
                SPLIT_RAW_BYTES: 1349
                MAP_OUTPUT_RECORDS: 107
                FAILED_SHUFFLE: 0
                PHYSICAL_MEMORY_BYTES: 2296299520
                GC_TIME_MILLIS: 1142
                CPU_MILLISECONDS: 10710
                COMMITTED_HEAP_BYTES: 1849163776
Job executed successfully

After the successful completion of the job we can look inside the target HDFS directory and confirm that it contains a set of text files.

[haduser@hadoop ~]$ hadoop fs -ls /user/haduser/hr
Found 14 items
-rw-r--r--   3 haduser supergroup       1158 2015-08-07 06:49 /user/haduser/hr/2f5d9aa4-465e-4afb-bc90-edbaadc7783c.txt
-rw-r--r--   3 haduser supergroup          0 2015-08-07 06:49 /user/haduser/hr/3f72430b-3d82-4be9-8c91-28a4b29ac8d8.txt
-rw-r--r--   3 haduser supergroup          0 2015-08-07 06:49 /user/haduser/hr/44d388e8-2c13-4de5-aef0-82a1c0a76adb.txt
-rw-r--r--   3 haduser supergroup       1107 2015-08-07 06:49 /user/haduser/hr/5cd62736-1562-48ac-b563-b686a1a2df2c.txt
-rw-r--r--   3 haduser supergroup       1042 2015-08-07 06:50 /user/haduser/hr/682cb80a-6304-48f1-84e0-8dba0391e07e.txt
-rw-r--r--   3 haduser supergroup       1048 2015-08-07 06:50 /user/haduser/hr/6cb593bf-65aa-4237-976d-3430124b1f07.txt
-rw-r--r--   3 haduser supergroup       1105 2015-08-07 06:50 /user/haduser/hr/71a4bf9d-dee7-46ef-9072-81a173c8d01f.txt
-rw-r--r--   3 haduser supergroup       1104 2015-08-07 06:49 /user/haduser/hr/8cd4b362-190b-4d50-abc1-7c15c36aa73c.txt
-rw-r--r--   3 haduser supergroup          0 2015-08-07 06:49 /user/haduser/hr/8de393b7-a957-40a2-a0a5-370cfa3c51b1.txt
-rw-r--r--   3 haduser supergroup       1101 2015-08-07 06:50 /user/haduser/hr/a8dc8250-c634-4907-8cc5-81ddd784e82b.txt
-rw-r--r--   3 haduser supergroup       1087 2015-08-07 06:50 /user/haduser/hr/d0d19c49-a1d7-4a0e-be71-d4cf54a92a13.txt
-rw-r--r--   3 haduser supergroup       1119 2015-08-07 06:49 /user/haduser/hr/d41cf292-29c9-47e1-b3cc-ff700a99814f.txt
-rw-r--r--   3 haduser supergroup       1010 2015-08-07 06:49 /user/haduser/hr/d55861b3-da5d-46f9-8cff-76ce0f0bc0cd.txt
-rw-r--r--   3 haduser supergroup          0 2015-08-07 06:49 /user/haduser/hr/ea2eeba3-7695-4094-b2ac-f10d19c30517.txt
[haduser@hadoop ~]$

Each of them is in fact a CSV file populated with records from the HR.EMPLOYEES table.

[haduser@hadoop ~]$ hadoop fs -cat /user/haduser/hr/2f5d9aa4-465e-4afb-bc90-edbaadc7783c.txt
153,'Christopher','Olsen','COLSEN','011.44.1344.498718','2006-03-30 00:00:00.000','SA_REP',8000,0.2,145,80
154,'Nanette','Cambrault','NCAMBRAU','011.44.1344.987668','2006-12-09 00:00:00.000','SA_REP',7500,0.2,145,80
155,'Oliver','Tuvault','OTUVAULT','011.44.1344.486508','2007-11-23 00:00:00.000','SA_REP',7000,0.15,145,80
156,'Janette','King','JKING','011.44.1345.429268','2004-01-30 00:00:00.000','SA_REP',10000,0.35,146,80
157,'Patrick','Sully','PSULLY','011.44.1345.929268','2004-03-04 00:00:00.000','SA_REP',9500,0.35,146,80
158,'Allan','McEwen','AMCEWEN','011.44.1345.829268','2004-08-01 00:00:00.000','SA_REP',9000,0.35,146,80
159,'Lindsey','Smith','LSMITH','011.44.1345.729268','2005-03-10 00:00:00.000','SA_REP',8000,0.3,146,80
160,'Louise','Doran','LDORAN','011.44.1345.629268','2005-12-15 00:00:00.000','SA_REP',7500,0.3,146,80
161,'Sarath','Sewall','SSEWALL','011.44.1345.529268','2006-11-03 00:00:00.000','SA_REP',7000,0.25,146,80
162,'Clara','Vishney','CVISHNEY','011.44.1346.129268','2005-11-11 00:00:00.000','SA_REP',10500,0.25,147,80
163,'Danielle','Greene','DGREENE','011.44.1346.229268','2007-03-19 00:00:00.000','SA_REP',9500,0.15,147,80
[haduser@hadoop ~]$
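Since the exported files are plain CSV, ordinary command-line tools can work with them directly. A quick awk sketch, run here on a record copied from the output above (against the real cluster you could similarly pipe `hadoop fs -cat /user/haduser/hr/*.txt` into awk or wc):

```shell
# Extract the employee id, first/last name, and salary fields from one
# exported record (copied verbatim from the hadoop fs -cat output above).
record="153,'Christopher','Olsen','COLSEN','011.44.1344.498718','2006-03-30 00:00:00.000','SA_REP',8000,0.2,145,80"
echo "$record" | awk -F, '{print $1, $2, $3, $8}'
# 153 'Christopher' 'Olsen' 8000
```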

This concludes our short Sqoop 2 tutorial. If you are interested in finding more about the advanced features provided by Sqoop 2 take a look at the official documentation.