Compiling Hadoop 2.7.0 on 64-bit Linux

If you are planning to run Hadoop on a 64-bit OS, you might want to compile it from source instead of using the pre-built 32-bit i386-Linux native Hadoop library (libhadoop.so). Besides the performance advantages, this also removes the annoying message that keeps popping up when the 32-bit library is used on a 64-bit OS:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In this tutorial we will see how to prepare a clean 64-bit CentOS 7 system for accommodating Hadoop and how to actually build Hadoop from source.

Prerequisites

The only prerequisite for following this tutorial is a minimal installation of a 64-bit CentOS 7 Linux. The host I am using is named hadoop.

If this is a dev/test system you might as well disable SELinux and firewalld to make your life easier.

[root@hadoop opt]# systemctl disable firewalld
rm '/etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service'
rm '/etc/systemd/system/basic.target.wants/firewalld.service'
[root@hadoop opt]# systemctl stop firewalld
[root@hadoop opt]#

To disable SELinux, change the SELINUX parameter in /etc/selinux/config from enforcing to disabled. You will also have to reboot the system for the change to take effect.

[root@hadoop ~]# sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
[root@hadoop ~]# reboot
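
After the reboot, getenforce should confirm that SELinux is off:

[root@hadoop ~]# getenforce
Disabled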

JDK Installation

You can download the latest JDK from the Oracle Technology Network or, if you know the exact URL for the version you need, fetch it directly with wget.

[root@hadoop ~]# wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz
--2015-05-08 11:41:28--  http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz
...
Saving to: ‘jdk-8u45-linux-x64.tar.gz’

100%[=====================================================================================================================================================================>] 173,271,626 2.04MB/s   in 85s

2015-05-08 11:42:54 (1.94 MB/s) - ‘jdk-8u45-linux-x64.tar.gz’ saved [173271626/173271626]

[root@hadoop ~]#

Extract the archive to /opt and use alternatives to point the java symbolic link to your newly installed JDK.

[root@hadoop ~]# tar -xzf jdk-8u45-linux-x64.tar.gz -C /opt/
[root@hadoop ~]# alternatives --install /usr/bin/java java /opt/jdk1.8.0_45/bin/java 2
[root@hadoop ~]# alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /opt/jdk1.8.0_45/bin/java

Enter to keep the current selection[+], or type selection number: 1
[root@hadoop ~]#

Confirm that java is on the path and that it points to the correct version.

[root@hadoop ~]# java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
[root@hadoop ~]#

Create a dedicated Hadoop user account

Our next step is to create a dedicated user account that owns and runs the Hadoop software. I am going to name my user haduser and make it a member of a group called hadgroup.

[root@hadoop ~]# groupadd hadgroup
[root@hadoop ~]# useradd haduser -G hadgroup
[root@hadoop ~]# passwd haduser
Changing password for user haduser.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@hadoop ~]#
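
You can quickly verify the new account and its group membership with id; hadgroup should appear in the groups list:

[root@hadoop ~]# id haduser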

Key based authentication for haduser

Hadoop requires secure shell connections to localhost without a passphrase, so let's configure key-based SSH access. Switch to haduser, generate an RSA key, and add it to authorized_keys.

[root@hadoop ~]# su - haduser
[haduser@hadoop ~]$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/haduser/.ssh/id_rsa):
Created directory '/home/haduser/.ssh'.
Your identification has been saved in /home/haduser/.ssh/id_rsa.
Your public key has been saved in /home/haduser/.ssh/id_rsa.pub.
The key fingerprint is:
2d:a3:20:58:45:69:0f:30:ee:21:0d:44:66:a0:3f:18 haduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|*=ooo.           |
|++ o+            |
|E =. o           |
| O .  .  .       |
|o = .   S .      |
|   o . . o       |
|      .          |
|                 |
|                 |
+-----------------+
[haduser@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[haduser@hadoop ~]$ chmod 0600 ~/.ssh/authorized_keys
[haduser@hadoop ~]$

Do a quick test by invoking date via ssh and adding localhost to the list of known hosts if necessary.

[haduser@hadoop ~]$ ssh localhost date
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is 25:7c:ff:5f:cf:a3:c2:c7:e2:f7:a3:92:ea:63:29:5b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Fri May  8 11:48:22 BST 2015
[haduser@hadoop ~]$ 

Install build tools

Compiling Hadoop requires a set of tools and libraries. Let's start by adding the default development toolset (gcc, autoconf etc.). It is easier to do this as root.

[root@hadoop ~]# yum groupinstall "Development Tools" "Development Libraries"
Loaded plugins: fastestmirror
...
Transaction Summary
===============================================================================================================================================================================================================
Install  26 Packages (+69 Dependent packages)

Total download size: 92 M
Installed size: 270 M
Is this ok [y/d/N]: y
Downloading packages:
...
Complete!
[root@hadoop ~]#

Two more packages required for successfully compiling Hadoop are openssl-devel and cmake.

[root@hadoop ~]# yum install openssl-devel cmake
Loaded plugins: fastestmirror
...
Total download size: 9.6 M
Installed size: 32 M
Is this ok [y/d/N]: y
...
Complete!
[root@hadoop ~]#
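
Optionally, if you also want native support for the compression codecs, you can install their development packages as well. zlib-devel is usually pulled in by the development groups already; snappy-devel and bzip2-devel are in the standard CentOS 7 repositories:

[root@hadoop ~]# yum install snappy-devel bzip2-devel zlib-devel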

We will also need Apache Maven (a build automation tool) and Protocol Buffers (a serialization library developed by Google). Let's start by downloading and unpacking the latest version of Maven (3.3.3 at the time of writing).

[root@hadoop ~]# wget http://mirrors.gigenet.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
--2015-05-08 11:58:28--  http://mirrors.gigenet.com/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
Resolving mirrors.gigenet.com (mirrors.gigenet.com)... 69.65.15.34
Connecting to mirrors.gigenet.com (mirrors.gigenet.com)|69.65.15.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8042383 (7.7M) [application/x-gzip]
Saving to: ‘apache-maven-3.3.3-bin.tar.gz’

100%[=====================================================================================================================================================================>] 8,042,383    816KB/s   in 11s

2015-05-08 11:58:39 (732 KB/s) - ‘apache-maven-3.3.3-bin.tar.gz’ saved [8042383/8042383]

[root@hadoop ~]# tar -zxf apache-maven-3.3.3-bin.tar.gz -C /opt/
[root@hadoop ~]#

Our next step is to create a dedicated initialization script that will set the following environment variables for Maven.

JAVA_HOME=/opt/jdk1.8.0_45
M3_HOME=/opt/apache-maven-3.3.3
PATH=/opt/apache-maven-3.3.3/bin:$PATH

We can create a new file (maven.sh) in /etc/profile.d and put the variables above inside. Note that the here-document delimiter is quoted ('EOF') so that $PATH is written to the file literally instead of being expanded right now:

[root@hadoop ~]# cat << 'EOF' > /etc/profile.d/maven.sh
> export JAVA_HOME=/opt/jdk1.8.0_45
> export M3_HOME=/opt/apache-maven-3.3.3
> export PATH=/opt/apache-maven-3.3.3/bin:$PATH
> EOF
[root@hadoop ~]#

Log out and back in, then verify that M3_HOME is correctly set.

login as: root
root@192.168.56.110's password:
Last login: Fri May  8 11:35:02 2015 from 192.168.56.1
[root@hadoop ~]# echo $M3_HOME
/opt/apache-maven-3.3.3
[root@hadoop ~]#

Confirm that you can successfully start Maven and that it is using the correct Java version.

[root@hadoop ~]# mvn -version
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T12:57:37+01:00)
Maven home: /opt/apache-maven-3.3.3
Java version: 1.8.0_45, vendor: Oracle Corporation
Java home: /opt/jdk1.8.0_45/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: "unix"
[root@hadoop ~]#

Time to deal with Protocol Buffers. Let's start by downloading and unpacking it. Note that I am not using the latest version: Hadoop 2.7.0 requires Protocol Buffers 2.5.0.

[root@hadoop ~]# wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
--2015-05-08 12:29:26--  https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
...
2015-05-08 12:29:30 (761 KB/s) - ‘protobuf-2.5.0.tar.gz’ saved [2401901/2401901]

[root@hadoop ~]# tar -xzf protobuf-2.5.0.tar.gz -C /root
[root@hadoop ~]#

Run the configure script to prepare the source code for compilation.

[root@hadoop ~]# cd protobuf-2.5.0
[root@hadoop protobuf-2.5.0]# ./configure
checking whether to enable maintainer-specific portions of Makefiles... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
...
config.status: executing libtool commands
[root@hadoop protobuf-2.5.0]#

Build the Protocol Buffer objects by running make.

[root@hadoop protobuf-2.5.0]# make
make  all-recursive
make[1]: Entering directory `/root/protobuf-2.5.0'
Making all in .
make[2]: Entering directory `/root/protobuf-2.5.0'
make[2]: Leaving directory `/root/protobuf-2.5.0'
...
make[1]: Leaving directory `/root/protobuf-2.5.0'
[root@hadoop protobuf-2.5.0]#

Invoke make install to put the objects we’ve just built into their proper locations and run protoc to confirm that we can invoke the protocol buffers compiler.

[root@hadoop protobuf-2.5.0]# make install
Making install in .
make[1]: Entering directory `/root/protobuf-2.5.0'
make[1]: Leaving directory `/root/protobuf-2.5.0/src'
[root@hadoop protobuf-2.5.0]# protoc --version
libprotoc 2.5.0
[root@hadoop protobuf-2.5.0]#
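
On some systems protoc fails at this point with an error about libprotobuf.so.8 not being found, because make install places the libraries in /usr/local/lib, which is not always on the runtime linker path. If that happens to you, add the directory to the linker configuration (the file name below is arbitrary) and refresh the cache:

[root@hadoop protobuf-2.5.0]# echo "/usr/local/lib" > /etc/ld.so.conf.d/protobuf.conf
[root@hadoop protobuf-2.5.0]# ldconfig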

This concludes all preparations and we are now ready to crack on with compiling Hadoop.

Compiling Hadoop

We perform the compilation as the Hadoop owner (haduser).

[root@hadoop ~]# su - haduser
Last login: Fri May  8 12:04:21 BST 2015 on pts/0
[haduser@hadoop ~]$

Get Hadoop’s source code from Apache and unpack it.

[haduser@hadoop ~]$ wget http://www.mirrorservice.org/sites/ftp.apache.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0-src.tar.gz
--2015-05-08 12:19:10--  http://www.mirrorservice.org/sites/ftp.apache.org/hadoop/common/hadoop-2.7.0/hadoop-2.7.0-src.tar.gz
Resolving www.mirrorservice.org (www.mirrorservice.org)... 212.219.56.184
Connecting to www.mirrorservice.org (www.mirrorservice.org)|212.219.56.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18101141 (17M) [application/x-gzip]
Saving to: ‘hadoop-2.7.0-src.tar.gz’

100%[=====================================================================================================================================================================>] 18,101,141  1.98MB/s   in 8.9s

2015-05-08 12:19:19 (1.95 MB/s) - ‘hadoop-2.7.0-src.tar.gz’ saved [18101141/18101141]

[haduser@hadoop ~]$ tar xf hadoop-2.7.0-src.tar.gz
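
Change into the source directory, since the build has to be started from there.

[haduser@hadoop ~]$ cd hadoop-2.7.0-src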

Invoke Maven with the appropriate build profile, sit back and wait for the build process to complete. Depending on your system this may take a while.

[haduser@hadoop hadoop-2.7.0-src]$ mvn package -Pdist,native -DskipTests -Dtar
[INFO] Scanning for projects...
Downloading: https://repo.maven.apache.org/maven2/org/apache/felix/maven-bundle-plugin/2.5.0/maven-bundle-plugin-2.5.0.pom
...
[INFO] Apache Hadoop Distribution ......................... SUCCESS [ 22.888 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:00 min
[INFO] Finished at: 2015-05-08T12:42:01+01:00
[INFO] Final Memory: 218M/854M
[INFO] ------------------------------------------------------------------------
[haduser@hadoop hadoop-2.7.0-src]$

After the build process is complete, switch back to root and move the compiled code to its final location; I normally use /opt.

[root@hadoop ~]#  mv /home/haduser/hadoop-2.7.0-src/hadoop-dist/target/hadoop-2.7.0 /opt/
[root@hadoop ~]#
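
At this point you can already confirm that the native library we were after is a 64-bit binary; file should report an ELF 64-bit shared object:

[root@hadoop ~]# file /opt/hadoop-2.7.0/lib/native/libhadoop.so.1.0.0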

Switch back to haduser and add the HADOOP_HOME and PATH environment variables to the user's profile. As with maven.sh, the here-document delimiter is quoted so that $PATH and $HADOOP_HOME are written literally rather than expanded now (at this point $HADOOP_HOME is still empty):

[haduser@hadoop ~]$ cat >> ~/.bash_profile << 'EOF'
> # Hadoop
> export HADOOP_HOME=/opt/hadoop-2.7.0
> export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
> EOF
[haduser@hadoop ~]$ source ~/.bash_profile
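
Confirm that the hadoop command is now on the path and reports the version we have just built:

[haduser@hadoop ~]$ hadoop version
Hadoop 2.7.0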

Testing Hadoop

Our final task is to quickly configure Hadoop and test the code we’ve just built.
Edit the $HADOOP_HOME/etc/hadoop/core-site.xml file and put the following lines between the <configuration></configuration> tags. (fs.default.name is the deprecated predecessor of fs.defaultFS; both are accepted by 2.7.0, the older name just triggers a deprecation warning.)

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop-2.7.0/tmp</value>
</property>
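
Since this is a single-node setup, you may also want to drop the default block replication factor from 3 to 1 by adding the following to $HADOOP_HOME/etc/hadoop/hdfs-site.xml. This is optional, but it stops HDFS from reporting every block as under-replicated.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>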

Create a new mapred-site.xml file based on the standard template by copying the mapred-site.xml.template file.

[haduser@hadoop ~]$ cp /opt/hadoop-2.7.0/etc/hadoop/mapred-site.xml.template /opt/hadoop-2.7.0/etc/hadoop/mapred-site.xml
[haduser@hadoop ~]$

Edit the newly created $HADOOP_HOME/etc/hadoop/mapred-site.xml file and put the following between the <configuration></configuration> tags.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
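
If you plan to actually run MapReduce jobs on YARN, the NodeManager also needs the shuffle auxiliary service, configured between the same tags in $HADOOP_HOME/etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>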

Format the NameNode.

[haduser@hadoop ~]$ hdfs namenode -format
15/05/08 11:04:18 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.56.110
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.7.0
...
15/05/08 11:04:20 INFO common.Storage: Storage directory /opt/hadoop-2.7.0/tmp/dfs/name has been successfully formatted.
15/05/08 11:04:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/05/08 11:04:20 INFO util.ExitUtil: Exiting with status 0
15/05/08 11:04:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.56.110
************************************************************/
[haduser@hadoop ~]$

Now start the Hadoop DFS and Yarn daemons.

[haduser@hadoop ~]$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-namenode-hadoop.out
localhost: starting datanode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-datanode-hadoop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop-2.7.0/logs/hadoop-haduser-secondarynamenode-hadoop.out
[haduser@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.7.0/logs/yarn-haduser-resourcemanager-hadoop.out
localhost: starting nodemanager, logging to /opt/hadoop-2.7.0/logs/yarn-haduser-nodemanager-hadoop.out
[haduser@hadoop ~]$
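
A quick jps should now list all five daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.

[haduser@hadoop ~]$ jps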

Create a test directory and list the HDFS root directory to verify.

[haduser@hadoop ~]$ hadoop fs -mkdir hdfs://localhost:9000/test
[haduser@hadoop ~]$ hadoop fs -ls hdfs://localhost:9000/
Found 1 items
drwxr-xr-x   - haduser supergroup          0 2015-05-08 11:06 hdfs://localhost:9000/test
[haduser@hadoop ~]$
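
As a final smoke test you can run one of the MapReduce examples bundled with the distribution, for instance the pi estimator (the small arguments keep the run short):

[haduser@hadoop ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 2 10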

Start a web browser and open an HTTP connection to your Hadoop host's IP at port 50070 to confirm you can access the web console.

[Image: Hadoop web console]

Congratulations, you have successfully built Hadoop for the 64-bit CentOS 7 platform!