Building SystemML from source

Apache SystemML is a declarative, large-scale machine learning platform that provides automatic optimisation for custom machine learning algorithms. It supports two high-level front-end languages: DML (R-like syntax) and PyDML (Python-like syntax).
SystemML was originally developed by IBM. It was open sourced in November 2015 and is currently undergoing incubation at the Apache Software Foundation.

In this short tutorial I'll show you how to get the latest version of SystemML and compile it from source.

Prerequisites

As the primary goal of SystemML is to distribute execution across a Hadoop cluster, I'll be using a VMware image of IBM's Hadoop distribution, BigInsights 4.2. Note that Hadoop is not a prerequisite for building SystemML: the platform supports a standalone execution mode and will happily run with a local file system. I am using BigInsights in this tutorial because I prefer to keep my training data and outputs in HDFS and to have everything nicely packaged in a single VM.

The BigInsights 4.2 image is freely available for download from www.ibm.com/analytics/us/en/technology/hadoop

The other two things that you need to have upfront are Git:

[root@rvm ~]# git --version
git version 1.7.1
[root@rvm ~]# 

and Java:

[root@rvm ~]# echo $JAVA_HOME
/usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64
[root@rvm ~]# java -version
openjdk version "1.8.0_45"
OpenJDK Runtime Environment (build 1.8.0_45-b13)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
[root@rvm ~]#

Git is not included in the BigInsights image, but you can easily fix that with a simple yum install git command.
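
If you'd rather check both prerequisites in one go, something along these lines should work. This is only a sketch: check_prereqs is a helper name I made up for this post, and it only tests that each command is on the PATH, not which version it is.

```shell
# check_prereqs is a hypothetical helper, not part of SystemML or BigInsights.
# It reports whether each named command can be found on the PATH.
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    # Non-zero exit status if at least one command was not found
    return "$missing"
}

check_prereqs git java || echo "Install the missing tools before continuing."
```

The function returns a non-zero status when anything is missing, so it can also be used as a guard at the top of a build script.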

Installing Maven

The Maven build manager is needed for building the SystemML project. You can check what the latest available version is at https://maven.apache.org

Get the latest version, unpack it, and place it in /usr/local.

[root@rvm ~]# wget http://mirror.catn.com/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
--2016-05-13 05:47:59--  http://mirror.catn.com/pub/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
Resolving mirror.catn.com... 87.124.126.49
Connecting to mirror.catn.com|87.124.126.49|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8491533 (8.1M) [application/x-gzip]
Saving to: `apache-maven-3.3.9-bin.tar.gz'

100%[===================================================================================================>] 8,491,533   2.08M/s   in 3.9s    

2016-05-13 05:48:03 (2.07 MB/s) - `apache-maven-3.3.9-bin.tar.gz' saved [8491533/8491533]

[root@rvm ~]# tar xzf apache-maven-3.3.9-bin.tar.gz 
[root@rvm ~]# mv apache-maven-3.3.9 /usr/local/
[root@rvm ~]#
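
Mirrors come and go, so the URL above may not work by the time you read this. As a fallback, the sketch below builds a download URL from the version number. It assumes the Apache archive keeps old Maven releases under archive.apache.org/dist/maven in the same directory layout as the mirror above; maven_archive_url is a name I made up, and you should check the URL in a browser before relying on it.

```shell
# maven_archive_url is a hypothetical helper. The archive.apache.org layout
# used here is an assumption; verify the printed URL before downloading.
maven_archive_url() {
    version="$1"
    major="${version%%.*}"   # "3.3.9" -> "3"
    echo "https://archive.apache.org/dist/maven/maven-${major}/${version}/binaries/apache-maven-${version}-bin.tar.gz"
}

MAVEN_VERSION=3.3.9
echo "Download URL: $(maven_archive_url "$MAVEN_VERSION")"
# Uncomment to actually download:
# wget "$(maven_archive_url "$MAVEN_VERSION")"
```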

Switch to the user that you want to build SystemML under.

[root@rvm ~]# su - hadoop
[hadoop@rvm ~]$

Append the following to that user's .bash_profile (replace 3.3.9 with the Maven version that you're using):

export M2_HOME=/usr/local/apache-maven-3.3.9
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Source the .bash_profile for the changes to take effect, then make sure that you can run Maven.

[hadoop@rvm ~]$ source ~/.bash_profile 
[hadoop@rvm ~]$ mvn -v
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/apache-maven-3.3.9
Java version: 1.8.0_45, vendor: Oracle Corporation
Java home: /usr/jdk64/java-1.8.0-openjdk-1.8.0.45-28.b13.el6_6.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"
[hadoop@rvm ~]$
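
If you expect to switch Maven versions later, the three export lines in .bash_profile can be written so that the version appears only once. The variant below is a sketch, not something the Maven installation requires: MAVEN_VERSION is a variable name I chose, and the case statement is there only to avoid adding the same folder to PATH every time the profile is sourced.

```shell
# MAVEN_VERSION is a hypothetical variable name; set it to the version you unpacked.
MAVEN_VERSION=3.3.9
export M2_HOME="/usr/local/apache-maven-${MAVEN_VERSION}"
export M2="$M2_HOME/bin"

# Prepend Maven's bin folder to PATH only if it is not already there
case ":$PATH:" in
    *":$M2:"*) ;;
    *) export PATH="$M2:$PATH" ;;
esac
```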

Get and build SystemML

Get the latest version of SystemML from the Apache Incubator.

[hadoop@rvm ~]$ git clone git://git.apache.org/incubator-systemml.git
Initialized empty Git repository in /home/hadoop/incubator-systemml/.git/
remote: Counting objects: 78572, done.
remote: Compressing objects: 100% (20108/20108), done.
remote: Total 78572 (delta 47008), reused 75179 (delta 43923)
Receiving objects: 100% (78572/78572), 178.46 MiB | 755 KiB/s, done.
Resolving deltas: 100% (47008/47008), done.
[hadoop@rvm ~]$ 

Use Maven to build the SystemML distributions.

[hadoop@rvm ~]$ cd incubator-systemml
[hadoop@rvm incubator-systemml]$ mvn clean package -P distribution
[INFO] Scanning for projects...
Downloading: https://repo.maven.apache.org/maven2/org/apache/apache/17/apache-17.pom
...
[INFO] Building jar: /home/hadoop/incubator-systemml/target/systemml-0.10.0-incubating-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:09 min
[INFO] Finished at: 2016-05-13T06:30:16-07:00
[INFO] Final Memory: 120M/1503M
[INFO] ------------------------------------------------------------------------
[hadoop@rvm incubator-systemml]$

If you want to run SystemML in standalone mode, create a new folder for it and unpack the newly built standalone archive inside.

[hadoop@rvm incubator-systemml]$ mkdir ~/systemml
[hadoop@rvm incubator-systemml]$ tar -xzf target/systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz -C ~/systemml --strip 1
[hadoop@rvm incubator-systemml]$

Or you can go straight for the distributed mode.

[hadoop@rvm incubator-systemml]$ mkdir ~/systemml-dist
[hadoop@rvm incubator-systemml]$ tar -xzf target/systemml-0.10.0-incubating-SNAPSHOT.tar.gz -C ~/systemml-dist --strip 1
[hadoop@rvm incubator-systemml]$ 
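
Both unpack steps follow the same pattern: create a target folder, then extract the archive into it with the top-level directory stripped. A small helper makes the pattern explicit and keeps the folder name in a single place, so the mkdir and tar commands cannot drift apart. This is a sketch: unpack_dist is a name I made up, and it uses --strip-components=1, which as far as I know is the full spelling of the --strip 1 option in GNU tar.

```shell
# unpack_dist is a hypothetical helper, not part of the SystemML build.
# Usage: unpack_dist <archive.tar.gz> <destination folder>
unpack_dist() {
    archive="$1"
    dest="$2"
    # Create the destination if needed, then extract without the top-level folder
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" --strip-components=1
}

# Example calls, using the archive names from the build output above:
# unpack_dist target/systemml-0.10.0-incubating-SNAPSHOT-standalone.tar.gz ~/systemml
# unpack_dist target/systemml-0.10.0-incubating-SNAPSHOT.tar.gz ~/systemml-dist
```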

Testing the installation

You can do a quick 'Hello world' example with the standalone version like this:

[hadoop@rvm incubator-systemml]$ cd /home/hadoop/systemml
[hadoop@rvm systemml]$ echo 'print("Hello world!");' > helloworld.dml
[hadoop@rvm systemml]$ ./runStandaloneSystemML.sh helloworld.dml 
16/05/13 08:30:26 INFO api.DMLScript: BEGIN DML run 05/13/2016 08:30:26
Hello world!
16/05/13 08:30:27 INFO api.DMLScript: SystemML Statistics:
Total execution time:		0.211 sec.
Number of executed MR Jobs:	0.

16/05/13 08:30:27 INFO api.DMLScript: END DML run 05/13/2016 08:30:27
[hadoop@rvm systemml]$ 
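
A slightly larger script can be run the same way. The sketch below writes a two-line DML script that builds a small random matrix and prints the sum of its cells, then runs it only if the standalone launcher is in the current folder. The file name sum.dml is my own choice, and I'm assuming the rand and sum built-ins behave as described in the DML language reference. The output will differ from run to run because the matrix is random.

```shell
# sum.dml is a hypothetical file name chosen for this example.
cat > sum.dml <<'EOF'
# Build a 3x3 matrix of random values between 0 and 1, then print the sum of its cells
X = rand(rows=3, cols=3, min=0, max=1);
print("Sum of cells: " + sum(X));
EOF

# Run the script only if the standalone launcher is in the current folder
if [ -x ./runStandaloneSystemML.sh ]; then
    ./runStandaloneSystemML.sh sum.dml
else
    echo "runStandaloneSystemML.sh not found; run this from the standalone folder"
fi
```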

For more complex examples, take a look at the SystemML documentation.