Installing Jupyter with the PySpark and R kernels for Spark development

This is a quick tutorial on installing Jupyter and setting up the PySpark and R (IRkernel) kernels for Spark development. The prerequisite for following it is a deployed Hadoop/Spark cluster with the relevant services up and running (e.g. HDFS, YARN, Hive, Spark).

In this tutorial I am using IBM's Hadoop distribution BigInsights 4.2, but technically this should work with any ODPi compliant distribution (e.g. Hortonworks).
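
Before you start, it is worth a quick sanity check that the core services respond from the node where Jupyter will run. The commands below are a minimal sketch, assuming a standard Hadoop client setup on that node; your distribution may require different paths or users.

[root@biginsights ~]# hdfs dfsadmin -report | head
...
[root@biginsights ~]# yarn node -list
...
[root@biginsights ~]#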

Installing Anaconda and setting up Jupyter

Start by downloading Anaconda 3 and running the installer. Accept the defaults, but set the destination to /opt/anaconda3.

[root@biginsights ~]# wget https://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
--2016-09-28 01:05:08--  https://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 54.221.225.221, 54.225.212.75, 54.225.68.13, ...
Connecting to repo.continuum.io (repo.continuum.io)|54.221.225.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 425991075 (406M) [application/octet-stream]
Saving to: ‘Anaconda3-4.1.1-Linux-x86_64.sh’

100%[=================================================================================>] 425,991,075 4.29MB/s   in 1m 48s 

2016-09-28 01:06:56 (3.76 MB/s) - ‘Anaconda3-4.1.1-Linux-x86_64.sh’ saved [425991075/425991075]

[root@biginsights ~]# chmod +x Anaconda3-4.1.1-Linux-x86_64.sh 
[root@biginsights ~]# ./Anaconda3-4.1.1-Linux-x86_64.sh 

Welcome to Anaconda3 4.1.1 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>> 

...

cryptography
A Python library which exposes cryptographic recipes and primitives.

Do you approve the license terms? [yes|no]
>>> yes

Anaconda3 will now be installed into this location:
/root/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda3] >>> /opt/anaconda3
PREFIX=/opt/anaconda3
installing: python-3.5.2-0 ...
...
installing: conda-env-2.5.1-py35_0 ...
Python 3.5.2 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes

Prepending PATH=/opt/anaconda3/bin to PATH in /root/.bashrc
A backup will be made to: /root/.bashrc-anaconda3.bak

For this change to become active, you have to open a new terminal.

Thank you for installing Anaconda3!

Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org

[root@biginsights ~]# 

Reload your bash profile and verify that you are indeed using Python 3.

[root@biginsights ~]# source ~/.bashrc
[root@biginsights ~]# python
Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
[root@biginsights ~]# 

Optionally, verify that you can start the notebook server. Switch to a non-privileged user.

[root@biginsights ~]# su - nick
[nick@biginsights ~]$

Add Anaconda to the PATH by exporting a new value for the PATH variable (or by editing your ~/.bash_profile and loading it).

[nick@biginsights ~]$ export PATH=/opt/anaconda3/bin:$PATH
[nick@biginsights ~]$
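
If you want the change to persist across sessions, one option (a minimal sketch) is to append the same export to your ~/.bash_profile and reload it:

[nick@biginsights ~]$ echo 'export PATH=/opt/anaconda3/bin:$PATH' >> ~/.bash_profile
[nick@biginsights ~]$ source ~/.bash_profile
[nick@biginsights ~]$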

Start the Jupyter notebook server by running jupyter notebook with the host IP for your environment.

[nick@biginsights ~]$ jupyter notebook --ip="192.168.153.131" --no-browser
[W 01:38:03.248 NotebookApp] Unrecognized JSON config file version, assuming version 1
[I 01:38:03.427 NotebookApp] [nb_conda_kernels] enabled, 1 kernels found
[I 01:38:03.435 NotebookApp] Writing notebook server cookie secret to /home/manchev/.local/share/jupyter/runtime/notebook_cookie_secret
[I 01:38:03.701 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 01:38:03.701 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 01:38:03.704 NotebookApp] [nb_conda] enabled
[I 01:38:03.736 NotebookApp] [nb_anacondacloud] enabled
[I 01:38:03.740 NotebookApp] Serving notebooks from local directory: /home/manchev
[I 01:38:03.740 NotebookApp] 0 active kernels 
[I 01:38:03.740 NotebookApp] The Jupyter Notebook is running at: http://192.168.153.131:8888/
[I 01:38:03.740 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Open a web browser and point it to the server's IP address at port 8888. You should be able to see the main Jupyter page.
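
If you are working from a terminal-only session, a quick alternative check (assuming curl is available) is to request the page headers and confirm the server responds:

[nick@biginsights ~]$ curl -I http://192.168.153.131:8888/
...
[nick@biginsights ~]$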

Jupyter Home Page

Installing the kernels

Check what kernels are currently available to Jupyter.

[root@biginsights ~]# jupyter kernelspec list
Available kernels:
  python3    /opt/anaconda3/lib/python3.5/site-packages/ipykernel/resources
[root@biginsights ~]# 

Create a directory to host the PySpark kernel.

[root@biginsights ~]# mkdir -p /usr/local/share/jupyter/kernels/pyspark
[root@biginsights ~]#

Create a kernel file named kernel.json with the following content and put it in /usr/local/share/jupyter/kernels/pyspark/. Don't forget to replace values for SPARK_HOME, PYTHONPATH, and PYTHONSTARTUP with values matching your environment.

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [ "/opt/anaconda3/bin/python", "-m", "ipykernel", "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": "/usr/iop/current/spark-client",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python3",
    "PYTHONPATH": "/usr/iop/current/spark-client/python/:/usr/iop/current/spark-client/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/usr/iop/current/spark-client/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
  }
}
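
A kernel spec is plain JSON, and a stray comma or quote can stop the kernel from loading without an obvious error. A quick way to validate the file (using Python's built-in json.tool module) is:

[root@biginsights ~]# python -m json.tool /usr/local/share/jupyter/kernels/pyspark/kernel.json
...
[root@biginsights ~]#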

Check that Jupyter can now pick up the new kernel.

[root@biginsights ~]# jupyter kernelspec list
Available kernels:
  python3    /opt/anaconda3/lib/python3.5/site-packages/ipykernel/resources
  pyspark    /usr/local/share/jupyter/kernels/pyspark
[root@biginsights ~]# 

Create another directory for the R kernel.

[root@biginsights ~]# mkdir -p /usr/local/share/jupyter/kernels/r
[root@biginsights ~]#

Place a kernel.json file inside with the following content:

{
 "argv": ["R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
 "display_name":"R",
 "language":"R"
}

Verify that the kernel now appears in the list of available kernels:

[root@biginsights ~]# jupyter kernelspec list
Available kernels:
  python3    /opt/anaconda3/lib/python3.5/site-packages/ipykernel/resources
  pyspark    /usr/local/share/jupyter/kernels/pyspark
  r          /usr/local/share/jupyter/kernels/r
[root@biginsights ~]# 

As the R kernel (IRkernel) is not available from the CRAN repositories, it has to be compiled from source. Start by installing the following additional packages (if they are not already part of your OS installation):

[root@biginsights ~]# yum install -y openssl-devel openssl libcurl-devel libssh2-devel 
...
[root@biginsights ~]#

Create links to libssl.so.1.0.0 and libcrypto.so.1.0.0 under /usr/lib64 to avoid errors like "libssl.so.10: cannot open shared object file" during compilation.

[root@biginsights ~]# ln -s /opt/anaconda3/lib/libssl.so.1.0.0 /usr/lib64/libssl.so.1.0.0
[root@biginsights ~]# ln -s /opt/anaconda3/lib/libcrypto.so.1.0.0 /usr/lib64/libcrypto.so.1.0.0
[root@biginsights ~]# 

Start R and install the git2r package.

[root@biginsights ~]# R

R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages('git2r')
...
>

Add the following packages, making sure that they get compiled and installed correctly:

> install.packages('devtools')
...
> install.packages('repr')
...
> install.packages('IRdisplay')
...
> install.packages('crayon')
...
> install.packages('pbdZMQ')
...
>

Now you can use devtools to get and compile IRkernel.

> devtools::install_github('IRkernel/IRkernel')
…
>

If you are planning to use SparkR, now is the time to get and install the SparkR package (make sure to use the package version that matches your version of Spark; a quick way to check it is shown after the install command below).

> devtools::install_github('apache/spark@v1.6.1', subdir='R/pkg')
…
>
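
If you are unsure which tag matches your cluster, you can check the installed Spark version from a shell (the path below is the one used throughout this environment; adjust it for yours) and pick the corresponding release tag:

[root@biginsights ~]# /usr/iop/current/spark-client/bin/spark-submit --version
...
[root@biginsights ~]#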

Start (restart) your notebook server and verify that the new kernels are available.
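
For example (assuming the server is still running in the foreground from earlier), stop it with Control-C, confirm that all three kernels are registered, and start it again:

[nick@biginsights ~]$ jupyter kernelspec list
...
[nick@biginsights ~]$ jupyter notebook --ip="192.168.153.131" --no-browser
...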

Testing the kernels

Create an HDFS directory for your non-privileged user; it will hold some test data.

[root@biginsights ~]# su - hdfs
[hdfs@biginsights ~]$ hadoop fs -mkdir /user/nick
[hdfs@biginsights ~]$ hadoop fs -chown nick:nick /user/nick
[hdfs@biginsights ~]$ hadoop fs -chmod 775 /user/nick
[hdfs@biginsights ~]$ 

Get some test data and upload it to the HDFS directory.

[root@biginsights ~]# su - nick
[nick@biginsights ~]$ wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv
...
[nick@biginsights ~]$ hadoop fs -put cars.csv .
[nick@biginsights ~]$
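
Optionally, confirm that the file landed in your HDFS home directory and that its contents look sane:

[nick@biginsights ~]$ hadoop fs -ls cars.csv
...
[nick@biginsights ~]$ hadoop fs -cat cars.csv | head -3
...
[nick@biginsights ~]$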

Open the Jupyter web page (IP:8888) and select New.

Jupyter new notebook menu

Select PySpark and read some data from the test file. Put the following code in a cell and run it, making sure that it executes successfully.

lines = sc.textFile("cars.csv")
lines.count()

Next, you can test R and SparkR by parsing the CSV file and loading the data into a SparkR DataFrame.

If the CSV data source is not already part of your Spark installation, you'll have to download it and add it to your Spark libraries.

[root@biginsights ~]# wget http://central.maven.org/maven2/com/databricks/spark-csv_2.11/1.5.0/spark-csv_2.11-1.5.0.jar
...
[root@biginsights ~]# mv spark-csv_2.11-1.5.0.jar /usr/iop/current/spark-client/lib
[root@biginsights ~]# 
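
You can confirm the jar is in place with a quick listing:

[root@biginsights ~]# ls -l /usr/iop/current/spark-client/lib/ | grep spark-csv
...
[root@biginsights ~]#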

Next, create a new notebook using the R kernel. Put the following code inside (mind the correct spark-csv version) and run it.

Sys.setenv(SPARK_HOME='/usr/iop/current/spark-client')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))

library(SparkR)

sc <- sparkR.init(master='yarn-client', sparkPackages="com.databricks:spark-csv_2.11:1.5.0")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
head(df)

Jupyter notebook with the test R code

Congratulations, you now have a PySpark/R/SparkR work environment based on Jupyter.