JupyterHub — Open source data analysis hub inside the browser

How I set up Jupyter notebooks for big data trainings

Radek Stankiewicz

--

Problem

Using the CLI is hard (for most people).

Asking someone to start PuTTY, log in, enter a few commands, run a program, and edit a file in vim is a near-impossible task.

What is an everyday environment for a developer is magic for an analyst.

CLI

My objective when training analysts is to teach them three things:

  • manage datasets
  • run interactive analysis using Hive
  • write and run Spark code

Analyst environment

The analyst environment is simple: some kind of SQL client such as SQL Developer or Teradata SQL Assistant, RStudio, and Excel. That's it.

My requirement was to introduce an easy and reliable environment that does not force frequent switches to the CLI and gives analysts as smooth a transition from their current working environment as possible.

“Could you also install…?”

One more thing: asking attendees to install anything before a training is really painful. The Hadoop sandbox requires 8 GB of RAM, downloading the VM image takes forever, and corporate laptops in most cases restrict what users can install. The only medium you can rely on is the browser.

Solution

There are two nice projects that make this kind of browser-based work possible: Apache Zeppelin and Jupyter. Both offer a simple way to write a notebook that executes code behind the browser, on the server side. The problem is that the multi-user (hub) flavor of Zeppelin is paid, while Jupyter has its Hub version, JupyterHub, which I chose to explore.

I found a very nice presentation that explains how JupyterHub works internally.

In a few words: JupyterHub is a server that spawns a Jupyter notebook for each user and proxies communication between the browser and that notebook.

The Hub can spawn a plain Jupyter process, or it can spawn a Docker container. I chose the second option: it gives me better control over resources, and in the future it lets me use Docker Swarm with smaller, cheaper machines.

Image preparation

As a base image, I used the Jupyter project's single-user notebook image. On top of that, I installed Spark 2.0 and the impyla client for Hive queries. The modified image is shared on GitHub and in the Docker registry.
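
The exact Dockerfile isn't reproduced here, but judging by the image name it builds on jupyter/all-spark-notebook, which already bundles Spark. A minimal sketch along those lines, with the base image and package list as assumptions:

# rough sketch, not the exact published Dockerfile
FROM jupyter/all-spark-notebook

USER root
# build dependencies for impyla's SASL support (assumes the Debian/Ubuntu base of docker-stacks)
RUN apt-get update && \
    apt-get install -y --no-install-recommends g++ libsasl2-dev && \
    rm -rf /var/lib/apt/lists/*

USER $NB_USER
# impyla lets notebooks talk to HiveServer2
RUN pip install impyla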

Hub installation on CentOS

Install python3 with pip:

yum install python34
curl https://bootstrap.pypa.io/get-pip.py | python3.4
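
On a stock CentOS 7 installation the python34 package comes from EPEL, so if yum cannot find it, enable that repository first (an assumption about your repo setup):

yum install epel-release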

Install docker:

tee /etc/yum.repos.d/docker.repo <<-'EOF'
[dockerrepo]
name=Docker Repository
baseurl=https://yum.dockerproject.org/repo/main/centos/7/
enabled=1
gpgcheck=1
gpgkey=https://yum.dockerproject.org/gpg
EOF
yum install docker-engine
systemctl enable docker.service
systemctl start docker
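
The notebook image is large, so it's worth pre-pulling it once; otherwise the first user spawn may hit the spawner's startup timeout:

docker pull stankiewicz/all-spark-notebook-with-hive-client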

Install Hub:

yum install nodejs
pip3 install jupyterhub
npm install -g configurable-http-proxy
pip3 install notebook
pip3 install DockerSpawner

Hub Configuration

Generate empty configuration:

jupyterhub --generate-config

Modify the configuration to use the Docker spawner:

# directory inside the container where user work is stored (the docker-stacks default)
notebook_dir = '/home/jovyan/work'
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.container_image = 'stankiewicz/all-spark-notebook-with-hive-client'
c.DockerSpawner.notebook_dir = notebook_dir
c.DockerSpawner.volumes = { 'jupyterhub-user-{username}': notebook_dir }

Set the IP under which the Hub is reachable from inside the Docker containers. The default is localhost, which will not work here:

c.JupyterHub.hub_ip = '10.0.1.4'

Start JupyterHub pointing to new configuration:

jupyterhub -f jupyterhub_config.py

Testing

Create a new account, open the website (http://ip:8000), and log in with the newly created credentials.
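
By default JupyterHub authenticates against the system's user database (PAM), so creating an account here just means adding a Linux user; trainee1 below is an example name:

useradd trainee1
passwd trainee1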

I set up the Hub on a machine with 8 CPUs and 16 GB of RAM and gave access to 8 users. It worked without any issues. From a usability perspective, users easily switched from their current IDEs.
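
For a flavor of what the users ran, a Hive query from a notebook cell using impyla looks roughly like this (host, port, and table name are placeholders for your cluster):

from impala.dbapi import connect

# connect to HiveServer2; 10000 is its default port
conn = connect(host='hadoop-master', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM sample_table LIMIT 10')
print(cursor.fetchall())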

What can be done better

  1. Currently, Spark runs in local mode. To schedule tasks on YARN, I need to pass the YARN_CONF_DIR variable to the Spark process and make that directory visible inside the Docker container (see the sketch after this list).
  2. JupyterHub should be hidden behind an SSL proxy.
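
For the first point, a possible DockerSpawner configuration; the /etc/hadoop/conf path is an assumption about where the YARN client configuration lives on the host:

# mount the host's YARN client configuration into every container
c.DockerSpawner.volumes = {
    'jupyterhub-user-{username}': notebook_dir,  # per-user work dir, as before
    '/etc/hadoop/conf': '/etc/hadoop/conf',      # host path is an assumption
}
# tell Spark inside the container where to find it
c.DockerSpawner.environment = { 'YARN_CONF_DIR': '/etc/hadoop/conf' }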

--


Strategic Cloud Engineer at Google Warsaw - Helping customers solve their biggest data & analytics challenges using the best of Google.