Setting up PySpark for Jupyter Notebook – with Docker

When you google “How to run PySpark on Jupyter”, you get so many tutorials showing so many different ways to configure an IPython notebook for PySpark that it gets a little confusing. So when I finally figured out a way to do it, with the help of multiple websites, I thought I would post it here as a blog to help my fellow data geeks!

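Prerequisites: You only need Docker installed on your machine (see Step 1). The jupyter/pyspark-notebook image used below already bundles Jupyter Notebook and PySpark, so a local installation of either is not required.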

Step 1: Download Docker (for your Mac or Windows system). You can use the following link to download Docker for Mac:

https://docs.docker.com/docker-for-mac/install/

Now, you may have questions like: what is a Docker container, and how does it differ from a Virtual Machine? As a brief introduction, a Docker container is a form of virtualization. While Virtual Machines (VMs) virtualize hardware, splitting one physical machine into several VMs that each run their own operating system, Docker containers virtualize the operating system: they share the host's kernel and split the OS into isolated compartments, each running a containerized application.

To learn more about Docker containers and how they differ from Virtual Machines, you can start with the following links and, if needed, continue your research from there.

https://www.docker.com/what-container

https://www.sdxcentral.com/cloud/containers/definitions/what-is-docker-container-open-source-project/

Step 2: Open your command prompt and type the following command. It starts a container with the Notebook server listening for HTTP connections on port 8888, with a randomly generated authentication token configured.

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

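A few pointers on what the above command means:

“docker run” : Runs a command in a new container

-it : Keeps the session interactive and attaches a terminal, so the startup log is shown in your command prompt

--rm : Automatically removes the container when it exits (default value is false)

-p 8888:8888 : Publishes a container port to the host, in the form <host port>:<container port>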
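To make the port mapping concrete, here is a minimal Python sketch that assembles the same command with a different host port, which may help if 8888 is already taken on your machine. The helper name build_run_command and the alternative port 8889 are my own illustrative assumptions, not part of the original setup.

```python
import shlex

IMAGE = "jupyter/pyspark-notebook"  # image used in Step 2


def build_run_command(host_port=8888, container_port=8888):
    """Return the `docker run` argument list for the given port mapping.

    Hypothetical helper, for illustration only.
    """
    return [
        "docker", "run",
        "-it",   # interactive terminal, so the startup log is visible
        "--rm",  # remove the container automatically when it exits
        "-p", f"{host_port}:{container_port}",  # <host port>:<container port>
        IMAGE,
    ]


# Map host port 8889 (assumed free) to the container's port 8888.
command = build_run_command(host_port=8889)
print(shlex.join(command))
```

With this mapping the server inside the container still listens on 8888, but you would reach it through port 8889 on your machine, so the port in the URL from the startup log would likely need to be changed to 8889 before pasting it into the browser.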

Step 3: After you run the above command, you will see a URL displayed in the notebook startup log messages. Copy the URL and paste it into a new browser tab; the Jupyter Notebook interface will be displayed.
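If the startup log is long and the URL is hard to spot, a small script can pull it out for you. The sketch below is only an illustration: the sample log text, the token value and the helper name find_notebook_url are made up, and the exact wording of the real log may differ between image versions.

```python
import re

# Hypothetical excerpt of the startup log; the real token is random.
SAMPLE_LOG = """\
    Or copy and paste one of these URLs:
        http://0123456789ab:8888/?token=abc123def456
     or http://127.0.0.1:8888/?token=abc123def456
"""

# Matches URLs that carry a hexadecimal ?token=... query parameter.
TOKEN_URL = re.compile(r"https?://[^\s?]+\?token=[0-9a-f]+")


def find_notebook_url(log_text):
    """Return a tokenized URL from the log, preferring one that points at localhost."""
    urls = TOKEN_URL.findall(log_text)
    for url in urls:
        if "127.0.0.1" in url or "localhost" in url:
            return url
    # Fall back to the first URL found, or None if there is none.
    return urls[0] if urls else None


print(find_notebook_url(SAMPLE_LOG))
```

The localhost URL is preferred here because the other one uses the container's hostname, which your browser on the host machine usually cannot resolve.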

Step 4: Open a Python 2 or Python 3 notebook

Step 5: Create a SparkContext configured for local mode by typing the following:

import pyspark

sc = pyspark.SparkContext('local[*]')

Step 6: Do something to prove that the SparkContext works. For example:

rdd = sc.parallelize(range(1000))

rdd.takeSample(False, 5)
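Since takeSample returns random values, it can be hard to tell at a glance whether the result is right. As a rough reference, the plain-Python sketch below (no Spark needed) computes values you could compare against in the notebook. The variable names are my own, and the comparison with rdd.count() and rdd.sum() is a suggestion rather than part of the original steps.

```python
import random

data = range(1000)  # same input as sc.parallelize(range(1000))

expected_count = len(data)  # compare with rdd.count()
expected_sum = sum(data)    # compare with rdd.sum()

# Plain-Python analogue of rdd.takeSample(False, 5):
# False means sampling without replacement, 5 is the sample size,
# so the result should be 5 distinct values between 0 and 999.
sample = random.sample(data, 5)

print(expected_count, expected_sum)
print(sample)
```

If the RDD reports the same count and sum, and takeSample gives back five distinct numbers in that range, the SparkContext is most likely working as intended.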

To release the Docker container port after completing your project with PySpark:

Step 1: Download your notebook in whichever format you need.

Step 2: Get the container ID by typing the following command in your command prompt:

docker ps -a

Step 3: Take the container ID shown by the command in Step 2 and substitute it for <container_ID from step 2> in the following command to stop the container and release its port:

docker kill <container_ID from step 2>
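When several containers are listed, picking the right ID by eye can be error-prone. The sketch below shows one possible way to find it from text output. It assumes you ran docker ps -a with a --format option that prints the ID, image and status separated by spaces. The sample output, the container IDs and the helper name running_container_ids are all invented for illustration.

```python
IMAGE = "jupyter/pyspark-notebook"

# Hypothetical output of:
#   docker ps -a --format "{{.ID}} {{.Image}} {{.Status}}"
SAMPLE_PS_OUTPUT = """\
3f2a9c1b7d4e jupyter/pyspark-notebook Up 12 minutes
a1b2c3d4e5f6 postgres:15 Exited (0) 2 hours ago
"""


def running_container_ids(ps_output, image=IMAGE):
    """Return IDs of containers created from `image` whose status starts with 'Up'."""
    ids = []
    for line in ps_output.splitlines():
        # ID and image contain no spaces; the rest of the line is the status.
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        container_id, container_image, status = parts
        if container_image == image and status.startswith("Up"):
            ids.append(container_id)
    return ids


# Print the kill command for each matching container instead of running it.
for container_id in running_container_ids(SAMPLE_PS_OUTPUT):
    print(f"docker kill {container_id}")
```

The script only prints the command, so you can check the ID before running it yourself in the command prompt.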

Hope this helps !!!

Sources:

http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

http://stackoverflow.com/questions/24810458/docker-not-releasing-ports

http://stackoverflow.com/questions/24993704/docker-error-cannot-start-container-port-has-already-been-allocated

https://docs.docker.com/engine/reference/commandline/run/#options