Monday 23 February 2015

Apache Spark and IPython Notebook - simple demo

I've been playing around with Apache Spark for a while now, mainly using Scala. A couple of my colleagues are interested in learning about Spark as well, but they're data scientists, not developers, and they are more comfortable using Python. So far so good: Spark has a nifty Python shell and a comprehensive Python API, after all.

But what my colleagues really like is nice interactive tools without too much command-line voodoo. And they're already using IPython Notebook to get all this goodness for their Python data science work.

Happily, it turns out you can use IPython Notebook with a local Apache Spark installation, so I wrote a quick demo notebook to illustrate how you can use Spark to do the traditional word count inside a notebook.
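For anyone who hasn't met the word count before, here is a rough sketch of the logic in plain Python, with no Spark needed to run it. It is not the code from the demo notebook, just an illustration of the usual three steps; the function name `word_count` and the sample lines are made up for this example. The comments note the PySpark RDD call each step usually corresponds to.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines (hypothetical helper)."""
    # Step 1, like RDD.flatMap: split each line into words and flatten
    # the results into a single stream of words.
    words = chain.from_iterable(line.split() for line in lines)

    # Step 2, like RDD.map: pair every word with an initial count of 1.
    pairs = ((word, 1) for word in words)

    # Step 3, like RDD.reduceByKey: add up the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Made-up sample data standing in for the lines of a text file.
sample_lines = ["to be or not to be", "that is the question"]
counts = word_count(sample_lines)

# Print the most frequent words first, ties broken alphabetically.
for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, n)
```

In a PySpark session, where `sc` is the SparkContext, the same pipeline is typically written as `sc.textFile(path).flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, with `path` pointing at whatever text file you want to count.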

The repo is on Bitbucket, so you can do a git clone or download the code as a zip file. It's just a single notebook, a data folder containing a couple of text files, and a sample shell script for starting the notebook on Linux.

So if you're new to Spark with IPython Notebook, feel free to try it out.