Saturday, 1 August 2015

Get Apache Spark running without Hadoop

Summary: If you want to play around with Spark without setting up a Hadoop installation/cluster, download a Spark package that is built against a specific Hadoop version, i.e. pick packages like "Pre-built for Hadoop 2.6 and later", not the one that says "Pre-built with user provided Hadoop...".
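For example, a minimal sketch on Linux/OS X (the version numbers and URL are illustrative, assuming the release that was current at the time; pick whatever the download page actually offers):

# Grab a package that bundles the Hadoop libraries.
wget https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
tar -xzf spark-1.4.1-bin-hadoop2.6.tgz
cd spark-1.4.1-bin-hadoop2.6

# Start a local spark-shell; no Hadoop cluster required.
./bin/spark-shell --master local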

Details
I downloaded Apache Spark from here, unzipped it and tried to run it like so:
.\bin\spark-shell --master local
And here's what I got (exception edited for brevity):
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
...

Googling turned up suggestions like installing Hadoop, but I was pretty sure I should be able to run Spark without the whole Hadoop infrastructure. I finally realised that I had downloaded the "Spark without Hadoop" package, which only works if you point it at a Hadoop installation you already have. The reason is that although Spark doesn't need a Hadoop cluster to work, it does need some Hadoop libraries on its classpath, which is exactly what the ClassLoader is complaining about in the stack trace above.
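(For completeness: if you do want the "without Hadoop" build, Spark's notes on the "Hadoop free" build say to point it at your own Hadoop jars. Roughly, in conf/spark-env.sh, assuming the hadoop command is on your PATH:

# conf/spark-env.sh of a "without Hadoop" Spark build (Linux/OS X sketch):
# put all of your Hadoop installation's jars on Spark's classpath.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

That route needs a working Hadoop installation, which is exactly what I was trying to avoid.)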
So I downloaded a Spark distribution pre-built with Hadoop and voila!
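A quick way to check that the new package actually works is the bundled SparkPi example (a sketch; run-example ships in the distribution's bin directory):

# Sanity check (PowerShell, matching the spark-shell invocation above):
# SparkPi estimates pi using local threads, no cluster needed.
.\bin\run-example SparkPi 10

It should print a line like "Pi is roughly 3.14..." and exit cleanly.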

