W.P. McNeill - 29 days ago
Python Question

How do I install pyspark for use in standalone scripts?

I'm trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quickstart examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depended on Spark?

Answer

You can set the PYTHONPATH manually as you suggest, and this may be useful to you when testing stand-alone non-interactive scripts on a local installation.
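For example, here is a minimal sketch of doing that from inside the script itself rather than in the shell environment; the install path below is a placeholder, and the exact py4j zip name varies by Spark version:

    import glob
    import os
    import sys

    # Hypothetical install location - point this at your actual Spark directory.
    spark_home = "/opt/spark-1.0.2-bin-hadoop2"
    os.environ.setdefault("SPARK_HOME", spark_home)  # pyspark uses this to launch the JVM gateway

    # Put the pyspark package and the bundled py4j dependency on the module search path.
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

    import pyspark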

However, (py)spark is all about distributing your jobs to nodes on clusters. Each cluster has a configuration defining a manager and many parameters; the details of setting this up are described in Spark's cluster documentation, and include a simple local cluster (this may be useful for testing functionality).
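As a concrete illustration, here is a sketch of a script that runs against the built-in local master, which needs no cluster manager at all; the app name and thread count are arbitrary:

    from pyspark import SparkConf, SparkContext

    # "local[2]" runs Spark in-process with 2 worker threads - handy for testing.
    conf = SparkConf().setMaster("local[2]").setAppName("FunctionalityTest")
    sc = SparkContext(conf=conf)

    print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect())
    sc.stop()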

In production, you will submit tasks to Spark via spark-submit, which distributes your code to the cluster nodes and sets up the context for it to run in on those nodes. You do, however, need to make sure that the Python installations on the nodes have all the required dependencies (the recommended way), or that the dependencies are passed along with your code (I don't know how that works).
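For what it's worth, spark-submit does accept a --py-files option for shipping .py, .zip, or .egg dependencies alongside your job. A rough sketch of a submitted driver follows, with the submit command shown in the comments; the job name, master URL, and file names are placeholders:

    # my_job.py - hypothetical driver script to be launched with spark-submit
    from pyspark import SparkContext

    # No master is hard-coded here; spark-submit supplies it on the command line.
    sc = SparkContext(appName="MyStandaloneJob")
    print(sc.parallelize(range(1000)).sum())
    sc.stop()

    # Submitted from the shell with something like:
    #   $SPARK_HOME/bin/spark-submit \
    #       --master spark://master-host:7077 \
    #       --py-files dependencies.zip \
    #       my_job.py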