How can I ship compiled C modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script, for example:
from pyspark import SparkContext

# pyFiles ships the listed files to every worker node and adds them to the Python path
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
If you can package your module into a .zip or .egg file, you should be able to list it in pyFiles when constructing your SparkContext (or add it later through sc.addPyFile).
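For instance (a minimal sketch; mylib.zip is a hypothetical archive with your module at its top level):

from pyspark import SparkContext

sc = SparkContext("local", "App Name")
# ship the hypothetical zipped package to every node; worker code can then
# `import mylib` as usual
sc.addPyFile('mylib.zip')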
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
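The resulting egg can then be shipped the same way (a sketch; the egg filename is hypothetical and will vary with your library version, Python version, and platform):

# built beforehand with `python setup.py bdist_egg` inside the project directory;
# setuptools drops the egg into the project's dist/ folder
sc.addPyFile('dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg')

Keep in mind that an egg containing a compiled C extension is platform-specific, so build it on a machine matching your workers' architecture and Python version.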
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
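If you take the per-machine route, a loop over SSH can push the install to every worker (a hypothetical sketch; the hostnames are placeholders, and it assumes passwordless SSH and sudo access):

import subprocess

# hypothetical worker hostnames; adjust to your cluster
workers = ['worker1', 'worker2', 'worker3']
for host in workers:
    # install the compiled module on each node so every executor can import it
    subprocess.check_call(['ssh', host, 'sudo pip install python-Levenshtein'])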