follyroof - 4 months ago
Python Question

Python Dependency Management on EMR

I'm sending code to Amazon's EMR via the mrjob/boto modules. I have some external Python dependencies (e.g. numpy, boto) and currently have to download the source of each package and send it over as a tarball in the "python_archives" field of the mrjob.conf file.

This makes dependency management messier than I would like, and I am wondering whether I can somehow use the same requirements.txt file I use for my virtualenv setup to bootstrap the EMR instances with my dependencies. Is it possible to set up virtualenvs on EMR instances and do something like:

pip install -r requirements.txt

as I would locally?


One way to accomplish this is with a bootstrap action. EMR runs bootstrap actions on the cluster nodes at startup, and you can use them to run shell scripts.

For example, a Python setup file can generate such a shell script from requirements.txt, something like:

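# Turn requirements.txt into a shell script for the bootstrap action.
# NOTE: "" is a placeholder; fill in the path of the script to write.
with open("requirements.txt") as requirements, open("", "w") as shell_script:
    # -y keeps apt-get from waiting for confirmation on an unattended node.
    shell_script.write("sudo apt-get install -y python-pip\n")
    for line in requirements:
        if line.strip():  # skip blank lines
            # -I installs the package even if a copy is already present.
            shell_script.write("sudo pip install -I " + line.strip() + "\n")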

You can then run the generated shell script as the bootstrap action, so requirements.txt itself never needs to be uploaded.
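
A slightly more defensive version of the same idea might look like the sketch below. It rests on a few assumptions that are not from the answer above: the function names and the output name bootstrap.sh are hypothetical, the node image is assumed to be Debian-based (hence apt-get), and Python 3 is assumed for shlex.quote. It skips blank lines and comment lines, and quotes each requirement so that a specifier such as "boto>=2.0" is not read by the shell as an output redirect. Option lines such as "-e ..." are not handled.

```python
import shlex


def build_bootstrap_script(requirements_text):
    """Return the text of a shell script that installs each requirement."""
    commands = [
        "#!/bin/bash",
        "set -e",  # stop the bootstrap at the first failing command
        "sudo apt-get install -y python-pip",  # assumes a Debian-based image
    ]
    for raw_line in requirements_text.splitlines():
        requirement = raw_line.strip()
        # Skip blank lines and whole-line comments.
        if not requirement or requirement.startswith("#"):
            continue
        # Quote so characters like '>' or '<' are not interpreted by the shell.
        commands.append("sudo pip install -I " + shlex.quote(requirement))
    return "\n".join(commands) + "\n"


def write_bootstrap_script(requirements_path="requirements.txt",
                           script_path="bootstrap.sh"):
    """Read a requirements file and write the generated script next to it.

    Both default paths are hypothetical; pass your own.
    """
    with open(requirements_path) as requirements_file:
        script_text = build_bootstrap_script(requirements_file.read())
    with open(script_path, "w") as script_file:
        script_file.write(script_text)
```

Calling write_bootstrap_script() locally before launching the job produces the script, which you would then register as the bootstrap action in whatever way your mrjob or boto setup expects.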