I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (e.g. numpy, boto, etc.) and currently have to download the source of each Python package and send them over as a tarball in the "python_archives" field of the mrjob.conf file.
This makes dependency management messier than I would like, and I'm wondering if I can somehow use the same requirements.txt file I use for my virtualenv setup to bootstrap the EMR instance with my dependencies. Is it possible to set up virtualenvs on EMR instances and do something like:
pip install -r requirements.txt
One way to accomplish this is to use a bootstrap action. Bootstrap actions let you run shell scripts on the cluster's nodes when they start up, before your job runs.
If you have a Python setup file that does something like:
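# Generate pip.sh: one pip install command per line of requirements.txt
with open("requirements.txt", "r") as requirements, open("pip.sh", "w") as shell_script:
    shell_script.write("#!/bin/bash\n")
    # -y answers apt-get's confirmation prompt, since nobody is there to type "yes"
    shell_script.write("sudo apt-get install -y python-pip\n")
    for line in requirements:
        if line.strip():  # skip blank lines
            shell_script.write("sudo pip install -I " + line.strip() + "\n")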
Then you can just run the generated pip.sh as the bootstrap action, without needing to upload your requirements.txt separately.
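One thing to watch for with the approach above: a requirement such as boto>=2.0 contains a ">" character, which the shell would treat as an output redirect when pip.sh runs. Here is a rough sketch of a slightly more defensive generator that quotes each requirement and skips comments. The function name build_bootstrap_script is made up for illustration, and skipping option lines (those starting with "-") is my own assumption about what you would want, not something mrjob or EMR requires.

```python
import shlex


def build_bootstrap_script(requirement_lines):
    """Return the text of a shell script that pip-installs each requirement.

    Hypothetical helper for illustration; not part of mrjob or boto.
    """
    commands = ["#!/bin/bash", "sudo apt-get install -y python-pip"]
    for raw in requirement_lines:
        req = raw.strip()
        # Skip blank lines, comments, and pip option lines such as "-e ..." or "-r ...".
        if not req or req.startswith("#") or req.startswith("-"):
            continue
        # shlex.quote keeps characters like ">" or "<" from being read as redirects.
        commands.append("sudo pip install -I " + shlex.quote(req))
    return "\n".join(commands) + "\n"
```

To use it, you could pass an open requirements.txt file object (iterating a file yields its lines) and write the returned string to pip.sh, then point the bootstrap action at that file as before.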