I have a recurring problem with my hadoop cluster in that occasionally a functioning code stops seeing python modules that are in the proper location. I'm looking for tips from someone who might have faced the same problem.
When I first started programming and a code stopped working I asked a question here on SO and someone told me to just go to bed and in the morning it should work, or some other "you're a dummy, you must have changed something" kind of comment.
I run the code several times, it works, I go to sleep, in the morning I try to run it again and it fails. Sometimes I kill jobs with CTRL+C, and sometimes I use CTRL+Z. But this just takes up resources and doesn't cause any other issue besides that - the code still runs. I have yet to have see this problem right after the code works. This usually happens the morning after, when I come into work after the code worked when I left 10 hours ago. Restarting the cluster typically solves the issue
I'm currently checking to see if the cluster restarts itself for some reason, or if some part of it is failing, but so far the ambari screens show everything green. I'm not sure if there is some automated maintenance or something that is known to screw things up.
Still working my way through the elephant book, sorry if this topic is clearly addressed on page XXXX, I just haven't made it to that page yet.
I looked through all the error logs, but the only meaningful thing I see is in stderr:
File "/data5/hadoop/yarn/local/usercache/melvyn/appcache/application_1470668235545_0029/container_e80_1470668235545_0029_01_000002/format_text.py", line 3, in <module>
from formatting_functions import *
ImportError: No module named formatting_functions
So we solved the problem. The issue is particular to our set up. We have all of our datanodes nfs mounted. Occasionally a node fails, and someone has to bring it back up and remount it.
Our script specifies the path to libraries like:'
pig -Dmapred.child.env="PYTHONPATH=$path_to_mnt$hdfs_library_path" ...
so pig couldn't find the libraries, because $path_to_mnt was invalid for one of the nodes.