Abel Morelos - 18 days ago
Java Question

MapReduce job fails with ExitCodeException exitCode=255

I am trying to run a MapReduce job that requires a shared library (a .so file). When I use the shared library from a standalone Java program, everything works (that program uses java.library.path to find the library). When I call the same native methods from the MapReduce program, which ships the library through the distributed cache, I get the exception pasted below.

I know the native library is loaded and the native code (C++) is called from MapReduce, because the native function prints something to standard output. After the native function returns, however, I see a "Signal caught, exiting" message, and the application logs contain only the information below (I think the 255 is a -1 in this case). I don't know where else to look for information to debug this issue or to figure out why there is an uncaught signal. Any pointers on where to find debugging or log information are appreciated.


Exception from container-launch: ExitCodeException exitCode=255:
ExitCodeException exitCode=255:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 255
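
The guess that 255 is really a -1 is consistent with how POSIX reports exit statuses: only the low 8 bits of the value passed to exit() reach the parent process. A minimal C++ sketch of that arithmetic follows; status_seen_by_parent is a hypothetical helper written for illustration, not a Hadoop or system API, and nothing here proves that the container actually called exit(-1).

```cpp
// Hypothetical helper: models the exit status a parent process observes
// after a child calls exit(code). On POSIX systems only the low 8 bits
// of the code are reported back.
int status_seen_by_parent(int code) {
    return code & 0xFF;
}
```

With this model, status_seen_by_parent(-1) evaluates to 255, which matches the exit code in the container log.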

Answer

I solved my own problem, although I am not sure how to explain the fix. In C++ I have a method that returns a pointer to Java as a jlong (I was casting the pointer to jlong). Everything worked when the library was called from a standalone application, but the same code failed when called from YARN.
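
For readers unfamiliar with the pattern, passing a native pointer to Java as a 64-bit handle can be sketched in plain C++ roughly as below. This is an illustration under assumptions, not the original code: std::int64_t stands in for jlong so the sketch builds without the JNI headers, and Counter, Handle, to_handle and from_handle are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical native object whose address is handed to Java.
struct Counter {
    int value = 0;
};

// Stand-in for jlong (a 64-bit signed integer) so no <jni.h> is needed.
using Handle = std::int64_t;

static_assert(sizeof(void*) <= sizeof(Handle),
              "a pointer must fit into the 64-bit handle");

// Convert a pointer to an integer handle. Going through std::intptr_t
// keeps the pointer-to-integer conversion well defined.
Handle to_handle(Counter* ptr) {
    return static_cast<Handle>(reinterpret_cast<std::intptr_t>(ptr));
}

// Recover the pointer from a handle produced by to_handle.
Counter* from_handle(Handle handle) {
    return reinterpret_cast<Counter*>(static_cast<std::intptr_t>(handle));
}
```

In a real JNI function the value returned by to_handle would be the jlong result, and a later native call would pass it back through from_handle before touching the object. The handle is only valid while the native object is alive.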

I wrote a custom signal handler, which showed that the error was a segmentation fault, so I was doing something wrong with pointers or memory. The exception above appeared only when the library was called from MapReduce (YARN). I changed the pointer cast from jlong to long and returned that value to Java. After that, the problem no longer occurred when the handle to the native object was used later. I am not sure why this helped, but something seems to be going on with how YARN manipulates the environment.
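
A signal handler of the kind mentioned above might look roughly like the following sketch. It is not the handler I used: report_fatal_signal and install_crash_handler are hypothetical names, and it assumes a POSIX system. Note also that the JVM installs its own SIGSEGV handler, so replacing it from a native library can interfere with the JVM; treat this as a temporary debugging aid only.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

// write() is async-signal-safe; printf and std::cerr are not.
static void safe_write(const char* text, std::size_t length) {
    ssize_t written = write(STDERR_FILENO, text, length);
    (void)written;  // nothing useful can be done on failure here
}

// Hypothetical handler: prints the signal number to stderr, then lets
// the process die from the original signal.
static void report_fatal_signal(int sig) {
    static const char prefix[] = "native code caught fatal signal ";
    char digits[16];
    int count = 0;
    int value = sig;
    // Convert the signal number to decimal text by hand (least
    // significant digit first).
    do {
        digits[count++] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value > 0 && count < 16);
    safe_write(prefix, sizeof(prefix) - 1);
    while (count > 0) {
        safe_write(&digits[--count], 1);
    }
    safe_write("\n", 1);
    // Restore the default action and re-raise, so the exit status still
    // reflects the signal instead of resuming the faulting code.
    signal(sig, SIG_DFL);
    raise(sig);
}

// Installs the handler for SIGSEGV; returns true on success.
bool install_crash_handler() {
    struct sigaction action;
    std::memset(&action, 0, sizeof(action));
    action.sa_handler = report_fatal_signal;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGSEGV, &action, nullptr) == 0;
}
```

Calling install_crash_handler() once from the library's initialization code is enough. The message goes to stderr, which in a YARN container should end up in the container's stderr log.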

Best regards.
