
Understanding Spark container failure

I have a Spark job that runs on AWS EMR. It often fails after a few steps and gives error messages like:

2016-08-18 23:29:50,167 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 48.6 GB of 52 GB physical memory used; 52.6 GB of 260 GB virtual memory used
2016-08-18 23:29:53,191 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 1.2 MB of 52 GB physical memory used; 110.4 MB of 260 GB virtual memory used
2016-08-18 23:29:56,208 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 1.2 MB of 52 GB physical memory used; 110.4 MB of 260 GB virtual memory used
2016-08-18 23:29:56,385 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #0): Exit code from container container_ is : 52
2016-08-18 23:29:56,387 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #0): Exception from container-launch with container ID: container_ and exit code: 52
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-08-18 23:29:56,393 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0):
2016-08-18 23:29:56,455 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch (ContainersLauncher #0): Container exited with a non-zero exit code 52


From what I can find, this seems to be an OOM error. Looking earlier in the logs, I can see this:

2016-08-18 23:19:00,462 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 53.6 GB of 52 GB physical memory used; 104.4 GB of 260 GB virtual memory used
2016-08-18 23:19:03,480 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 53.9 GB of 52 GB physical memory used; 104.4 GB of 260 GB virtual memory used
2016-08-18 23:19:06,498 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 53.9 GB of 52 GB physical memory used; 104.4 GB of 260 GB virtual memory used
2016-08-18 23:19:09,516 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 7031 for container-id container_: 53.8 GB of 52 GB physical memory used; 104.4 GB of 260 GB virtual memory used


My questions:

  1. Is this an OOM?
  2. Why is there a 10-minute gap between the over-allocation (?) and the exit?
  3. Do I need more executors in my job, or do I have an incorrect param somewhere?


Answer

The answer to your first question is almost definitely "yes". I'm guessing that if you look into the YARN NodeManager logs you will see a number of "running beyond physical memory limits" errors, which is basically YARN's way of saying OOM. Exit code 52 points the same way: if I remember correctly, that is the code a Spark executor exits with when it hits an uncaught OutOfMemoryError (SparkExitCode.OOM).

As for question 2: an executor will of course die when it encounters an OOM, but your job will normally not be killed off immediately. YARN has to be resilient to executors suddenly dying, so it will simply try to re-execute the tasks from the failing executor on other executors. Only when several executors die will it shut down your job.
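
For reference, the thresholds involved here are configurable. Below is a minimal, hypothetical Java sketch, assuming Spark on YARN; spark.task.maxFailures and spark.yarn.max.executor.failures are real Spark settings, but the values are placeholders, not recommendations:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class FailureToleranceSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("failure-tolerance-sketch")
                // How many times a single task may fail before the whole job is aborted.
                .set("spark.task.maxFailures", "8")
                // On YARN: how many executor failures the application tolerates before it gives up.
                .set("spark.yarn.max.executor.failures", "10");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... the actual job would run here ...
        sc.stop();
    }
}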

Finally, OOMs can happen for a number of reasons, and the right fix depends on the cause, so you will have to do some digging :) Here are a few typical causes you may want to look into:

1) If you are not already doing so, you probably need to increase the memory overhead, i.e. spark.yarn.executor.memoryOverhead when running on YARN (see the first sketch after this list). The default is a fraction of the executor memory, but I find that it is frequently too low, so increasing it often helps. However, if you find that you need to increase the overhead to more than 1/3 of your available memory, you should probably look for other solutions.

2) You might be in a situation where your data is very skewed, in which case you may be able to solve the problem by repartitioning the data, or by rethinking how your data is partitioned in the first place (the second sketch below shows the simplest form of this).

3) Your cluster may simply not be big enough for your needs, or you may be running on instance types that are not suitable for your job. Changing to instance types with more memory may solve your problem.
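
To make 1) concrete, here is a minimal Java sketch, assuming Spark on YARN (as on EMR). The property spark.yarn.executor.memoryOverhead is given in MB, the numbers are placeholders rather than recommendations, and on EMR you would more likely pass these as --conf arguments to spark-submit than hard-code them:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MemoryOverheadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("memory-overhead-sketch")
                // Heap for each executor JVM (placeholder value).
                .set("spark.executor.memory", "42g")
                // Extra off-heap headroom per executor, in MB (placeholder value).
                // Executor memory + this overhead is roughly the container size YARN
                // allocates and enforces (the 52 GB limit in your logs).
                .set("spark.yarn.executor.memoryOverhead", "6144");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... the actual job would run here ...
        sc.stop();
    }
}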
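For 2), the simplest counter-measure is to spread the data over more, evenly sized partitions before the expensive stage, so no single executor ends up holding a huge partition. A hypothetical Java sketch (the input path and partition count are placeholders; if the skew comes from a few hot keys, plain repartitioning will not be enough and you would need something like key salting):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RepartitionSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("repartition-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder input; substitute your real data source.
        JavaRDD<String> lines = sc.textFile("s3://your-bucket/your-input/");

        // Redistribute the records into a fixed number of roughly equal partitions.
        JavaRDD<String> balanced = lines.repartition(400);  // partition count is a placeholder

        // ... the expensive per-partition work would go here ...
        System.out.println("record count: " + balanced.count());

        sc.stop();
    }
}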

Generally, I would recommend that you look at how your cluster is being utilized in Ganglia. If Ganglia shows only a few workers doing anything, you are most likely in scenario 2. If all workers are busy and you are simply using up all available memory, then scenario 3 is worth considering.

I hope this helps.