
Garbage Collection of RDDs

I have a fundamental question about Spark. Spark maintains the lineage of RDDs so that it can recompute them if some of them are lost or corrupted, so the JVM cannot see them as orphaned objects. How and when, then, does garbage collection of RDDs happen?

Answer

The memory used for RDD storage can be configured using the "spark.storage.memoryFraction" property.

If this limit is exceeded, older partitions will be dropped from memory (least recently used first).

We can set it to a value between 0 and 1, describing what fraction of the executor JVM heap will be dedicated to caching RDDs. The default value is 0.6.

For example, with a 2 GB executor heap, about 0.6 * 2 GB = 1.2 GB is reserved for RDD storage by default, leaving roughly 0.8 GB of the heap for task execution and other objects.
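As a rough sketch of how these settings are applied (the application name, heap size, and input path below are illustrative placeholders, not taken from the question):

    import org.apache.spark.{SparkConf, SparkContext}

    object StorageFractionDemo {
      def main(args: Array[String]): Unit = {
        // Hypothetical settings: a 2 GB executor heap with the default
        // storage fraction leaves roughly 1.2 GB for cached RDD partitions.
        val conf = new SparkConf()
          .setAppName("storage-fraction-demo")
          .set("spark.executor.memory", "2g")
          .set("spark.storage.memoryFraction", "0.6") // the default in Spark 1.x

        val sc = new SparkContext(conf)

        // Cached partitions that no longer fit in the storage region are
        // evicted from memory; lineage is only used to recompute them if needed.
        val lines = sc.textFile("hdfs:///path/to/input").cache()
        println(lines.count())

        sc.stop()
      }
    }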

We can configure Spark properties to print more details about how GC is behaving:

Set spark.executor.extraJavaOptions to include:

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
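For instance, the options can be set programmatically on the SparkConf before the context is created, or passed with spark-submit --conf (the application name here is just a placeholder):

    import org.apache.spark.SparkConf

    // Equivalent to:
    //   spark-submit --conf "spark.executor.extraJavaOptions=-verbose:gc \
    //     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" ...
    val conf = new SparkConf()
      .setAppName("gc-logging-demo")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The GC messages then show up in each executor's stdout on the worker nodes, not in the driver's output.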

If your tasks slow down and you find that the JVM is garbage-collecting frequently or running out of memory, lowering the "spark.storage.memoryFraction" value will help reduce the memory used for caching.
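A minimal sketch of this, plus one further lever not mentioned above: calling unpersist() on RDDs you no longer need releases their cached blocks explicitly (the input path below is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("reduce-cache-pressure")
      .set("spark.storage.memoryFraction", "0.4") // down from the 0.6 default

    val sc = new SparkContext(conf)

    val cached = sc.textFile("hdfs:///path/to/input").cache()
    println(cached.count())

    // Releasing the blocks makes them eligible for normal JVM garbage
    // collection; the lineage only keeps the recipe to recompute them.
    cached.unpersist()

    sc.stop()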

For more details, have a look at the reference below:

http://spark.apache.org/docs/1.2.1/tuning.html#garbage-collection-tuning