jeffery.yuan - 4 months ago
Scala Question

Spark RDD lifecycle: will an RDD be reclaimed once it goes out of scope?

In a method, I create a new RDD and cache it. Will Spark unpersist the RDD automatically once it goes out of scope?

I would think so, but what actually happens?


No, it won't be unpersisted automatically.

Why? Because it may look to you as if the RDD is no longer needed, but Spark's model is to not materialize an RDD until it is needed for an action, so it is actually very hard to tell "I won't need this RDD anymore." Even for you it can be very tricky, because of situations like the following:

JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create empty RDD for merging
for (int i = 0; i < 10; i++) {
  JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
  rdd.cache(); // Since it will be used twice, cache it.
  rdd.map(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes
  rddUnion = rddUnion.union(rdd.map(...)); // Do another transform to T and merge by union
  rdd.unpersist(); // Now it seems no longer needed. (But it actually is.)
}

// Here, rddUnion actually materializes, and it needs all 10 rdds that were already
// unpersisted. So all 10 rdds will be rebuilt.
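One way to avoid the rebuild, sketched under the same assumptions as the sample above (the elided map(...) transforms are placeholders, and unionOutputPath is a hypothetical output location): keep the cached handles, run an action on rddUnion first, and only then unpersist.

List<JavaRDD<T2>> cached = new ArrayList<>();
JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());
for (int i = 0; i < 10; i++) {
  JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
  rdd.cache();
  rdd.map(...).saveAsTextFile(outputFileNames[i]);
  rddUnion = rddUnion.union(rdd.map(...));
  cached.add(rdd); // keep the handle; do NOT unpersist yet
}
rddUnion.saveAsTextFile(unionOutputPath); // action: the union materializes from the caches
for (JavaRDD<T2> rdd : cached)
  rdd.unpersist(); // now safe: nothing still pending reads these

The point is the ordering: unpersist only after every computation that reads the cached RDD has actually run.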

Credit for the code sample goes to the spark-user mailing list.
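To see why "out of scope" does not mean "no longer needed", here is a plain-Java analogy with no Spark involved: java.util.stream pipelines are lazy in the same spirit. Building the pipeline does no work; only the terminal operation, which plays the role of a Spark action, evaluates the elements.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Builds a lazy pipeline; `evaluations` counts how many elements get processed.
    static Stream<Integer> pipeline(AtomicInteger evaluations) {
        return Stream.of(1, 2, 3)
                     .map(x -> { evaluations.incrementAndGet(); return x * 2; });
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        Stream<Integer> doubled = pipeline(evaluations);
        // Like rdd.map(...), building the pipeline runs nothing yet.
        System.out.println("after map: " + evaluations.get());     // 0
        // The terminal operation is the analogue of a Spark action.
        List<Integer> result = doubled.collect(Collectors.toList());
        System.out.println("after collect: " + evaluations.get()); // 3
        System.out.println(result);                                // [2, 4, 6]
    }
}
```

Discarding the Stream reference before the terminal operation would be exactly the mistake in the loop above: the work it was supposed to feed has not happened yet.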