igx igx - 3 months ago 14
Scala Question

Why is extracting an argument in Spark to a local variable considered safer?

I saw this example in this book “Learning Spark: Lightning-Fast Big Data Analysis”:

import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  // more methods here
  // split returns Array[String], so the result is an RDD[Array[String]]
  def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Safe: extract just the field we need into a local variable
    val query_ = this.query
    rdd.map(x => x.split(query_))
  }
}


My question is about the comment, which says:
Safe: extract just the field we need into a local variable

Why is extracting to a local variable safer than using the field (defined as a val) itself?

Answer

Passing Functions in Spark is really helpful and has the answer to your question.

The idea is that you want only the query to be communicated to the workers that need it, and not the whole object (of the class).

If you didn't do it that way (if you used the field in your map() instead of the local variable), then query really means this.query, and you end up:

...sending the object that contains that class along with the method. In a similar way, accessing fields of the outer object will reference the whole object
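
The capture described in that quote can be reproduced without a cluster. The following is a minimal sketch, assuming Scala 2.12 or later (where function literals are serializable) and using plain Java serialization as a stand-in for what Spark does when it ships a closure to the workers. The method names splitByField and splitByLocal and the ClosureCaptureDemo object are hypothetical, made up for this illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for the book's class. It is deliberately NOT Serializable.
class SearchFunctions(val query: String) {
  // Uses the field directly: `query` means `this.query`, so the closure captures `this`.
  def splitByField: String => Array[String] = x => x.split(query)

  // Copies the field into a local val first, so the closure captures only a String.
  def splitByLocal: String => Array[String] = {
    val query_ = this.query
    x => x.split(query_)
  }
}

object ClosureCaptureDemo {
  // Tries to Java-serialize a closure.
  // Right(size in bytes) on success, Left(exception message) on failure.
  def trySerialize(f: AnyRef): Either[String, Int] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try {
      out.writeObject(f)
      out.flush()
      Right(bytes.size)
    } catch {
      case e: NotSerializableException => Left(e.getMessage)
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val s = new SearchFunctions(",")
    println(trySerialize(s.splitByField)) // a Left: the whole SearchFunctions object would have to be serialized
    println(trySerialize(s.splitByLocal)) // a Right: only the captured String is serialized
  }
}
```

In a real job, the first case is the one that typically surfaces as a "Task not serializable" error when the enclosing class cannot be serialized.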


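Note that this is also safer, not just more efficient, because it minimizes memory usage. It also avoids a "Task not serializable" failure when the enclosing class (like SearchFunctions here, which does not extend Serializable) cannot be serialized at all.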

When handling really big data, your job will run up against its memory limits, and if it exceeds them, it will be killed by the resource manager (for example YARN). So we want to use as little memory as possible, to make sure our job completes instead of failing!

Moreover, a big object results in larger communication overhead. When the amount of data being sent is too large, the TCP connection may be reset by the peer, which adds unnecessary overhead. We want to avoid that, because bad communication is also a reason for a job to fail.
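
To get a feel for how much extra data a captured object can add, here is a rough sketch that compares the serialized size of the two kinds of closure. It assumes Scala 2.12 or later and uses plain Java serialization rather than Spark itself. The class BigSearchFunctions, its lookupTable field, and the ClosureSizeDemo object are hypothetical names invented for this comparison.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical class: Serializable this time, but carrying 1 MiB of state the task never uses.
class BigSearchFunctions(val query: String) extends Serializable {
  val lookupTable: Array[Byte] = new Array[Byte](1024 * 1024)

  // Captures `this`, and with it the whole lookupTable.
  def splitByField: String => Array[String] = x => x.split(query)

  // Captures only the local String.
  def splitByLocal: String => Array[String] = {
    val query_ = this.query
    x => x.split(query_)
  }
}

object ClosureSizeDemo {
  // Number of bytes produced by Java-serializing the closure.
  def serializedSize(f: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try {
      out.writeObject(f)
      out.flush()
    } finally out.close()
    bytes.size
  }

  def main(args: Array[String]): Unit = {
    val s = new BigSearchFunctions(",")
    println(s"field reference: ${serializedSize(s.splitByField)} bytes") // more than 1 MiB
    println(s"local variable:  ${serializedSize(s.splitByLocal)} bytes") // far smaller
  }
}
```

Both closures serialize successfully here, but the one that references the field drags the whole object along with it, which is the overhead the local variable avoids.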