I am currently trying to use custom InputSplit and RecordReader classes with Apache Spark's newAPIHadoopRDD() API. My question is the following:

Does the value returned by InputSplit.getLength() and/or RecordReader.getProgress() affect the execution of a map() function in the Spark runtime?
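For reference, my setup boils down to something like the stripped-down sketch below. All names here are placeholders, and the synthetic record generation stands in for the real fetching logic; it is only meant to show how the two classes are wired into Spark:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CustomInputSketch {

    // Placeholder for my custom InputSplit. Spark requires the split to be
    // Writable; getLength() returns the arbitrary value discussed below.
    public static class MySplit extends InputSplit implements Writable {
        @Override public long getLength() { return 1024L * 1024L * 1024L; } // 1 GB, arbitrary
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }

    // Placeholder for my custom RecordReader: the real one fetches records
    // from an external source; this one just synthesizes a fixed number.
    public static class MyReader extends RecordReader<LongWritable, Text> {
        private static final long EXPECTED = 100; // placeholder record count
        private long read = 0;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override public void initialize(InputSplit split, TaskAttemptContext ctx) { }
        @Override public boolean nextKeyValue() {
            if (read >= EXPECTED) return false;
            key.set(read);
            value.set("record-" + read);
            read++;
            return true;
        }
        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return (float) read / EXPECTED; }
        @Override public void close() { }
    }

    public static class MyInputFormat extends InputFormat<LongWritable, Text> {
        @Override public List<InputSplit> getSplits(JobContext ctx) {
            return Collections.singletonList(new MySplit()); // exactly one split
        }
        @Override public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext ctx) {
            return new MyReader();
        }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("custom-input-sketch"));
        JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopRDD(
                new Configuration(), MyInputFormat.class,
                LongWritable.class, Text.class);
        System.out.println("records: " + records.count());
        sc.stop();
    }
}
```

Since getSplits() returns a single split, I would expect exactly one partition, and therefore a single RecordReader doing all of the reading.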
I am asking because I have used these two custom classes on Apache Hadoop, where they work as intended. In Spark, however, I see that new RecordReader objects are created during runtime, which is something I do not want my code to do. To be more precise:

At the beginning of the execution, I see in my log files that the correct number of InputSplit objects is created (say, only one for this example). In turn, a RecordReader object associated with that split is created and starts fetching records. At some point, I get a message that the Job handling the previous RecordReader stops, and a new Job is spawned with a new RecordReader. I do not understand why this happens. Does it have to do with the value returned by RecordReader.getProgress()?
Also, I define the InputSplit's length to be an arbitrarily large number of bytes (e.g., 1 GB). Does this value affect the number of Spark Jobs that are spawned during runtime?
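For completeness, these are the two overrides my questions refer to, isolated from the sketch above (the constant and the counters are again placeholders, not my production values):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;

// The two return values my questions are about, pulled out on their own.
public class QuestionedValues {

    // The split advertises a fixed 1 GB; no real data of that size backs it.
    public static abstract class ArbitraryLengthSplit extends InputSplit {
        @Override
        public long getLength() {
            return 1024L * 1024L * 1024L; // 1 GB, picked arbitrarily
        }
    }

    // Progress is the fraction of expected records read so far; the two
    // counters stand in for the bookkeeping my real reader does.
    public static abstract class CountingReader extends RecordReader<LongWritable, Text> {
        protected long recordsRead;
        protected long recordsExpected;

        @Override
        public float getProgress() {
            if (recordsExpected <= 0) {
                return 1.0f; // avoid division by zero before the total is known
            }
            return Math.min(1.0f, (float) recordsRead / recordsExpected);
        }
    }
}
```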
Any help and/or advice is welcome!
P.S.-1: I apologize for posting so many questions, but Apache Spark is a new tool, and there is little documentation on the Hadoop-Spark integration through custom InputSplit and RecordReader implementations.
P.S.-2: I can provide more technical details if they are needed.