
Spark integrated with Hadoop InputFormat confusion

I am currently trying to use a custom InputSplit and RecordReader with Apache Spark's SparkContext hadoopRDD() function.
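
For context, the classes are plugged in roughly like this; the JobConf setup is the standard one, and TextInputFormat here is just a stand-in for my actual custom InputFormat (class and app names are illustrative only):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CustomFormatJob {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("custom-input-format"));

            // The JobConf carries the input path, exactly as in plain Hadoop MapReduce.
            JobConf jobConf = new JobConf();
            FileInputFormat.setInputPaths(jobConf, args[0]);

            // TextInputFormat stands in for the custom InputFormat that creates the
            // custom InputSplit and RecordReader; key/value classes match its output.
            JavaPairRDD<LongWritable, Text> records = sc.hadoopRDD(
                    jobConf, TextInputFormat.class, LongWritable.class, Text.class, 1);

            // A simple map() over the records produced by the RecordReader.
            long count = records.map(pair -> pair._2().toString()).count();
            System.out.println("Records read: " + count);

            sc.stop();
        }
    }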

My question is the following:

Does the value returned by InputSplit.getLength() and/or RecordReader.getProgress() affect the execution of a map() function in the Spark runtime?

I am asking because I have used these two custom classes with Apache Hadoop and they work as intended. However, in Spark, I see that new InputSplit objects are generated at runtime, which is something I do not want my code to do. To be more precise:

At the beginning of the execution, I see in my log files that the correct number of InputSplit objects is generated (let us say only 1 for this example). In turn, a RecordReader object associated with that split is generated and starts fetching records. At some point, I get a message that the Job handling the previous InputSplit stops, and a new Job is spawned with a new InputSplit. I do not understand why this is happening. Does it have to do with the value returned by the RecordReader.getProgress() method or the InputSplit.getLength() method?

Also, I define the InputSplit's length to be some arbitrarily large number of bytes (e.g., 1 GB). Does this value affect the number of Spark Jobs that are spawned at runtime?
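
For reference, the split reports that length along these lines; the class name and the constant are purely illustrative, using the old mapred API that hadoopRDD() expects:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;

    // A minimal split that always reports a fixed, arbitrary length (1 GB here,
    // mirroring the value mentioned above). The RecordReader decides when the
    // split is actually exhausted by returning false from next().
    public class FixedLengthSplit implements InputSplit {

        private static final long REPORTED_LENGTH = 1L << 30; // 1 GB

        @Override
        public long getLength() throws IOException {
            return REPORTED_LENGTH;
        }

        @Override
        public String[] getLocations() throws IOException {
            return new String[0]; // no locality preference
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // nothing to serialize in this sketch
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // nothing to deserialize in this sketch
        }
    }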

Any help and/or advice is welcome.

Thank you,
Nick

P.S.-1: I apologize for posting so many questions, but Apache Spark is a new tool with little documentation on the Hadoop-Spark integration through HadoopRDDs.

P.S.-2: I can provide more technical details if they are needed.

Answer

Yes. If getLength() returns a value, then after reading that number of bytes from your file, Hadoop will generate a new split to read the remaining data. If you do not want this behavior, make your InputFormat non-splittable: if it extends FileInputFormat, override isSplitable() to return false, or implement getSplits() so that it returns a single split covering the whole input.
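
As a sketch of the isSplitable() route, assuming the custom format extends a FileInputFormat subclass in the old mapred API (the class name here is only illustrative):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // A format whose files are never split: each input file becomes exactly one
    // InputSplit, regardless of what getLength() reports. Extending TextInputFormat
    // is only for illustration; a custom format would return its own RecordReader
    // from getRecordReader().
    public class WholeFileTextInputFormat extends TextInputFormat {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // never break a file into multiple splits
        }
    }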

The getProgress() method has nothing to do with generating a new split.
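
As a rough illustration of the old-API contract (the class name and record count are made up): the split ends when next() returns false, while getProgress() is only a reporting hint.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.RecordReader;

    // Sketch of the old-API RecordReader contract: the split is finished when
    // next() returns false; getProgress() is only a progress estimate used for
    // reporting and does not trigger new splits.
    public class FixedCountRecordReader implements RecordReader<LongWritable, Text> {

        private static final long TOTAL_RECORDS = 10; // illustrative limit
        private long emitted = 0;

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (emitted >= TOTAL_RECORDS) {
                return false; // this, not getProgress(), ends the split
            }
            key.set(emitted);
            value.set("record-" + emitted);
            emitted++;
            return true;
        }

        @Override
        public LongWritable createKey() { return new LongWritable(); }

        @Override
        public Text createValue() { return new Text(); }

        @Override
        public long getPos() throws IOException { return emitted; }

        @Override
        public float getProgress() throws IOException {
            return (float) emitted / TOTAL_RECORDS;
        }

        @Override
        public void close() throws IOException {
            // nothing to release in this sketch
        }
    }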
