monster monster - 3 months ago 9
Scala Question

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDD's work in Spark so this question is based around theory and may not be 100% correct.

Let's say I create an RDD:

val rdd = sc.textFile(file)


Now once I've done that, does that mean that the file at
file
is now partitioned across the nodes (assuming all nodes have access to the file path)?

Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:

rdd.map(x => x / rdd.size)


Let's say there are 100 objects in
rdd
, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with
rdd.size
as
10
or
100
? Because, overall, the RDD is size
100
but locally on each node it is only
10
. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.

Finally, if I make a transformation to the RDD, e.g.
rdd.map(_.split("-"))
, and then I wanted the new
size
of the RDD, do I need to perform an action on the RDD, such as
count()
, so all the information is sent back to the driver node?

Answer

Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3). It's not an intention to split every file between all available nodes.

However, for you (i.e. the client) working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.

UPDATE: an article describing RDD internals: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Comments