
Spark RDDs - how do they work

I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark, so this question is based around theory and may not be 100% correct.

Let's say I create an RDD:

val rdd = sc.textFile(file)

Now once I've done that, does that mean that the file is now partitioned across the nodes (assuming all nodes have access to the file path)?
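
For what it's worth, a quick way to see how the data was actually split is to ask the RDD for its partitions. A minimal sketch (variable names are mine; file stands for whatever path you load):

val rdd = sc.textFile(file)        // lazy: nothing is read yet
println(rdd.partitions.size)       // number of partitions Spark created for the file
val rdd10 = sc.textFile(file, 10)  // optional minPartitions hint: ask for at least 10 splits

The partitioning is decided from the file's blocks (plus the optional minPartitions hint), and the data is only read once an action runs.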

Secondly, I want to count the number of objects in the RDD (simple enough); however, I then need to use that count in a calculation that is applied to the objects in the RDD. A pseudocode example: rdd.map(x => x / rdd.size)

Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works). Now, when I call the method, is each node going to perform the calculation with rdd.size = 10? Because, overall, the RDD is size 100, but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
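
A sketch of the usual pattern, assuming rdd holds numeric values (variable names are mine): run the count as a separate action first, then use the resulting plain number inside the transformation. You cannot call rdd.size or rdd.count from inside a map running on the executors, but a simple Long computed on the driver is shipped to every node as part of the closure, so each node divides by the global 100, not its local 10:

val total = rdd.count()               // action: global count across all partitions
val scaled = rdd.map(x => x / total)  // total is captured in the closure and sent to each executor

// an explicit broadcast variable does the same job and pays off for large shared values:
val totalBc = sc.broadcast(total)
val scaled2 = rdd.map(x => x / totalBc.value)

So for a single number a broadcast variable is not strictly required; broadcasts matter when the shared value is large (e.g. a lookup table), since they are shipped once per executor rather than once per task.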

Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I want the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
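
As a sketch of the transformation/action split (variable names are mine):

val parts = rdd.map(_.split("-"))  // transformation: lazy, only records the lineage
val n = parts.count()              // action: triggers the actual distributed job

Note that count() does not ship all the information back to the driver: each executor counts the elements in its own partitions and only the partial counts travel to the driver, which sums them. An action like collect(), by contrast, really does pull the whole dataset back.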


Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default, N=3 in HDFS). The intent is not to split every file between all available nodes.

However, for you (i.e. the client), working with the file through Spark should be transparent: you should not see any difference in rdd.size, no matter how many nodes it is split across and/or replicated to. There are methods (at least in Hadoop) to find out on which nodes the parts of the file are currently located, but in simple cases you most probably won't need that functionality.
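
For example, Spark itself can tell you where an RDD's partitions prefer to run, and how many elements each partition holds; a sketch (output depends entirely on your cluster and block layout):

rdd.partitions.foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p)}")
}

rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
   .collect()
   .foreach { case (i, n) => println(s"partition $i holds $n elements") }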

UPDATE: an article describing RDD internals: