I have a small Scala program that runs fine on a single node. However, I am scaling it out so that it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file).map(_.toDouble) // parse each line to a number
val n = rdd.count()                         // action: total number of elements
val scaled = rdd.map(x => x / n)            // divide each element by the count
Usually, the file (or blocks of the file, if it is too big) is replicated to N nodes in the cluster (N = 3 by default in HDFS). There is no intention to split every file across all available nodes.
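For instance, the number of partitions Spark creates for the RDD follows the file's input splits (typically one per HDFS block), not the replication factor or the number of nodes. A minimal sketch, assuming a hypothetical HDFS path:

val rdd = sc.textFile("hdfs:///data/numbers.txt") // hypothetical path
println(rdd.partitions.length) // one partition per input split, typically one per HDFS block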
However, working with the file through Spark should be transparent to you (i.e. the client): you should see the same result from rdd.count(), no matter how many nodes the file is split across and/or replicated to. There are methods (at least in Hadoop) to find out which nodes currently hold (parts of) the file. In simple cases, however, you most probably won't need this functionality.
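If you do need it, here is a sketch using Hadoop's FileSystem API (the path is again hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(new Path("hdfs:///data/numbers.txt")) // hypothetical path
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)     // block locations for the whole file
blocks.foreach(b => println(b.getHosts.mkString(", ")))             // hosts currently holding each block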
UPDATE: an article describing RDD internals: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf