
Spark: Extract zipped keys without computing values

Spark has a zip() function to combine two RDDs. It also has functions to split them apart again: keys() and values(). But to my surprise, if you ask for just the keys(), both RDDs are fully computed, even though the values weren't necessary for the computation.

In this example, I create an RDD of (key, value) pairs, but then I ask only for the keys. Why are the values computed anyway? Does Spark make no attempt to simplify its internal DAG in such cases?

In [1]: def process_value(val):
   ...:     print "Processing {}".format(val)
   ...:     return 2*val
   ...:

In [2]: k = sc.parallelize(['a','b','c'])

In [3]: v = sc.parallelize([1,2,3]).map(process_value)

In [4]: zipped = k.zip(v)

In [5]: zipped.keys().collect()
Processing 1
Processing 2
Processing 3

Out[5]: ['a', 'b', 'c']

Answer

If you look at the source (at least as of 2.0), keys() is simply implemented as

rdd.map(_._1)

That is, it returns the first element of the tuple, so the tuple must be fully instantiated.
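The PySpark side is the same idea: in the 2.0-era pyspark/rdd.py, keys() is an ordinary map over the full pairs (paraphrased below, docstring omitted):

def keys(self):
    # Each element reaching this lambda is the complete (key, value)
    # tuple produced by zip, so the value has already been computed
    # by the time it is discarded here.
    return self.map(lambda x: x[0])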

This might have worked if zip returned an RDD[Seq[K, V]] or some other lazy data structure, but a tuple is not lazy: both of its elements exist the moment it is constructed.
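To see the distinction, here is a plain-Python sketch (eager_pair and LazyPair are made-up names for illustration, not Spark APIs). Building a tuple evaluates both halves immediately; a hypothetical lazy pair could defer the value until it is actually requested:

def eager_pair(key, make_value):
    # A tuple is constructed eagerly: make_value() runs right now,
    # which is what happens inside Spark's zip.
    return (key, make_value())

class LazyPair(object):
    # Hypothetical lazy alternative: the value is only computed if
    # somebody actually asks for it.
    def __init__(self, key, make_value):
        self.key = key
        self._make_value = make_value

    @property
    def value(self):
        return self._make_value()

eager = eager_pair('a', lambda: process_value(1))  # prints "Processing 1"
lazy = LazyPair('a', lambda: process_value(1))     # prints nothing
lazy.key                                           # still prints nothing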

In short: no.
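A practical consequence, sketched with the question's own variables: if only the keys are needed, use the source RDD k directly; only touch the zip result when the values actually matter.

keys_only = k.collect()   # ['a', 'b', 'c']; process_value never runs
pairs = zipped.collect()  # the values are computed here instead
# If you need several actions on zipped, zipped.cache() keeps the
# computed values from being recomputed on each pass.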