
Spark: Extract zipped keys without computing values

Spark has a zip() function to combine two RDDs into one RDD of pairs. It also has functions to split them apart again: keys() and values(). But to my surprise, if you ask for just the keys(), both RDDs are fully computed, even though the values weren't necessary for the result.

In this example, I create an RDD of (key, value) pairs, but then I only ask for the keys. Why are the values computed anyway? Does Spark make no attempt to simplify its internal DAG in such cases?

In [1]: def process_value(val):
   ...:     print "Processing {}".format(val)
   ...:     return 2*val

In [2]: k = sc.parallelize(['a','b','c'])

In [3]: v = sc.parallelize([1,2,3]).map(process_value)

In [4]: zipped = k.zip(v)

In [5]: zipped.keys().collect()
Processing 1
Processing 2
Processing 3

Out[5]: ['a', 'b', 'c']


If you look at the source (at least in 2.0), keys() is simply implemented as

rdd.map(_._1)

i.e. returning the first element of each tuple, so the tuple must be fully instantiated.
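
The PySpark side is the same idea; as a minimal sketch of what keys() amounts to (paraphrasing rdd.py, not quoting it):

def keys(self):
    # x is a fully built (key, value) tuple by the time x[0] is taken,
    # so the work that produced the value cannot be skipped.
    return self.map(lambda x: x[0])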

This might have worked if zip returned RDD[Seq[K, V]] or some other lazy data structure, but a tuple is not lazy: both of its elements exist as soon as the tuple does.
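
To see why both parents run, here is a loose Python sketch (an illustration of the idea, not Spark's actual ZippedPartitionsRDD code) of what zip does per partition. Each output pair is built by pulling one element from each parent iterator in lockstep, and pulling from the values iterator is exactly what triggers process_value:

def zip_partitions(keys_iter, values_iter):
    # Python's built-in zip advances BOTH iterators for every pair,
    # so the values side (including map(process_value)) always executes.
    for k_elem, v_elem in zip(keys_iter, values_iter):
        yield (k_elem, v_elem)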

In short: no.
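
A practical consequence: if you only need the keys, don't zip at all; act on the keys RDD directly, and the values RDD never enters the lineage. Continuing the session above would give:

In [6]: k.collect()
Out[6]: ['a', 'b', 'c']

No "Processing ..." lines appear, because v is never computed.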