I have an rdd with multiple values(list) against one key, I want to filter the garbage out from each value in a key.
rdd has this data
((key1, [('',val1),('', val2),..]),(key2,[...)
((key1,[val1, val2,...]), key2[...)
values = 
for a in x:
The main idea is to consider each entry of an RDD as a single collection an process it as so. Meaning, if we consider the following entry
entry = ("key1", [('',"val1"),('',"val2")])
to process this collection into the expected output, we need to understand the structure of the collection
entry # 'key1' entry # [('', 'val1'), ('', 'val2')]
now let's work on this second part :
map(lambda x : x,entry) # ['val1', 'val2']
We can now define a function that takes an entry as an input and the resulting output will be a (key,[values...]) tuple. We'll call it
mapper. We can apply the mapper on every entry in the rdd.
Putting the code together :
def mapper(entry): return (entry,map(lambda x : x,entry)) data = [("key1", [('',"val1"),('',"val2")]),("key2",[('',"val3"),('',"val2"),('',"val4")])] rdd = sc.parallelize(data) rdd2 = rdd.map(lambda x : mapper(x)) rdd2.collect() # [('key1', ['val1', 'val2']), ('key2', ['val3', 'val2', 'val4'])]