Linghao Linghao - 4 months ago 22
Python Question

How can I use "for" loop in spark with pyspark

I met a problem while using spark with python3 in my project. In a Key-Value pair, like

('1','+1 2,3')
, the part
was the content I wanted to check. So I wrote the following code:

(Assume this key-Value pair was saved in a RDD called p_list)

def add_label(x):
for i in value:
return (i,label)

After doing like that, I could only get the result:
and it should be
. It seems like that the "for" loop in map operation just did once. How can I let it do multiple times? Or is there any other way I can use to implement such a function like "for" loop in map operation or reduce operation?

I want to mention that what I really deal with is a large dataset. So I have to use AWS cluster and implement the loop with parallelization. The slave nodes in the cluster seem not to understand the loop. How can I let them know that with Spark RDD function? Or how can have such a loop operation in another pipeline way (which is one of the main design of Spark RDD)?


Your return statement cannot be inside the loop; otherwise, it returns after the first iteration, never to make it to the second iteration.

What you could try is this

result = []
for i in value:
return result

and then result would be a list of all of the tuples created inside the loop.