Python Question

How can I use a "for" loop in Spark with PySpark?

I ran into a problem while using Spark with Python 3 in my project. In a key-value pair like ('1', '+1 2,3'), the part "2,3" is the content I want to check, so I wrote the following code (assume this key-value pair is stored in an RDD called p_list):

def add_label(x):
    label = x[1].split()[0]
    value = x[1].split()[1].split(",")
    for i in value:
        return (i, label)

p_list = p_list.map(add_label)

After running this, I only got the result ('2', '+1'), but it should be both ('2', '+1') and ('3', '+1'). It seems that the "for" loop inside the map operation only ran once. How can I make it run over every element? Or is there some other way to implement something like a "for" loop inside a map or reduce operation?

I should mention that I am actually working with a large dataset, so I have to run this on an AWS cluster and parallelize the loop. The worker nodes in the cluster do not seem to handle the loop. How can I express this with Spark RDD functions, or implement such a loop in a pipelined way (which is one of the main design ideas behind Spark RDDs)?

Answer

Your return statement cannot be inside the loop; otherwise, the function returns after the first iteration and never reaches the second one.
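
For illustration, here is a plain-Python sketch of that behavior (the function and variable names are made up for the example):

def first_only(values):
    for v in values:
        return v  # the function exits here on the very first element

print(first_only(["2", "3"]))  # prints '2'; '3' is never reached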

What you could try is this:

result = []
for i in value:
    result.append((i,label))
return result

and then result would be a list of all of the tuples created inside the loop.
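
Since the function now returns a list for each input record, map would give you an RDD whose elements are lists. If you want each tuple to become its own record, as in your expected output, flatMap is the usual way to flatten the results. A minimal sketch, assuming p_list is the RDD from the question:

def add_label(x):
    # x is a pair like ('1', '+1 2,3'): the label comes first, then the values
    label = x[1].split()[0]
    values = x[1].split()[1].split(",")
    # build one (value, label) tuple per element instead of returning early
    return [(v, label) for v in values]

# flatMap flattens the returned lists, so each tuple becomes a separate record
p_list = p_list.flatMap(add_label)
# p_list.collect() -> [('2', '+1'), ('3', '+1')]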
