Denise Mauldin - 2 months ago
Python Question

pyspark reduceByKey modify single results

I have a dataset that looks like this in pyspark:

samp = sc.parallelize([(1,'TAGA'), (1, 'TGGA'), (1, 'ATGA'), (1, 'GTGT'), (2, 'GTAT'), (2, 'ATGT'), (3, 'TAAT'), (4, 'TAGC')])


I have a function that I'm using to combine the strings:

def combine_strings(x, y):
    if isinstance(x, list) and isinstance(y, list):
        return x + y
    if isinstance(x, list) and isinstance(y, str):
        x.append(y)
        return x
    if isinstance(x, str) and isinstance(y, list):
        y.append(x)
        return y
    return [x, y]


The result I get is:

samp.reduceByKey(lambda x,y : combine_strings(x,y)).collect()
[(1, ['TAGA', 'TGGA', 'ATGA', 'GTGT']), (2, ['GTAT', 'ATGT']), (3, 'TAAT'), (4, 'TAGC')]


What I want is:

[(1, ['TAGA', 'TGGA', 'ATGA', 'GTGT']), (2, ['GTAT', 'ATGT']), (3, ['TAAT']), (4, ['TAGC'])]

Where every value is a list. I can't tell whether pyspark even calls combine_strings for a key with only one entry, or whether I can tell reduceByKey to do something with singleton results. How do I modify reduceByKey() or combine_strings to produce what I'd like?

Answer

reduceByKey only calls the combining function when a key has at least two values, so singleton values pass through untouched. You could sidestep that by first mapping every value into a one-element list, and then only combining lists:

samp.mapValues(lambda x : [x]).reduceByKey(lambda x,y : x + y).collect()
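To see why this fixes the singleton case, here is a pure-Python sketch of the same two-step pattern; map_values and reduce_by_key are stand-in helpers that mimic the Spark operations locally, not Spark APIs:

```python
def map_values(pairs, f):
    # Apply f to the value of each (key, value) pair, like RDD.mapValues.
    return [(k, f(v)) for k, v in pairs]

def reduce_by_key(pairs, f):
    # Combine values that share a key with f, like RDD.reduceByKey.
    # A key seen only once keeps its value as-is, which is why wrapping
    # values in lists up front matters.
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return sorted(out.items())

samp = [(1, 'TAGA'), (1, 'TGGA'), (1, 'ATGA'), (1, 'GTGT'),
        (2, 'GTAT'), (2, 'ATGT'), (3, 'TAAT'), (4, 'TAGC')]

# Every value becomes a list before reducing, so the reducer only ever
# sees lists and singleton keys come out as one-element lists.
result = reduce_by_key(map_values(samp, lambda x: [x]),
                       lambda x, y: x + y)
# result == [(1, ['TAGA', 'TGGA', 'ATGA', 'GTGT']), (2, ['GTAT', 'ATGT']),
#            (3, ['TAAT']), (4, ['TAGC'])]
```

Because the reducer now only ever concatenates lists, the isinstance checks in combine_strings are no longer needed at all.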