Python Question

turning pandas to pyspark expression

I need to turn a two-column DataFrame into lists of values grouped by one of the columns. I have done this successfully in pandas:

expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())


But now I am trying to do the same thing in PySpark as follows:

expertsDF = df.groupBy('session').agg(lambda x: x.collect())


and I am getting the error:

all exprs should be Column


I have tried several commands but I simply cannot get it right, and the Spark documentation does not seem to cover this case.

An example input would be the DataFrame:

session  name
1        a
1        b
2        v
2        c


and the desired output:

session  name
1        [a, b, ...]
2        [v, c, ...]

Answer

You could use reduceByKey() to do this efficiently. The error appears because DataFrame.agg() expects Column expressions rather than arbitrary Python functions, so dropping down to the RDD API is one way around it:

(df.rdd                                    # work with the underlying RDD of Rows
 .map(lambda x: (x[0], [x[1]]))            # turn each row into a (session, [name]) pair
 .reduceByKey(lambda x, y: x + y)          # concatenate the per-session lists
 .collect())
Out[21]: [(1, [u'a', u'b']), (2, [u'v', u'c'])]

Data:

df = sc.parallelize([(1, "a"),
                     (1, "b"),
                     (2, "v"),
                     (2, "c")]).toDF(["session", "name"])