Kratos Kratos - 6 months ago
Python Question

turning pandas to pyspark expression

I need to turn a two-column DataFrame into lists of values grouped by one of the columns. I have done it successfully in pandas:

expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
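To make the target shape concrete, the same grouping can be sketched in plain Python with only the standard library (the session/name values here are taken from the example in the question):

```python
from collections import defaultdict

# Rows of the two-column frame: (session, name)
rows = [(1, "a"), (1, "b"), (2, "v"), (2, "c")]

# Collect the name values into one list per session key,
# which is what groupby(...).agg(lambda x: x.tolist()) produces
grouped = defaultdict(list)
for session, name in rows:
    grouped[session].append(name)

print(dict(grouped))  # {1: ['a', 'b'], 2: ['v', 'c']}
```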

But now I am trying to do the same thing in pySpark as follows:

expertsDF = df.groupBy('session').agg(lambda x: x.collect())

and I am getting the error:

all exprs should be Column

I have tried several commands but I simply cannot get it right, and the Spark documentation does not seem to cover anything similar.

An example input would be the DataFrame:

session  name
1        a
1        b
2        v
2        c

and the desired output would be:

session  name
1        [a, b, ...]
2        [v, c, ...]

You could use reduceByKey() to do this efficiently. First build the DataFrame:

df = sc.parallelize([(1, "a"),
                     (1, "b"),
                     (2, "v"),
                     (2, "c")]).toDF(["session", "name"])

Then map each row to a (key, [value]) pair on the underlying RDD and concatenate the lists per key:

df.rdd.map(lambda x: (x[0], [x[1]])) \
      .reduceByKey(lambda x, y: x + y) \
      .collect()
Out[21]: [(1, [u'a', u'b']), (2, [u'v', u'c'])]
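The reduceByKey(lambda x, y: x + y) step merges the per-key singleton lists by concatenation. As a plain-Python sketch of that semantics (using itertools.groupby and functools.reduce in place of Spark's distributed shuffle and reduce):

```python
from functools import reduce
from itertools import groupby

# Key-value pairs as produced by the map step: (session, [name])
pairs = [(1, ["a"]), (1, ["b"]), (2, ["v"]), (2, ["c"])]

# reduceByKey: bring pairs with the same key together, then fold
# each group's value lists with + (list concatenation)
result = [
    (key, reduce(lambda x, y: x + y, (value for _, value in group)))
    for key, group in groupby(sorted(pairs, key=lambda p: p[0]), key=lambda p: p[0])
]
print(result)  # [(1, ['a', 'b']), (2, ['v', 'c'])]
```

The sort is only needed here because itertools.groupby groups consecutive keys; Spark's reduceByKey handles that grouping for you across partitions.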