Brian B · 3 years ago
Python Question

RDD of pyspark Row lists to DataFrame

I have an RDD whose partitions contain elements (pandas dataframes, as it happens) that can easily be turned into lists of rows. Think of it as looking something like this

import pyspark

rows_list = []
for word in 'quick brown fox'.split():
    rows = []
    for i, c in enumerate(word):
        x = ord(c) + i
        row = pyspark.sql.Row(letter=c, number=i, importance=x)
        rows.append(row)
    rows_list.append(rows)
rdd = sc.parallelize(rows_list)
rdd.take(2)


which gives

[[Row(importance=113, letter='q', number=0),
  Row(importance=118, letter='u', number=1),
  Row(importance=107, letter='i', number=2),
  Row(importance=102, letter='c', number=3),
  Row(importance=111, letter='k', number=4)],
 [Row(importance=98, letter='b', number=0),
  Row(importance=115, letter='r', number=1),
  Row(importance=113, letter='o', number=2),
  Row(importance=122, letter='w', number=3),
  Row(importance=114, letter='n', number=4)]]


I want to turn it into a Spark DataFrame. I hoped that I could just do

rdd.toDF()


but that gives a useless structure: each inner list becomes a single row, with one struct column per Row

DataFrame[_1: struct<importance:bigint,letter:string,number:bigint>,
_2: struct<importance:bigint,letter:string,number:bigint>,
_3: struct<importance:bigint,letter:string,number:bigint>,
_4: struct<importance:bigint,letter:string,number:bigint>,
_5: struct<importance:bigint,letter:string,number:bigint>]


What I really want is a three-column DataFrame, like the one this produces

desired_df = sql_context.createDataFrame(sum(rows_list, []))


so that I can perform operations like

desired_df.agg(pyspark.sql.functions.sum('number')).take(1)


and get the answer 23.

What is the right way to go about this?
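(For reference, the expected value of 23 can be checked locally without a Spark session. This sketch models `pyspark.sql.Row` with a plain `namedtuple` stand-in, which is an assumption for illustration only; the flattening via `sum(rows_list, [])` matches the question above.)

```python
from collections import namedtuple

# Local stand-in for pyspark.sql.Row (hypothetical; Row itself needs pyspark)
Row = namedtuple('Row', ['letter', 'number', 'importance'])

rows_list = []
for word in 'quick brown fox'.split():
    rows = [Row(letter=c, number=i, importance=ord(c) + i)
            for i, c in enumerate(word)]
    rows_list.append(rows)

# Flatten the list of lists, then sum the 'number' field
flat = sum(rows_list, [])
total = sum(r.number for r in flat)
print(total)  # 23  (0+1+2+3+4 for 'quick', same for 'brown', 0+1+2 for 'fox')
```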

Answer Source

You have an RDD of lists of Rows, while you need an RDD of Rows; you can flatten the RDD with flatMap and then convert it to a DataFrame:

rdd.flatMap(lambda x: x).toDF().show()

+----------+------+------+
|importance|letter|number|
+----------+------+------+
|       113|     q|     0|
|       118|     u|     1|
|       107|     i|     2|
|       102|     c|     3|
|       111|     k|     4|
|        98|     b|     0|
|       115|     r|     1|
|       113|     o|     2|
|       122|     w|     3|
|       114|     n|     4|
|       102|     f|     0|
|       112|     o|     1|
|       122|     x|     2|
+----------+------+------+

import pyspark.sql.functions as F

rdd.flatMap(lambda x: x).toDF().agg(F.sum('number')).show()

+-----------+
|sum(number)|
+-----------+
|         23|
+-----------+