Kratos - 1 year ago
Python Question

Merging lists in a single pySpark dataframe

I am going through the pySpark 1.6.2 documentation, trying to merge my data into a single dataframe.

I have a list of 19 items (listname:sizes):

[9, 78, 13, 3, 57, 60, 66, 32, 24, 1, 2, 15, 2, 2, 76, 79, 100, 73, 4]

and a 2D list containing 19 sub-lists of unequal length (listname:data):


I am trying to create a dataframe that looks like this:

    name                     size
0   [a, b, c]                9
1   [d, e, f, g, h, i, j]    78
2   ........                 ...
.   ........                 ...
.   ........                 ...
18  [x, y, z, a, f]          4

But I can't figure out a way to do that.

I have already iterated through the lists and could append the two columns pair by pair, but I can't find a way to create a dataframe and fill it step by step.

This is my code:

schema = StructType([StructField("name", StringType(), True),
                     StructField("size", IntegerType(), True)])
dataframe = sqlContext.createDataFrame([], schema)

for i in range(len(data)):
    t = sqlContext.DataFrame([[data[i], sizes[i]]],
                             columns=['name', 'size'])
    dataframe = dataframe.append(t, ignore_index=True)

but it fails with an error:


Answer Source

There is an easy way to do this using the zip() function. If you do:

t = zip(data, sizes)

you will have a list of tuples, one for each pair (on Python 3, wrap it in list(), since zip() returns an iterator there):

[(['a', 'b', 'c'], 9),
 (['d', 'e', 'f', 'g', 'h', 'i', 'j'], 78),
 (['x', 'y', 'z', 'a', 'f'], 4)]

Now you just have to create the DataFrame using the list of tuples:

dataframe = sqlContext.createDataFrame(t, schema)
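A minimal sketch of the whole flow, using made-up sample data in place of the question's full lists. The zip() pairing is plain Python; the Spark call at the end is shown as a comment and assumes a SQLContext is in scope. Note that since the "name" column holds lists of strings, ArrayType(StringType()) matches the data better than the StringType used in the question's schema:

```python
# Hypothetical sample data mirroring the question's structure
data = [['a', 'b', 'c'],
        ['d', 'e', 'f', 'g', 'h', 'i', 'j'],
        ['x', 'y', 'z', 'a', 'f']]
sizes = [9, 78, 4]

# zip() pairs each sub-list with its size; list() is needed on
# Python 3, where zip() returns an iterator rather than a list.
rows = list(zip(data, sizes))
# rows == [(['a', 'b', 'c'], 9),
#          (['d', 'e', 'f', 'g', 'h', 'i', 'j'], 78),
#          (['x', 'y', 'z', 'a', 'f'], 4)]

# With a SQLContext in scope, the dataframe is then built in one call:
# schema = StructType([StructField("name", ArrayType(StringType()), True),
#                      StructField("size", IntegerType(), True)])
# dataframe = sqlContext.createDataFrame(rows, schema)
```

This avoids building the dataframe row by row entirely: createDataFrame accepts the full list of tuples at once.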