
pyspark | transforming list of numpy arrays into columns in dataframe

I am trying to take an rdd that looks like:

<code>
[<1x24000 sparse matrix of type ''
with 10 stored elements in Compressed Sparse Row format>, . . . ]
</code>

and ideally turn it into a dataframe that looks like:

<code>
+-----+-----+-----+
|   A |   B |   C |
+-----+-----+-----+
| 1.0 | 0.0 | 0.0 |
| 1.0 | 1.0 | 0.0 |
+-----+-----+-----+
</code>


However, I keep getting this:

<code>
+---------------+
|             _1|
+---------------+
|[1.0, 0.0, 0.0]|
|[1.0, 1.0, 0.0]|
+---------------+
</code>


I am having the darnedest time with this because each row of the RDD is a numpy array.

I used this code to create the dataframe from the rdd:

<code>res.flatMap(lambda x: np.array(x.todense())).map(list).map(lambda l : Row([float(x) for x in l])).toDF()</code>


Note: explode does not help (it puts everything into the same column).

I also tried a UDF on the resulting dataframe, but I cannot seem to split the numpy array into individual values.

Please help!

Answer

<code>Row([float(x) for x in l])</code> builds a Row with a single field whose value is the entire list, which is why everything lands in one column named <code>_1</code>. Unpack the list with <code>*</code> so each value becomes its own field:

<code>.map(lambda l : Row(*[float(x) for x in l]))</code>
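A minimal sketch of why the <code>*</code> matters, using plain tuples as a stand-in for <code>pyspark.sql.Row</code> (which subclasses <code>tuple</code>), so no Spark session is needed; <code>dense_rows</code> is a hypothetical stand-in for the <code>np.array(x.todense())</code> output in the question:

```python
# Row is tuple-like: Row(a, b, c) yields three fields (three columns),
# while Row([a, b, c]) yields ONE field whose value is the whole list.
# Plain tuples model the same distinction without Spark.

dense_rows = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # stand-in for densified sparse rows

# Buggy version: the whole list becomes a single field, hence one column "_1".
buggy = [([float(x) for x in row],) for row in dense_rows]

# Fixed version: unpacking spreads the values into separate fields, one per column.
fixed = [tuple(float(x) for x in row) for row in dense_rows]

print(buggy[0])  # ([1.0, 0.0, 0.0],)  -> one field
print(fixed[0])  # (1.0, 0.0, 0.0)     -> three fields
```

With real Rows the same fix applies: <code>Row(*values).toDF()</code> then produces one column per value instead of a single array column.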