Anna - 3 years ago
Scala Question

How to get columns from dataframe into a list in spark

I have a dataframe that has about 80 columns, and I need to get 12 of them into a collection; either an Array or a List is fine. I did google a bit and found this:

df.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()

The problem is, this works for only one column. If I do df.select(col1, col2, col3, ...) and map the same way, it doesn't give me what I need.

What I want is each column's values collected into its own array. Is there any way to do this in Spark?

Thanks in advance.


For example, I have a dataframe:

1 2 3
4 5 6

I need to get the columns into this format:

[1, 4], [2, 5], [3, 6]

Hope this is clearer... sorry for the confusion.

Answer

You can get a tuple of Array[Any] values, one per column, by doing the following:

scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => (Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res6: org.apache.spark.rdd.RDD[(Array[Any], Array[Any], Array[Any], Array[Any])] = MapPartitionsRDD[34] at map at <console>:32

An RDD behaves much like an Array, so the arrays you need are above. If you want an RDD[Array[Array[Any]]] instead, you can do:

scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => Array(Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res7: org.apache.spark.rdd.RDD[Array[Array[Any]]] = MapPartitionsRDD[39] at map at <console>:32

You can proceed the same way for your twelve columns.
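Writing out twelve indices by hand gets tedious. A minimal sketch of the same row-to-array mapping driven by a Seq of column names; the local SparkSession, the toy three-column dataframe, and the `wanted` names below are assumptions for illustration, not from the answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[1]").appName("cols-to-arrays").getOrCreate()
import spark.implicits._

// Toy stand-in for the 80-column dataframe (hypothetical data).
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("col1", "col2", "col3")

// The twelve wanted names would go here; any Seq of column names works.
val wanted = Seq("col1", "col2", "col3")

// One row -> one Array[Any] holding the selected values, in the given order.
val rows: Array[Array[Any]] =
  df.select(wanted.map(col): _*)
    .rdd
    .map(row => wanted.indices.map(row(_)).toArray)
    .collect()
```

This scales to any number of columns without repeating an index per column.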


Your updated question is clearer. You can use the collect_list function before converting to an RDD, then carry on as before.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val rdd = df.select(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).rdd.map(row => Array(row(0), row(1), row(2), row(3)))
rdd: org.apache.spark.rdd.RDD[Array[Any]] = MapPartitionsRDD[41] at map at <console>:36

scala> rdd.map(array => array.map(element => println(element))).collect
WrappedArray(1, 1)
WrappedArray(2, 2)
WrappedArray(3, 3)
WrappedArray(4, 4)
res8: Array[Array[Unit]] = Array(Array((), (), (), ())) 
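If the goal is one plain Scala collection per column, in the [1, 4], [2, 5], [3, 6] shape from the question, a hedged alternative to collect_list is to collect the selected rows and transpose them on the driver. The SparkSession setup and toy dataframe below are assumptions for the sketch, and this only suits data small enough to collect:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("transpose-cols").getOrCreate()
import spark.implicits._

// Toy dataframe matching the question's example rows (hypothetical names).
val df = Seq((1, 2, 3), (4, 5, 6)).toDF("col1", "col2", "col3")

// Collect the rows, then flip rows-of-values into columns-of-values.
val perColumn: Seq[Seq[Any]] =
  df.select("col1", "col2", "col3")
    .collect()
    .map(_.toSeq)
    .toSeq
    .transpose
```

Here perColumn(0) holds every value of col1, perColumn(1) every value of col2, and so on.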

Dataframe only

You can do all of this in the dataframe itself, with no need to convert to an RDD.

Given that you have a dataframe such as:

+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|
+----+----+----+----+----+----+
|1   |2   |3   |4   |5   |6   |
|1   |2   |3   |4   |5   |6   |
+----+----+----+----+----+----+

You can simply do the following:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> df.select(array(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).as("collectedArray")).show(false)
+--------------------------------------------------------------------------------+
|collectedArray                                                                  |
+--------------------------------------------------------------------------------+
|[WrappedArray(1, 1), WrappedArray(2, 2), WrappedArray(3, 3), WrappedArray(4, 4)]|
+--------------------------------------------------------------------------------+
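To turn a collect_list result back into ordinary Scala sequences on the driver, Row.getSeq can be used. A small sketch; the SparkSession and the toy two-row dataframe below are assumptions mirroring the answer's data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.master("local[1]").appName("collect-to-seq").getOrCreate()
import spark.implicits._

// Two identical rows, as in the answer's example output.
val df = Seq((1, 2), (1, 2)).toDF("col1", "col2")

// One aggregated row whose fields are the per-column lists.
val row = df.select(collect_list("col1"), collect_list("col2")).first()

// Pull each field out as a typed Scala Seq.
val col1Values: Seq[Int] = row.getSeq[Int](0)
val col2Values: Seq[Int] = row.getSeq[Int](1)
```

This gives exactly the "columns into a list" result the question asked for, without touching the RDD API.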