mt88 - 1 year ago 230

Scala Question

I just used `StandardScaler` to normalize my features for an ML application. After selecting the scaled features, I want to convert them back to a dataframe of Doubles, but the length of my vectors is arbitrary. I know how to do it for a specific 3 features by using

`myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")`

but not for an arbitrary number of features. Is there an easy way to do this?

Example:

`val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")`

`val myColumnNames = List("f1", "f2", "f3")`

`// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]`

EDIT

I found out how to unpack a list of column names when creating the dataframe, but I am still having trouble converting a vector to the sequence needed to create the dataframe:

`finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)`


Answer Source

One possible approach is something like this:

```
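import org.apache.spark.sql.functions.udf
// mllib vectors; if the column comes from the Spark 2.x ml API,
// import org.apache.spark.ml.linalg.Vector instead
import org.apache.spark.mllib.linalg.Vector

// Get the size of the vector from the first row
// (assumes every row holds a vector of the same length)
val n = testDF.first.getAs[Vector](0).size

// Simple helper to convert a vector to array<double>
val vecToSeq = udf((v: Vector) => v.toArray)

// Prepare a list of columns to create, one per vector element;
// the generated names are zero-based: f0, f1, ...
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)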
```

If you know the list of column names up front, you can simplify this a little:

```
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
```
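
As for the attempt in the question's edit: as far as I know, `toDF` is only provided for RDDs of tuples or case classes (`Product` types), which would explain why mapping each row to a `Seq[Double]` fails. Below is a minimal sketch of an alternative that skips the UDF by building an RDD of `Row`s and passing an explicit schema. It assumes a Spark 1.x shell where `sc`, `sqlContext` and its implicits are predefined, and mllib vectors; on Spark 2.x, `spark.createDataFrame` and `org.apache.spark.ml.linalg` would be the likely substitutes. The names `schema` and `rows` are only illustrative.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Example data and column names taken from the question
val testDF = sc.parallelize(List(
  Vectors.dense(5D, 6D, 7D),
  Vectors.dense(8D, 9D, 10D),
  Vectors.dense(11D, 12D, 13D)
)).map(Tuple1(_)).toDF("scaledFeatures")

val myColumnNames = List("f1", "f2", "f3")

// One Double column per name; the number of names must match the vector length
val schema = StructType(myColumnNames.map(name => StructField(name, DoubleType, nullable = false)))

// Spread the values of each vector across the fields of a Row
val rows = testDF.rdd.map { case Row(v: Vector) => Row.fromSeq(v.toArray.toSeq) }

// Combine the rows with the explicit schema
val finalDF = sqlContext.createDataFrame(rows, schema)
```

Compared with the UDF version, this leaves the DataFrame API for one step, so it is mainly useful when you already have the column names and want full control over the resulting schema.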