Yahia Zakaria - 1 year ago
Scala Question

Apply QuantileDiscretizer to all columns in a DataFrame

Assume that I have a DataFrame with an id column and 100 other columns. I want to apply QuantileDiscretizer to each column and return a new DataFrame that pairs the id column with new columns holding the discretized values.
Example for two columns only:

Input


id | col1 | col2
---|------|-----
0  | 18.0 | 20.0
1  | 19.0 | 30.0
2  | 8.0  | 35.0
3  | 5.0  | 10.0
4  | 2.2  | 5.0


Output


id | col1Disc | col2Disc
---|----------|---------
0  | 2        | 2
1  | 2        | 3
2  | 1        | 3
3  | 2        | 1
4  | 0        | 0

Answer Source

You can use the Pipeline API, with one QuantileDiscretizer stage per column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.QuantileDiscretizer

val df = Seq(
  (0, 18.0, 20.0), (1, 19.0, 30.0), (2, 8.0, 35.0), (3, 5.0, 10.0), (4, 2.2, 5.0)
).toDF("id", "col1", "col2")


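// One QuantileDiscretizer stage per non-id column (numBuckets defaults to 2)
val pipeline = new Pipeline().setStages(for {
  c <- df.columns
  if c != "id"
} yield new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}Disc"))

val result = pipeline.fit(df).transform(df)
// Drop the original value columns, keeping id and the new *Disc columns
result.drop(df.columns.diff(Seq("id")): _*).show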


+---+--------+--------+
| id|col1Disc|col2Disc|
+---+--------+--------+
|  0|     1.0|     1.0|
|  1|     1.0|     1.0|
|  2|     1.0|     1.0|
|  3|     0.0|     0.0|
|  4|     0.0|     0.0|
+---+--------+--------+
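
Note that the output above only contains 0.0 and 1.0 because QuantileDiscretizer defaults to two buckets. If you want more buckets, as the expected output in the question suggests, you can set numBuckets on each stage. The following is a minimal sketch rather than a definitive implementation: the object name, the helper names (discretizeAll, discretizeAllMultiCol) and the choice of 4 buckets are my own assumptions, and the second helper assumes Spark 2.3 or later, where QuantileDiscretizer accepts several input columns at once. Bucket boundaries come from approximate quantiles, so the exact values on a tiny dataset may vary.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; the names are illustrative, not part of Spark.
object DiscretizeAllColumns {

  // Pipeline variant: one QuantileDiscretizer stage per non-id column.
  def discretizeAll(df: DataFrame, idCol: String, numBuckets: Int): DataFrame = {
    val valueCols = df.columns.filter(_ != idCol)
    val stages = valueCols.map { c =>
      new QuantileDiscretizer()
        .setInputCol(c)
        .setOutputCol(s"${c}Disc")
        .setNumBuckets(numBuckets)
    }
    // Fit all stages, then keep only the id and the discretized columns.
    new Pipeline().setStages(stages).fit(df).transform(df).drop(valueCols: _*)
  }

  // Multi-column variant (assumes Spark 2.3+): a single estimator for all columns.
  def discretizeAllMultiCol(df: DataFrame, idCol: String, numBuckets: Int): DataFrame = {
    val valueCols = df.columns.filter(_ != idCol)
    new QuantileDiscretizer()
      .setInputCols(valueCols)
      .setOutputCols(valueCols.map(c => s"${c}Disc"))
      .setNumBuckets(numBuckets)
      .fit(df)
      .transform(df)
      .drop(valueCols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("discretize-all-columns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (0, 18.0, 20.0), (1, 19.0, 30.0), (2, 8.0, 35.0), (3, 5.0, 10.0), (4, 2.2, 5.0)
    ).toDF("id", "col1", "col2")

    discretizeAll(df, "id", 4).show()
    discretizeAllMultiCol(df, "id", 4).show()

    spark.stop()
  }
}
```

With 100 columns, the multi-column variant avoids building 100 separate pipeline stages, while the Pipeline variant also works on Spark versions before 2.3.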