
tidyr::spread() with plain Scala and Spark (rows to columns)

My prototype (written in R with the packages dplyr and tidyr) is hitting a wall in terms of computational complexity - even on my powerful workstation. Therefore, I want to port the code to Spark using Scala.

I looked up all transformations, actions, functions (Spark SQL), and column operations (also Spark SQL) and found equivalents for everything except the tidyr::spread() function available in R.

df %>% tidyr::spread(key = COL_KEY, value = COL_VAL)
basically spreads a key-value pair across multiple columns. E.g. the table

COL_KEY | COL_VAL
-----------------
A | 1
B | 1
A | 2


will be transformed into

A | B
------------
1 | 0
0 | 1
2 | 0


In case there is no out-of-the-box solution available: could you point me in the right direction? Maybe a user-defined function?

I'm free to choose which Spark (and Scala) version to use (therefore I'd go for the latest, 2.0.0).

Thanks!

Answer
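
For reference, the example data above can be reproduced with a minimal sketch like the following (assuming a local SparkSession; the names spark and df are illustrative and reused in the snippets below):

import org.apache.spark.sql.SparkSession

// Assumed setup: a local SparkSession
val spark = SparkSession.builder()
  .appName("spread-example")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// The key-value table from the question
val df = Seq(("A", 1), ("B", 1), ("A", 2)).toDF("COL_KEY", "COL_VAL")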

Out-of-the-box but requires a shuffle:

import org.apache.spark.sql.functions.{first, monotonically_increasing_id}

df
  // A dummy unique key to perform grouping
  .withColumn("_id", monotonically_increasing_id)
  .groupBy("_id")
  .pivot("COL_KEY")
  .agg(first("COL_VAL"))
  .drop("_id")

// +----+----+
// |   A|   B|
// +----+----+
// |   1|null|
// |null|   1|
// |   2|null|
// +----+----+

You can optionally follow it with .na.fill(0).
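
For example (a sketch reusing the pipeline above; row order is not guaranteed):

df
  .withColumn("_id", monotonically_increasing_id)
  .groupBy("_id")
  .pivot("COL_KEY")
  .agg(first("COL_VAL"))
  .drop("_id")
  .na.fill(0)
  .show()

// +---+---+
// |  A|  B|
// +---+---+
// |  1|  0|
// |  0|  1|
// |  2|  0|
// +---+---+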

Manually, without a shuffle (apart from collecting the distinct keys):

// Required imports ($ and .as[String] come from spark.implicits._)
import spark.implicits._
import org.apache.spark.sql.functions.{lit, when}

// Find the distinct keys and collect them to the driver
val keys = df.select($"COL_KEY").as[String].distinct.collect.sorted

// Create one conditional column expression per key
val exprs = keys.map(key =>
  when($"COL_KEY" === key, $"COL_VAL").otherwise(lit(0)).alias(key)
)

df.select(exprs: _*)

// +---+---+
// |  A|  B|
// +---+---+
// |  1|  0|
// |  0|  1|
// |  2|  0|
// +---+---+
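
If this is needed in several places, the manual approach can be wrapped in a small helper. This is only a sketch: the name spreadKeys is illustrative, and collecting the distinct keys to the driver assumes their number is small:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper: one output column per distinct value of keyCol,
// holding valCol where the key matches and 0 otherwise.
def spreadKeys(df: DataFrame, keyCol: String, valCol: String): DataFrame = {
  // Collect the distinct keys to the driver (assumes a small key set)
  val keys = df.select(keyCol).distinct.collect.map(_.getString(0)).sorted

  // One conditional expression per key
  val exprs: Array[Column] = keys.map(key =>
    when(col(keyCol) === key, col(valCol)).otherwise(lit(0)).alias(key)
  )

  df.select(exprs: _*)
}

// Usage with the example data:
// spreadKeys(df, "COL_KEY", "COL_VAL").show()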