Rajarshi Bhadra - 10 months ago
Scala Question

Column Manipulations in Spark Scala

I am learning to work with Apache Spark (Scala) and am still figuring out how things work here.

I am trying to achieve a simple task:
1. Find the max of a column
2. Subtract each value of the column from this max and create a new column

The code I am using is:

import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(

val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values") ))

However, I am getting a lot of errors. I am more or less fluent in R, and had I been doing the same task my code would have been:

new_data <- training %>%
  mutate(Subs = max(Values) - Values)

Any help on this will be greatly appreciated.

Answer Source

Here is a solution using window functions. You'll need a HiveContext to use them:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")

training.select(
  $"values",
  (max($"values").over(Window.partitionBy()) - $"values").alias("subs")
).show

Which produces the expected output:

+------+----+
|values|subs|
+------+----+
|    10|  11|
|    13|   8|
|    14|   7|
|    21|   0|
+------+----+
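
If a HiveContext is not available, the same result can be computed without window functions by aggregating the max first and then subtracting it as a literal. A minimal sketch, assuming the same `training` DataFrame as above:

```scala
import org.apache.spark.sql.functions.{max, lit}

// Compute the max once; agg(max(...)) yields a one-row DataFrame,
// so first().getInt(0) pulls the value back to the driver.
val maxVal = training.agg(max($"values")).first().getInt(0)

// Subtract each value from the max as a new column.
training.withColumn("subs", lit(maxVal) - $"values").show
```

This trades the window function for one extra aggregation job, which is a reasonable fit when the grouping is over the whole DataFrame anyway.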