
Converting multiple different columns to a Map column with Spark DataFrame in Scala

I have a data frame with the columns:

user, address1, address2, address3, phone1, phone2
and so on.
I want to convert this data frame to:
user, address, phone where address = Map("address1" -> address1.value, "address2" -> address2.value, "address3" -> address3.value)
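
For illustration (made-up values), a row like

u1, a1, a2, a3, p1, p2

should become

u1, Map(address1 -> a1, address2 -> a2, address3 -> a3), Map(phone1 -> p1, phone2 -> p2)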


I was able to convert the columns to a map using:

val mapData = List("address1", "address2", "address3")
df.map(_.getValuesMap[Any](mapData))


but I am not sure how to add the result back to my data frame as a new column.

I am new to spark and scala and could really use some help here.

Answer

As far as I know, there is no direct way to do it. You can use a UDF like this:

import org.apache.spark.sql.functions.{udf, array, lit, col, struct}

val df = sc.parallelize(Seq(
  (1L, "addr1", "addr2", "addr3")
)).toDF("user", "address1", "address2", "address3")

// Zip column names with their values and drop entries whose value is null
val asMap = udf((keys: Seq[String], values: Seq[String]) =>
  keys.zip(values).filter{
    case (k, null) => false
    case _ => true
  }.toMap)

// mapData is the list of column names defined in the question
val keys = array(mapData.map(lit): _*)
val values = array(mapData.map(col): _*)

val dfWithMap = df.withColumn("address", asMap(keys, values))
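
To sanity-check the result, you can pull a single value back out of the map column with getItem (just a sketch against the toy frame above):

dfWithMap.printSchema()
dfWithMap.select(col("user"), col("address").getItem("address2")).show()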

Another option, which doesn't require UDFs, is to use a struct field instead of a map:

val dfWithStruct = df.withColumn("address", struct(mapData.map(col): _*))
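
Struct fields are then addressed with a dot path instead of a map lookup, for example:

dfWithStruct.select(col("user"), col("address.address1")).show()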

The biggest advantage of the struct approach is that it can easily handle values of different types, whereas a map column requires all values to share a single type.