Rajni Kant Sharma - 4 months ago
Scala Question

concatenate all struct fields nested to array in spark

My schema is as follows. I need to concatenate the #VALUE, @DescriptionCode, and @LanguageCode fields, which are nested inside an array of structs.

root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- description: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #VALUE: string (nullable = true)
| | |-- @DescriptionCode: string (nullable = true)
| | |-- @LanguageCode: string (nullable = true)


I have tried several approaches, but nothing has worked for me.
I need the following schema:

root
|-- partnumber: string (nullable = true)
|-- brandlabel: string (nullable = true)
|-- availabledate: string (nullable = true)
|-- descriptions: array (nullable = true)
| |-- element: string (containsNull = true)

Answer

I believe you need to create a user-defined function (UDF):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Each element of the array arrives in the UDF as a Row;
// pull out the three struct fields and concatenate them.
val func: (Seq[Row]) => Seq[String] = {
  _.map(
    element =>
      element.getAs[String]("#VALUE") +
      element.getAs[String]("@DescriptionCode") +
      element.getAs[String]("@LanguageCode")
  )
}

val myUDF = udf(func)

// Replace the array of structs with an array of concatenated strings.
df.withColumn("descriptions", myUDF(col("description"))).drop(col("description"))
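One caveat: getAs returns null for a missing field, so the concatenation above can produce strings containing the literal text "null". A null-safe way to join the fields is sketched below in plain Scala (joinFields is a hypothetical helper; inside the UDF each value would come from Option(element.getAs[String](...))):

```scala
// Join only the fields that are actually present, skipping nulls/None,
// with a separator between them.
def joinFields(fields: Seq[Option[String]], sep: String = " "): String =
  fields.flatten.mkString(sep)

// e.g. inside the UDF body:
//   joinFields(Seq(
//     Option(element.getAs[String]("#VALUE")),
//     Option(element.getAs[String]("@DescriptionCode")),
//     Option(element.getAs[String]("@LanguageCode"))))
```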

For more information about UDFs, you can read this article.
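For what it's worth, on Spark 2.4+ you can also do this without a UDF, using the built-in transform higher-order function. A minimal sketch, assuming a DataFrame with the question's schema (the sample row here is made up; note that concat yields null if any field is null):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// A one-row stand-in for the question's DataFrame (hypothetical data).
val df = spark.sql(
  """select 'ABC-1' as partnumber,
       array(named_struct('#VALUE', 'Brake pad',
                          '@DescriptionCode', 'DES',
                          '@LanguageCode', 'EN')) as description""")

// transform evaluates a SQL lambda over each array element, so the
// structs never have to be deserialized into Rows as with a UDF.
val result = df.withColumn(
  "descriptions",
  expr("transform(description, x -> concat(x['#VALUE'], x['@DescriptionCode'], x['@LanguageCode']))")
).drop("description")
```

Since the lambda is a plain SQL expression, Catalyst can optimize it, which is generally preferable to an opaque UDF when you have the choice.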