Feynman27 - 2 months ago
Scala Question

Renaming nested elements in Scala Spark Dataframe

I have a Spark Scala dataframe with a nested structure:

|-- _History: struct (nullable = true)
|    |-- Article: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- Id: string (nullable = true)
|    |    |    |-- Timestamp: long (nullable = true)
|    |-- Channel: struct (nullable = true)
|    |    |-- <font><font>Cultura pop</font></font>: array (nullable = true)
|    |    |    |-- element: long (containsNull = true)
|    |    |-- <font><font>Deportes</font></font>: array (nullable = true)
|    |    |    |-- element: long (containsNull = true)


I'm trying to rename the nested elements (e.g. <font><font>Deportes</font></font> to Deportes). Is there a way to do this using a UDF or something similar?

I've tried the following, which doesn't work:

var filterDF2 = filterDF
.withColumnRenamed("_History.Channel.<font><font>Deportes</font></font>", "_History.Channel.Deportes")

Answer

The simplest approach is to cast the column to a type described by a schema string with the desired field names (or by an equivalent StructType definition):

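// Requires import spark.implicits._ for the $"..." syntax.
// Struct fields are matched by position, so the names given here replace the originals.
val schema = """struct<
  Article: array<struct<Id:string,Timestamp:bigint>>,
  Channel: struct<Cultura: array<bigint>, Deportes: array<bigint>>>"""
df.withColumn("_History", $"_History".cast(schema))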
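If you prefer the programmatic form over the schema string, something like the following should be equivalent. This is only a sketch: the name historyType is made up for illustration, and the field types follow the schema printed in the question (both Channel fields are arrays of long).

```scala
import org.apache.spark.sql.types._

// Target type for _History, written out with the cleaned field names.
// historyType is a hypothetical name; types are taken from the printed schema.
val historyType = StructType(Seq(
  StructField("Article", ArrayType(StructType(Seq(
    StructField("Id", StringType),
    StructField("Timestamp", LongType))))),
  StructField("Channel", StructType(Seq(
    StructField("Cultura", ArrayType(LongType)),
    StructField("Deportes", ArrayType(LongType)))))))

// Usage, assuming df is the dataframe from the question and
// spark.implicits._ is in scope:
// df.withColumn("_History", $"_History".cast(historyType))
```

Column.cast accepts either a DataType or a type string, so the two forms can be used interchangeably.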

You could also model this with case classes:

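import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

// Field types follow the printed schema: both Channel fields are arrays of long.
case class ChannelRecord(Cultura: Option[Seq[Long]], Deportes: Option[Seq[Long]])

// Rebuild the Channel struct by position; Option(...) turns a null array into None.
val rename = udf((row: Row) =>
  ChannelRecord(Option(row.getSeq[Long](0)), Option(row.getSeq[Long](1))))

df.withColumn("_History",
  struct($"_History.Article", rename($"_History.Channel").alias("Channel")))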
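If there are many channels and you don't want to type out every name, the cast-based approach can probably be generalized by deriving the target type from the existing one. The sketch below is not part of the approaches above; cleanName and cleanType are hypothetical helpers, and the regex assumes the unwanted markup is limited to <font> and </font> tags.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: drop <font> / </font> tags from a field name.
def cleanName(name: String): String =
  name.replaceAll("</?font>", "").trim

// Hypothetical helper: walk a data type and clean every nested struct field name.
// Nullability and element types are left untouched so the cast stays valid.
def cleanType(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map(f =>
      f.copy(name = cleanName(f.name), dataType = cleanType(f.dataType))))
  case a: ArrayType =>
    a.copy(elementType = cleanType(a.elementType))
  case m: MapType =>
    m.copy(valueType = cleanType(m.valueType))
  case other => other
}

// Usage, assuming df is the dataframe from the question and
// spark.implicits._ is in scope:
// val target = cleanType(df.schema("_History").dataType)
// df.withColumn("_History", $"_History".cast(target))
```

Note that this keeps whatever is left after stripping the tags, so "Cultura pop" would keep its space. If that is a problem for your output format, extend cleanName accordingly.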