
Rename nested field in Scala Spark 2.0 Dataset

I am trying to rename a nested field within a Dataset of case classes using Spark 2.0. In the example below, I am trying to rename "element" to "address", keeping its position in the data structure:

df.printSchema
//Current Output:
root
|-- companyAddresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- addressLine: string (nullable = true)
| | |-- addressCity: string (nullable = true)
| | |-- addressCountry: string (nullable = true)
| | |-- url: string (nullable = true)

//Desired Output:
root
|-- companyAddresses: array (nullable = true)
| |-- address: struct (containsNull = true)
| | |-- addressLine: string (nullable = true)
| | |-- addressCity: string (nullable = true)
| | |-- addressCountry: string (nullable = true)
| | |-- url: string (nullable = true)


For reference, the following do not work:

df.withColumnRenamed("companyAddresses.element","companyAddresses.address")
df.withColumnRenamed("companyAddresses.element","address")

Answer

What you're asking for here is not possible. companyAddresses is an array, and element is not a column at all. It is just the label printSchema uses for the schema of the array's members. It cannot be selected, and it cannot be renamed.

You can only rename the parent container:

df.withColumnRenamed("companyAddresses", "foo")

or the names of the individual fields by modifying the schema. In simple cases it is also possible to use struct and select:

df.select(struct($"foo".as("bar"), $"bar".as("foo")))

but obviously this is not applicable here, since the structs are nested inside an array.
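To illustrate the "modify the schema" route: while the element label is fixed, the names of the struct's fields inside the array can be changed by casting the column to a new ArrayType. A minimal sketch, assuming a Spark 2.0 session and the df from the question; the target field name line is a hypothetical example:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Build the target element type with the renamed field
// ("addressLine" -> "line" is just an illustrative choice).
val addressType = StructType(Seq(
  StructField("line", StringType),
  StructField("addressCity", StringType),
  StructField("addressCountry", StringType),
  StructField("url", StringType)
))

// Casting to an ArrayType with the new struct relabels the fields;
// the data itself is matched positionally, not by name.
val renamed = df.withColumn(
  "companyAddresses",
  col("companyAddresses").cast(ArrayType(addressType))
)

renamed.printSchema
```

Note that even after the cast, printSchema still shows the array's members under the element label; only the struct fields and the column itself can carry custom names.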