echen echen - 9 months ago 143
Scala Question

Automatically and Elegantly flatten DataFrame in Spark SQL


Is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType?


For example

If my schema is:


How do I select it into a flattened tabular form without resorting to manually running a select(...) that lists every column by hand ("foo.baz", "x", "y", "z", and so on)?

In other words, how do I obtain the result of the above code programmatically given just a StructType and a DataFrame?

Answer Source

The short answer is, there's no "accepted" way to do this, but you can do it very elegantly with a recursive function that generates your select(...) statement by walking through the DataFrame.schema.

The recursive function should return an Array[Column]. Every time the function hits a StructType, it would call itself and append the returned Array[Column] to its own Array[Column].

Something like:

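import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    // Build the dotted path to this field, e.g. "foo.baz" for baz nested under foo
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)

    f.dataType match {
      // Nested struct: recurse, carrying the path so far as the prefix
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: emit a single column reference
      case _ => Array(col(colName))
    }
  })
}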

You would then use it like this (where df is your DataFrame):

df.select(flattenSchema(df.schema): _*)
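To see the whole thing end to end, here is a minimal sketch that builds a small nested DataFrame and flattens it. The object name FlattenExample, the local SparkSession settings, the sample row, and the field name "bar" are all assumptions made for illustration; only the flattenSchema logic comes from the answer above.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

object FlattenExample {
  // Same recursive walk as in the answer: one Column per leaf field
  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] =
    schema.fields.flatMap { f =>
      val colName = if (prefix == null) f.name else prefix + "." + f.name
      f.dataType match {
        case st: StructType => flattenSchema(st, colName)
        case _              => Array(col(colName))
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, just for the demo
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: pack two columns into a struct named foo
    val nested = Seq((1, 2, "a", "b", "c"))
      .toDF("bar", "baz", "x", "y", "z")
      .select(struct($"bar", $"baz").as("foo"), $"x", $"y", $"z")

    nested.printSchema() // foo is a struct holding bar and baz

    // Expand the Array[Column] into select's varargs
    val flat = nested.select(flattenSchema(nested.schema): _*)
    flat.printSchema()   // nested fields should now be top-level columns

    spark.stop()
  }
}
```

One thing to watch: col("foo.baz") produces an output column named after the leaf only (baz), so two structs with identically named fields would end up with duplicate column names. If that matters for your data, you could alias each column, for example col(colName).as(colName.replace(".", "_")).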