Havnar Havnar - 1 month ago 20
Scala Question

Spark merge dataframe with mismatching schemas without extra disk IO

I would like to merge 2 dataframes with (potentially) mismatching schemas

org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> A.unionAll(B)


would result in :

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;


I would like to do this from within Spark.
However, the Spark docs only propose to write the whole 2 dataframes out to a directory and read them back in using
spark.read.option("mergeSchema", "true")
.

link to docs

So a union doesn't help me out, and neither does the documentation. I would like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented info, or is it not possible (yet)?

Answer

You can append a null column to frame B and after union 2 frames:

import org.apache.spark.sql.functions._
val missingFields = A.schema.toSet.diff(B.schema.toSet)
var C: DataFrame = null
for (field <- missingFields){ 
   C = A.withColumn(field.name, expr("null")); 
} 
A.unionAll(C)
Comments