Havnar Havnar - 5 months ago 103
Scala Question

Spark merge dataframe with mismatching schemas without extra disk IO

I would like to merge 2 dataframes with (potentially) mismatching schemas

org.apache.spark.sql.DataFrame = [name: string, age: int, height: int]
org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> A.unionAll(B)

would result in :

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the left table has 2 columns and the right has 3;

I would like to do this from within Spark.
However, the Spark docs only propose to write the whole 2 dataframes out to a directory and read them back in using
spark.read.option("mergeSchema", "true")

link to docs

So a union doesn't help me out, and neither does the documentation. I would like to keep this extra I/O out of my job if at all possible. Am I missing some undocumented info, or is it not possible (yet)?


You can append a null column to frame B and after union 2 frames:

import org.apache.spark.sql.functions._
val missingFields = A.schema.toSet.diff(B.schema.toSet)
var C: DataFrame = null
for (field <- missingFields){ 
   C = A.withColumn(field.name, expr("null"));