DilTeam DilTeam - 4 months ago 22
Scala Question

Merging Dataframes in Spark

I've 2 Dataframes, say A & B. I would like to join them on a key column & create another Dataframe. When the keys match in A & B, I need to get the row from B, not from A.

For example:

DataFrame A:

Employee1, salary100
Employee2, salary50
Employee3, salary200


DataFrame B

Employee1, salary150
Employee2, salary100
Employee4, salary300


The resulting DataFrame should be:

DataFrame C:

Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300


How can I do this in Spark & Scala?

Answer

Try:

dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")

sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee), 
       coalesce(dfB.salary, dfA.salary) FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")

or

sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
  CASE dfB.employee IS NOT NULL THEN dfB.salary
  CASE dfB.employee IS NOT NULL THEN dfA.salary
  END FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")