
Minus logic implementation not working with Spark/Scala

Minus logic in Hive:

The below Hive query returns only the records that are available in the left-side table (Full_Table ft) but not in both tables, i.e. the rows of Full_Table that have no match in Stage_Table:

SELECT ft.*
FROM Full_Table ft
LEFT JOIN Stage_Table stg
  ON ft.primary_key1 = stg.primary_key1
 AND ft.primary_key2 = stg.primary_key2
WHERE stg.primary_key1 IS NULL
  AND stg.primary_key2 IS NULL


I tried to implement the same in Spark/Scala using the following method (a Seq of join columns, so that it works for both a single primary key and a composite key). However, the joined result set does not contain the key columns from the right table, so I am not able to apply the stg.primary_key2 IS NULL condition on the joined result.

ft.join(stg, usingColumns, "left_outer") // usingColumns is a Seq, to support a composite-key join
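
A minimal sketch that reproduces this (the SparkSession setup, the key columns primary_key1/primary_key2, and the sample rows are my assumptions): after a join on a Seq of column names, only one copy of the key columns survives, so there is no stg key column left to test for NULL.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("minus-logic").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed sample data with a composite key (primary_key1, primary_key2)
val ft  = Seq((1, "a", "full-1"), (2, "b", "full-2")).toDF("primary_key1", "primary_key2", "ft_value")
val stg = Seq((1, "a", "stg-1")).toDF("primary_key1", "primary_key2", "stg_value")

val usingColumns = Seq("primary_key1", "primary_key2")

// The join keeps a single copy of the key columns, so there is no separate
// stg.primary_key2 column on which an IS NULL check could be applied.
val joined = ft.join(stg, usingColumns, "left_outer")
joined.printSchema() // primary_key1, primary_key2, ft_value, stg_value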


Please suggest how to implement this minus logic in Spark/Scala.

Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/

Answer

If your tables have the same columns, you can use the except method from Dataset:

fullTable.except(stageTable)
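
For example (the sample schemas and rows below are assumed, not from the question), except keeps the rows of fullTable that do not appear in stageTable, comparing whole rows, so it handles composite keys naturally:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("except-example").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed tables with identical schemas
val fullTable  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("primary_key1", "primary_key2")
val stageTable = Seq((1, "a")).toDF("primary_key1", "primary_key2")

// Keeps (2, "b") and (3, "c"): rows of fullTable that are absent from stageTable
fullTable.except(stageTable).show()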

If they don't, but you are interested only in a subset of columns that exists in both tables, you can first select those columns using the select transformation and then use except:

val fullTableSelectedColumns  = fullTable.select("c1", "c2", "c3")
val stageTableSelectedColumns = stageTable.select("c1", "c2", "c3")

fullTableSelectedColumns.except(stageTableSelectedColumns)

Otherwise, you can use the join and filter transformations: left-join on the key and keep only the rows where the right-side key is null, i.e. the rows with no match in stageTable:

fullTable
    .join(stageTable, fullTable("primary_key") === stageTable("primary_key"), "left")
    .filter(stageTable("primary_key").isNull) // no match in stageTable => row exists only in fullTable
    .select(fullTable("*"))
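
For the composite key from the question, the same pattern works with a multi-column join condition (the column names primary_key1/primary_key2 are taken from the question; fullTable and stageTable are assumed to be DataFrames containing them). Spark also offers a "left_anti" join type that expresses the same minus logic in one step:

// Assumed composite key: primary_key1 + primary_key2
val joinCondition =
  fullTable("primary_key1") === stageTable("primary_key1") &&
  fullTable("primary_key2") === stageTable("primary_key2")

// Rows of fullTable that have no matching key in stageTable
fullTable
  .join(stageTable, joinCondition, "left")
  .filter(stageTable("primary_key1").isNull && stageTable("primary_key2").isNull)
  .select(fullTable("*"))

// Equivalent anti-join form
fullTable.join(stageTable, joinCondition, "left_anti")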