Avijit Avijit - 5 months ago 80
Scala Question

Conditional Join in Spark DataFrame

I am trying to join two

with condition.

I have two dataframe A and B.

A contains id,m_cd and c_cd columns
B contains m_cd,c_cd and record columns

Conditions are -

  • If m_cd is null then join c_cd of A with B

  • If m_cd is not null then join m_cd of A with B

we can use "
" and "
()" in
() method of dataframe, so is there any way to do this for the case of join in dataframe.

I have already done this using
.But wanted to know if there any other option available.


You can use the "when" / "otherwise" in the join condition:

case class Foo(m_cd: Option[Int], c_cd: Option[Int])
val dfA = spark.createDataset(Array(
    Foo(Some(1), Some(2)),
    Foo(Some(2), Some(3)),
    Foo(None: Option[Int], Some(4))

val dfB = spark.createDataset(Array(
    Foo(Some(1), Some(5)),
    Foo(Some(2), Some(6)),
    Foo(Some(10), Some(4))

val joinCondition = when($"a.m_cd".isNull, $"a.c_cd"===$"b.c_cd")

dfA.as('a).join(dfB.as('b), joinCondition).show

It might still be more readable to use the union, though.