d80tb7 - 18 days ago
Scala Question

How to join Datasets on multiple columns?

Given two Spark Datasets, A and B, I can do a join on a single column as follows:

a.joinWith(b, $"a.col" === $"b.col", "left")


My question is whether you can do a join using multiple columns; essentially, the equivalent of the following DataFrame API code:

a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
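For reference, here is a minimal self-contained sketch of that DataFrame join, assuming a local SparkSession; the column names `col`/`col2` and the sample values are placeholders, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

object MultiColumnJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-column-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data matching the question's column names
    val a = Seq((1, "x", 1.0), (2, "y", 2.0)).toDF("col", "col2", "v")
    val b = Seq((1, "x", 9.0)).toDF("col", "col2", "w")

    // Combine equality predicates with && to join on multiple columns
    a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left").show()

    spark.stop()
  }
}
```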

Answer

You can do it exactly the same way as with a DataFrame:

// toDS requires the session implicits in scope, e.g. import spark.implicits._
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS

xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+

In Spark < 2.0.0 you can use something like this:

xs.as("xs").joinWith(
  ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
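Note that joinWith is typed: it returns a Dataset of pairs, and with a "left" join the right element of the tuple is null for unmatched rows (as the null in the output above shows). A hedged sketch of consuming the result safely, continuing from the xs/ys example; the Option wrapping is my addition, not part of the original answer:

```scala
// joinWith returns Dataset[(T, U)]; with "left", the right element is
// null when no match was found
val joined = xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left")

// Wrap the possibly-null right side in Option before further typed processing
val safe = joined.map { case (l, r) => (l, Option(r)) }
safe.show(false)
```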