Jim Hendricks - 4 months ago
Scala Question

How do I combine two columns in a Spark SchemaRDD containing WrappedArrays into a 3rd column with the combined WrappedArray?

I have a DataFrame with two columns ("features1" and "features2") containing WrappedArrays.

I need to combine the two columns into a third column containing the merged contents of the first two columns as a WrappedArray.

How do I do this?

I'm using Scala, not PySpark.


Surprisingly, I didn't find any way to do this other than a UDF:

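import org.apache.spark.sql.functions.udf
// Generic concatenation; the UDF below fixes the element type to Int.
def catArray[A](a: Seq[A], b: Seq[A]): Seq[A] = a ++ b
val catArrayUdf = udf { catArray[Int] _ }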


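scala> sc.parallelize(List((Seq(1,2),Seq(3,4)))).toDF("A","B").withColumn("cat", catArrayUdf($"A",$"B")).show(false)
+------+------+------------+
|A     |B     |cat         |
+------+------+------------+
|[1, 2]|[3, 4]|[1, 2, 3, 4]|
+------+------+------------+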

Maybe there is a shorter way to define the UDF based on ++ though.
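
One possible shorter form, as a rough sketch rather than a definitive answer: the UDF can be built directly from a function value over ++, without the generic helper method. The sketch assumes Spark 2.x or later (for SparkSession), integer array elements, and the column names from the question; the object name, the sample values and the output column "features" are made up for illustration. It also shows the built-in concat function, which as far as I know accepts array columns from Spark 2.4 on, so on those versions no UDF may be needed at all.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, udf}

object CombineArrayColumns {
  // Plain Scala function value over ++ (hypothetical name, Int elements assumed).
  val cat: (Seq[Int], Seq[Int]) => Seq[Int] = _ ++ _

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("combine-arrays")
      .getOrCreate()
    import spark.implicits._

    // Column names follow the question; the sample values are made up.
    val df = Seq((Seq(1, 2), Seq(3, 4))).toDF("features1", "features2")

    // Wrap the function value as a UDF; no separate generic method is needed.
    val catUdf = udf(cat)
    val viaUdf = df.withColumn("features", catUdf(col("features1"), col("features2")))

    // Assumes Spark 2.4 or later, where the built-in concat also handles arrays.
    val viaConcat = df.withColumn("features", concat(col("features1"), col("features2")))

    viaUdf.show(false)
    viaConcat.show(false)
    spark.stop()
  }
}
```

For a different element type, only the type annotation on cat has to change, for example Seq[Double] for arrays of doubles.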