shakedzy shakedzy - 3 months ago 15
Scala Question

Does Spark do UnionAll in parallel?

I got 10

DataFrame
s with the same schema which I'd like to combine into one
DataFrame
. Each
DataFrame
is constructed using a
sqlContext.sql("select ... from ...").cahce
, which means that technically, the
DataFrame
s are not really calculated until it's time to use them.

So, if I run:

val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...


will Spark calculate all these
DataFrame
s in parallel or one by one (due to the dot operator)?

And also, while we're here - is there a more elegant way to preform a
unionAll
on several
DataFrame
s than the one I listed above?

Answer

unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.

In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.

Comments