shakedzy shakedzy - 8 months ago 42
Scala Question

Does Spark do UnionAll in parallel?

I got 10

s with the same schema which I'd like to combine into one
. Each
is constructed using a
sqlContext.sql("select ... from ...").cahce
, which means that technically, the
s are not really calculated until it's time to use them.

So, if I run:

val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...

will Spark calculate all these
s in parallel or one by one (due to the dot operator)?

And also, while we're here - is there a more elegant way to preform a
on several
s than the one I listed above?


unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.

In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.