FranktheTank - 1 month ago
Python Question

Spark coalesce vs collect, which one is faster?

I am using PySpark to process 50 GB of data on AWS EMR with ~15 m4.large cores.

Each row of the data contains some information at a specific time of day. I am using the following for loop to extract and aggregate information for every hour. Finally I coalesce(1) the data, as I want my result saved in one CSV file.

# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)

As far as I know, I have to call coalesce(1) before writing to force the DataFrame to be saved as a single CSV file (approx. 1 MB) instead of 100+ files.
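The write step looks roughly like this (the output path and options here are placeholders rather than my exact values):

daily_df.coalesce(1).write.csv("s3://my-bucket/daily_result", header=True)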


It seems it took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method instead, like this:

daily_df_pandas = daily_df.collect()


Both coalesce(1) and collect are pretty bad in general (coalesce(1) pushes the final stage through a single task, and collect pulls everything to the driver), but with an expected output size of around 1 MB it doesn't really matter. It simply shouldn't be the bottleneck here.

One simple improvement here is to drop the loop -> filter -> union pattern and perform a single aggregation:

df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
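For reference, a self-contained sketch of the whole job with that change (the paths, app name, and column aliases are placeholders, and it assumes Time is already a timestamp column):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_aggregation").getOrCreate()

# read the raw data; the header/schema options here are assumptions
df = spark.read.csv("s3://my-bucket/input/", header=True, inferSchema=True)

# one aggregation over the whole day instead of 24 filter/groupby/union rounds
daily_df = (df
    .groupby(F.hour("Time").alias("hour"), F.col("Animal"))
    .agg(F.mean("weights").alias("mean_weight"),
         F.sum("is_male").alias("male_count")))

# coalesce(1) is fine here since the aggregated result is only about 1 MB
daily_df.coalesce(1).write.csv("s3://my-bucket/daily_result/", header=True)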

That said, most likely the issue here is configuration (a good place to start is adjusting spark.sql.shuffle.partitions, if you don't do that already).
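For example, it can be lowered either in code or at submit time (the value 50 and the script name my_job.py below are illustrative placeholders, not tuned recommendations):

# the default of 200 shuffle partitions is far more than this ~1 MB result needs
spark.conf.set("spark.sql.shuffle.partitions", "50")

# or at submit time:
# spark-submit --conf spark.sql.shuffle.partitions=50 my_job.py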