FranktheTank - 2 months ago
Python Question

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50 GB of data on AWS EMR with ~15 m4.large core nodes.

Each row of the data contains some information at a specific time of day. I am using the following for loop to extract and aggregate information for every hour. Finally, I union the data, as I want the result saved in one CSV file.

# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)


To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to be saved as one CSV file (approx. 1 MB) instead of 100+ files:

daily_df.coalesce(1).write.csv("some_local.csv")


It took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:

# collect() returns a list of Row objects, so wrap it in a pandas DataFrame before writing
import pandas as pd
daily_df_pandas = pd.DataFrame(daily_df.collect(), columns=daily_df.columns)
daily_df_pandas.to_csv("some_local.csv")

Answer

Both coalesce(1) and collect are pretty bad in general: coalesce(1) funnels the final stage through a single task, and collect pulls everything back to the driver. But with an expected output size of around 1 MB it doesn't really matter; it simply shouldn't be the bottleneck here.
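If you do go the driver-side route, a minimal sketch would use the toPandas() method rather than raw collect(), since it gives you a pandas DataFrame directly:

# Fine only because the aggregated result is ~1 MB; toPandas() materializes it all in driver memory.
daily_df.toPandas().to_csv("some_local.csv", index=False)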

One simple improvement here is to drop the loop -> filter -> union pattern and perform a single aggregation:

df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
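A fuller sketch of that idea, assuming Time is a timestamp column and the column names from the question (the output aliases are only illustrative):

from pyspark.sql.functions import hour, col, mean, sum as sum_

daily_df = (df
    .groupBy(hour(col("Time")).alias("hour"), col("Animal"))
    .agg(mean("weights").alias("mean_weight"),
         sum_("is_male").alias("male_count")))

# one pass over the data instead of 24 filter/union rounds, then one small output file
daily_df.coalesce(1).write.csv("some_local.csv", header=True)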

Most likely, though, the real issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions if you haven't already).
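A minimal sketch of that tweak (the value 24 is only an assumption; pick something on the order of your total core count rather than the default of 200):

# fewer shuffle partitions => fewer tiny tasks for a ~1 MB aggregate
spark.conf.set("spark.sql.shuffle.partitions", "24")

The same setting can also be passed at submit time with --conf spark.sql.shuffle.partitions=24.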