I am using

from pyspark.sql.functions import hour, mean, sum

# daily_df is an empty PySpark DataFrame with the target schema
for h in range(24):
    # keep only the rows whose Time falls in hour h, then aggregate per animal
    hourly_df = df.filter(hour("Time") == h).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
daily_rows = daily_df.collect()  # collect() returns a list of Rows, not a pandas DataFrame
collect is pretty bad in general, but with an expected output size of around 1 MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement here is to drop the loop and the union and perform a single aggregation:

df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
but most likely the issue here is configuration (a good place to start could be adjusting
spark.sql.shuffle.partitions, if you don't do that already).
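If you want to experiment with that, something along these lines should work, assuming spark is your active SparkSession (the value 8 is just an illustrative guess; tune it to your cluster and data size):

# lower the number of shuffle partitions for a small aggregation like this one
spark.conf.set("spark.sql.shuffle.partitions", 8)
# or set it once when the session is created:
# spark = SparkSession.builder.config("spark.sql.shuffle.partitions", 8).getOrCreate()

The default of 200 shuffle partitions is far too many for an output of about 1 MB; most tasks end up doing almost no work, and the scheduling overhead dominates.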