JackR - 8 days ago
Python Question

Spark query: avg with count

I want to write a query to find the busiest day of the week on average.

I have a day field (Monday, Tuesday, ...) in my DataFrame.

My query is the following:

from pyspark.sql import functions

mostBusyDayDF = (logDF.groupBy("day_week")
    .agg(functions.count("day_week").alias("cntDay"))
    .groupBy("cntDay").agg(functions.avg("cntDay").alias("avgDay"))
    .orderBy(functions.desc("avgDay")).limit(1))
mostBusyDayDF.show()


The result is a single average over the summed counts, but I want the average for each day.

For example:

My input:

day_week
Monday
Tuesday
Monday
Monday


I want as output:

day_week  count()  avg()
Monday    1        1/(1+2+4+2)
Tuesday   2        2/(1+2+4+2)
Monday    4        4/(1+2+4+2)
Monday    2        2/(1+2+4+2)


How can I solve my problem?

Thanks a lot.

Answer

You'll need the total row count first, and there are many ways to get it, e.g.:

sum_count = logDF.count() 
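The same total can also be read off an aggregation row, e.g. (a sketch; note that functions.count("day_week") skips null values, whereas logDF.count() counts every row):

sum_count = logDF.agg(functions.count("day_week").alias("total")).first()["total"]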

Once you have this total, you can perform your group by and divide by it, e.g.:

mostBusyDayDF = (logDF.groupBy("day_week")
    .agg(functions.count("day_week").alias("cntDay"))
    .withColumn("avgDay", functions.col("cntDay") / functions.lit(sum_count)))
mostBusyDayDF.show()

Note: lit creates a Column of a literal value.
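For completeness, here is a minimal self-contained sketch of the whole approach, using the question's sample input (the local SparkSession setup and the toy rows are illustrative, not part of the original code):

from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.master("local[*]").appName("busy-day").getOrCreate()

# Toy data matching the question's input column
logDF = spark.createDataFrame(
    [("Monday",), ("Tuesday",), ("Monday",), ("Monday",)],
    ["day_week"])

sum_count = logDF.count()  # total number of rows: 4

mostBusyDayDF = (logDF.groupBy("day_week")
    .agg(functions.count("day_week").alias("cntDay"))
    .withColumn("avgDay", functions.col("cntDay") / functions.lit(sum_count))
    .orderBy(functions.desc("avgDay")))
mostBusyDayDF.show()

# +--------+------+------+
# |day_week|cntDay|avgDay|
# +--------+------+------+
# |  Monday|     3|  0.75|
# | Tuesday|     1|  0.25|
# +--------+------+------+

The busiest day is then just .limit(1) on this ordered result; dividing by a driver-side scalar via lit avoids a join against the aggregated total.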
