JackR - 3 months ago
Python Question

Spark request avg with count

I want to write a query to find the busiest day of the week on average.

I have a day field (Monday, Tuesday, ...) in my DataFrame.

My query is the following:

mostBusyDayDF = logDF.groupBy("day_week")

The result is an average over the sum of all counts, but I want the average for each day.

For example:

My input (shown as an image in the original post):

I want as output:

day_week   count()   avg()
Monday     1         1/(1+2+4+2)
Tuesday    2         2/(1+2+4+2)
Monday     4         4/(1+2+4+2)
Monday     2         2/(1+2+4+2)
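The arithmetic behind the desired output can be checked in plain Python: each row's avg() is just that row's count divided by the grand total (the day names and counts below are the hypothetical example data from the table):

```python
# Hypothetical example rows: (day_week, count)
rows = [("Monday", 1), ("Tuesday", 2), ("Monday", 4), ("Monday", 2)]

# Grand total of the count column: 1 + 2 + 4 + 2 = 9
total = sum(c for _, c in rows)

# Each row keeps its own count; avg() is count / total
with_avg = [(day, c, c / total) for day, c in rows]
```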

How can I solve my problem?

Thanks a lot.


You'll need to sum the count column first, and there are many ways to do it, e.g.:

sum_count = logDF.count()  # total number of rows in the log

Once you have this sum, you can perform your groupBy and divide each day's count by it, e.g.:

from pyspark.sql.functions import col, lit

mostBusyDayDF = (logDF.groupBy("day_week").count()
                 .withColumn("avgDay", col("count") / lit(sum_count)))

Note: lit creates a Column from a literal value, and col references an existing column by name.
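To sanity-check the numbers, here is the same pipeline mirrored in plain Python: group the raw log rows per day, divide by the total, and take the day with the highest share (the log data below is hypothetical):

```python
from collections import Counter

# Hypothetical raw log: one row per event, day_week field only
logs = ["Monday", "Tuesday", "Monday", "Wednesday", "Monday"]

per_day = Counter(logs)                               # groupBy("day_week").count()
total = sum(per_day.values())                         # logDF.count()
avg_day = {d: c / total for d, c in per_day.items()}  # count / lit(sum_count)
busiest = max(avg_day, key=avg_day.get)               # day with the highest share
```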