
How to do multiple column counts in Spark/Scala efficiently?

I have a DataFrame for which I need, for each of several columns, the count of rows where that column's value is greater than 0.

Right now I am doing:

import org.apache.spark.storage.StorageLevel._

df.persist(MEMORY_AND_DISK)

// each count below is a separate action, so each one triggers its own Spark job
val totalCustomers = df.count
val totalpurchaseCount = df.filter("totalpurchase > 0").count
val totalRentsCount = df.filter("totalRent > 0").count
val totalTVCount = df.filter("tvCount > 0").count
val totalMovieCount = df.filter("movieCount > 0").count
val totalKidsCount = df.filter("kidsCount > 0").count
val totalNonKidsCount = df.filter("adultsCount > 0").count


Is there any way to do this more efficiently?

Answer

Define a list of columns:

val cols = Seq("totalpurchase", "totalRent", ...)

Import required functions:

import org.apache.spark.sql.functions.{count, col, when}

Select the aggregates. count(when(col(c) > 0, true)) counts only the rows where the condition is true (when returns null otherwise, and count ignores nulls), so all columns are counted in a single pass over the data instead of one job per column:

val aggs = df.select(cols.map(c => count(when(col(c) > 0, true)).alias(c)): _*)

Collect the result and extract the values:

val row = aggs.first
val counts = cols.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
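For reference, here is a minimal end-to-end sketch putting the steps together with the column names taken from the question. The SparkSession setup and the parquet path are assumptions for illustration only; substitute however df is actually built.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder.appName("column-counts").getOrCreate()
val df = spark.read.parquet("/path/to/customers")  // hypothetical source; use your own DataFrame

// the filterable columns from the question
val cols = Seq("totalpurchase", "totalRent", "tvCount", "movieCount", "kidsCount", "adultsCount")

// one Spark job computes one conditional count per column
val aggs = df.select(cols.map(c => count(when(col(c) > 0, true)).alias(c)): _*)

val row = aggs.first
val counts = cols.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap

The plain row count (totalCustomers in the question) can still be obtained with df.count; only the per-column conditional counts need the single select above.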