BreakPhreak - 7 months ago 214
SQL Question

# How to calculate the percentage of total in Spark SQL

Considering the following data:

``````Name | Flag
A    | 0
A    | 1
A    | 0
B    | 0
B    | 1
B    | 1
``````

I'd like to transform it to:

``````Name | Total | With Flag | Percentage
A    | 3     | 1         | 33%
B    | 3     | 2         | 66%
``````

Preferably, in Spark SQL.

For example like this:

``````val df = sc.parallelize(Seq(
("A", 0), ("A", 1), ("A", 0),
("B", 0), ("B", 1), ("B", 1)
)).toDF("Name", "Flag")

df.groupBy(\$"Name").agg(
count("*").alias("total"),
sum(\$"flag").alias("with_flag"),
// Do you really want to truncate not for example round?
mean(\$"flag").multiply(100).cast("integer").alias("percentage"))

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
``````

or:

``````df.registerTempTable("df")
sqlContext.sql("""
SELECT name, COUNT(*) total, SUM(flag) with_flag,
CAST(AVG(flag) * 100 AS INT) percentage
FROM df
GROUP BY name""")

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
``````