shakedzy shakedzy - 7 months ago 36
SQL Question

SQL on Spark: How do I get all values of DISTINCT?

So, assume I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue


I would like to get a table of the distinct colors for each name - how many and their values. Meaning, something like this:

Name | Distinct | Values
--------------------------------------
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue


Any ideas how to do so?

Answer

collect_list will give you a list without removing duplicates. collect_set will automatically remove duplicates so just

select 
Name,
count(distinct color) as Distinct, # not a very good name
collect_set(Color) as Values
from TblName
group by Name

this feature is implemented since spark 1.6.0 check it out:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

/**
   * Aggregate function: returns a set of objects with duplicate elements eliminated.
   *
   * For now this is an alias for the collect_set Hive UDAF.
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(columnName: String): Column = collect_set(Column(columnName))
Comments