shakedzy shakedzy - 8 months ago 50
SQL Question

SQL on Spark: How do I get all values of DISTINCT?

So, assume I have the following table:

Name | Color
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue

I would like to get a table of the distinct colors for each name - how many and their values. Meaning, something like this:

Name | Distinct | Values
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue

Any ideas how to do so?


collect_list will give you a list without removing duplicates. collect_set will automatically remove duplicates so just

count(distinct color) as Distinct, # not a very good name
collect_set(Color) as Values
from TblName
group by Name

this feature is implemented since spark 1.6.0 check it out:

   * Aggregate function: returns a set of objects with duplicate elements eliminated.
   * For now this is an alias for the collect_set Hive UDAF.
   * @group agg_funcs
   * @since 1.6.0
  def collect_set(columnName: String): Column = collect_set(Column(columnName))