batman batman - 1 month ago 17
Java Question

COLLECT_SET() in Hive, keep duplicates?

Is there a way to keep the duplicates in a collected set in Hive, or simulate the sort of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that have the same key into an array, with duplicates.

I.E.:

hash_id | num_of_cats
=====================
ad3jkfk 4
ad3jkfk 4
ad3jkfk 2
fkjh43f 1
fkjh43f 8
fkjh43f 8
rjkhd93 7
rjkhd93 4
rjkhd93 7


should return:

hash_agg | cats_aggregate
===========================
ad3jkfk Array<int>(4,4,2)
fkjh43f Array<int>(1,8,8)
rjkhd93 Array<int>(7,4,7)

Answer

Try to use COLLECT_LIST(col) after Hive 0.13.0

SELECT
    hash_id, COLLECT_LIST(num_of_cats) AS aggr_set
FROM
    tablename
WHERE
    blablabla
GROUP BY
    hash_id
;
Comments