rrr rrr - 1 month ago 19
SQL Question

MySQL query to Hiveql

work(id, rank)

Data:

work
------------------
1 | A
1 | B
1 | C
1 | D
2 | A
2 | C
2 | B
3 | C


I need to find all pairs of ids that have the common rank with their count and it should display only if count of rank is greater than 2 and print them in descending order. I have written a mysql query for this but, I am new to SparkSQL and HIVEQL. So please help me how to do that.
For example using the data above the result set should be:

mysql query is:

select a.id,b.id
from work as a, work as b
where a.id>b.id
group by a.id,b.id having group_concat(distinct a.rank order by a.rank)=group_concat(distinct b.rank order by b.rank)

---------------------
id1 | id2 | Count
---------------------
A | B | 3
B | C | 3

Answer

I don't think Hive supports group_concat(). I think this does the same thing:

select a.id, b.id, a.cnt
from (select a.*, count(*) over (partition by a.id) as cnt
      from work a
     ) a join
     (select b.*, count(*) over (partition by b.id) as cnt
      from work b
     ) b
     on a.rank = b.rank and a.cnt = b.cnt
where a.id < b.id   -- I *think* this is allowed in Hive; it not, a subquery or expression in the `having` clause will do the same thing
group by a.id, b.id, a.cnt
having count(*) = a.cnt;

This is the more natural way to get pairs of ids with the same ranking. In fact, it should be more efficient in almost any database than the MySQL version. The cross join generates a lot of data.