Patrick Patrick - 4 years ago 163
SQL Question

How to write the following Query in HIVE

I need to implement the following query:


Which 10 movies were the most controversial in 2015 (i.e., had the highest variance in ratings between 2015/01/01 and 2015/12/31)?

Tables

I implemented the following query in HIVE:



SELECT title, rating FROM
(
SELECT m.title as title, variance(r.rating) as var, r.rating as rating, r.time_stamp as time_stamp
FROM movies m JOIN ratings r ON m.movieid = r.movieid
DISTRIBUTE BY m.title, r.rating
GROUP BY m.title
SORT BY m.title, r.rating
) A
WHERE year(from_unixtime(time_stamp)) = '2015'
GROUP BY title
LIMIT 10;




But I get the following error:

Error while compiling statement: FAILED: ParseException line 6:4 missing ) at 'GROUP' near 'GROUP' line 6:10 missing EOF at 'BY' near 'GROUP'

Answer Source

Patrick, it is still SQL.
- You can't select a column that was not a part of the GROUP BY.
- YEAR returns an integer (P.s. ratings is not partitioned?).
- You should have a really good reason to use DISTRIBUTE BY and SORT BY, technical clauses from the beggining time of Hive.

select      m.title
           ,r.var

from                   (select      r.movieid
                                   ,variance(r.rating)  as var

                        from        ratings as r

                        where       year(from_unixtime(time_stamp)) = 2015

                        group by    r.movieid

                        order by    var desc

                        limit       10
                        ) as r

            join        movies as m

            on          m.movieid   =
                        r.movieid
;                        
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download