I need to implement the following query:
Which 10 movies were the most controversial in 2015 (i.e., had the highest variance in ratings between 2015/01/01 and 2015/12/31)?
I implemented the following query in HIVE:
SELECT title, rating FROM
SELECT m.title as title, variance(r.rating) as var, r.rating as rating, r.time_stamp as time_stamp
FROM movies m JOIN ratings r ON m.movieid = r.movieid
DISTRIBUTE BY m.title, r.rating
GROUP BY m.title
SORT BY m.title, r.rating
WHERE year(from_unixtime(time_stamp)) = '2015'
GROUP BY title
Error while compiling statement: FAILED: ParseException line 6:4 missing ) at 'GROUP' near 'GROUP' line 6:10 missing EOF at 'BY' near 'GROUP'
Patrick, it is still SQL.
- You can't select a column that was not a part of the GROUP BY.
YEAR returns an integer (P.s. ratings is not partitioned?).
- You should have a really good reason to use
DISTRIBUTE BY and
SORT BY, technical clauses from the beggining time of Hive.
select m.title ,r.var from (select r.movieid ,variance(r.rating) as var from ratings as r where year(from_unixtime(time_stamp)) = 2015 group by r.movieid order by var desc limit 10 ) as r join movies as m on m.movieid = r.movieid ;