I need to implement the following query:
Which 10 movies were the most controversial in 2015 (i.e., had the highest variance in ratings between 2015/01/01 and 2015/12/31)?
Tables
I implemented the following query in HIVE:
SELECT title, rating FROM
(
SELECT m.title as title, variance(r.rating) as var, r.rating as rating, r.time_stamp as time_stamp
FROM movies m JOIN ratings r ON m.movieid = r.movieid
DISTRIBUTE BY m.title, r.rating
GROUP BY m.title
SORT BY m.title, r.rating
) A
WHERE year(from_unixtime(time_stamp)) = '2015'
GROUP BY title
LIMIT 10;
Error while compiling statement: FAILED: ParseException line 6:4 missing ) at 'GROUP' near 'GROUP' line 6:10 missing EOF at 'BY' near 'GROUP'
Patrick, it is still SQL.
- You can't select a column that was not a part of the GROUP BY.
- YEAR
returns an integer (P.s. ratings is not partitioned?).
- You should have a really good reason to use DISTRIBUTE BY
and SORT BY
, technical clauses from the beggining time of Hive.
select m.title
,r.var
from (select r.movieid
,variance(r.rating) as var
from ratings as r
where year(from_unixtime(time_stamp)) = 2015
group by r.movieid
order by var desc
limit 10
) as r
join movies as m
on m.movieid =
r.movieid
;