Student Student - 3 months ago 31
Python Question

How you calculate the average rating per genre in python?

I have this lens dataframe. It has columns to classify genres a movie belongs to. The genre categories are column names with binary values in the rows. If a movie belongs to a genre, it has a 1 under the appropriate column and 0 otherwise. I want to calculate the average rating per genre for each user in python pandas.

# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.user',
sep='|',names=u_cols, encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.data',
sep='\t', names=r_cols, encoding='latin-1')

# Reading item file:
m_cols = ['movie_id', 'title' ,'release_date','video_release_date', 'imdb_url',
'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy',
'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi','Thriller', 'War', 'Western']

movies = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.item',
sep='|', names=m_cols, encoding='latin-1')

# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)

# I have tried this but don't know how to get the average of the ratings for each user.
df = pd.pivot_table(lens, index = ['user_id'],
columns = ['unknown', 'Action', 'Adventure', 'Animation',
'Children\'s', 'Comedy', 'Crime', 'Documentary',
'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi',
'Thriller', 'War', 'Western'],
values = ['rating'])
print df

Answer

Consider reshaping your dataframe from wide to long to create a genre column and then run result through pivot_table() using its aggfunc argument, specifically for numpy mean:

import pandas as pd
import numpy as np

#...same code...

lens = pd.merge(movie_ratings, users)

genrecols = ['unknown', 'Action', 'Adventure', 'Animation',
             'Children\'s', 'Comedy', 'Crime', 'Documentary',
             'Drama', 'Fantasy', 'Film-Noir', 'Horror', 
             'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
             'Thriller', 'War', 'Western']

# RESHAPE DF BY MELTING (WIDE TO LONG), SELECTING ONLY NEEDED FIELDS    
mdf = pd.melt(lens[['user_id', 'sex', 'rating'] + genrecols], 
              id_vars=['user_id', 'sex', 'rating'], var_name='genre')

# FILTER FOR VALUE = 1 AND THREE NEEDED COLUMNS
mdf = mdf[mdf['value']==1][['user_id', 'sex', 'rating', 'genre']]

# RUN PIVOTED AGGREGATION    
df = pd.pivot_table(mdf, columns = ['genre'], index = ['user_id', 'sex'], 
                    values = ['rating'], aggfunc = np.mean)

print df