Danny David Leybzon Danny David Leybzon - 6 months ago 76
Python Question

How do I convert a dataframe consisting of a column of sentences and a column of scores into one with a column of words and average scores?

I have a Pandas dataframe that resembles this:

sentence score
"This is a sentence." 5
"Another sentence?" 8


And I want one that resembles this:

word total_score count normalized_score
"sentence" 13 2 6.5
"this" 5 1 5


etc.

How should I go about doing this? My thought is to remove all non-alphanumeric characters, then use split() on all of the cells containing sentences, then combine those words into a set, then use that set to iterate through the original dataframe, counting the number of times a word is used and the corresponding scores. This, however, seems inelegant and potentially incredibly inefficient. Is there a better way to do this?

Note: Don't worry about stop words and assume all words are separated by spaces

Answer
import pandas as pd
df = pd.DataFrame(
    {'score': [5, 8], 'sentence': ["This is a sentence.", "Another sentence?"]})
df['sentence'] = df['sentence'].str.findall(r'\w+')
wide = df.apply(lambda x: pd.Series(x['score'], index=x['sentence']), axis=1)
count = wide.count()
total_score = wide.sum()
result = pd.DataFrame({'count':count, 
                       'total_score':total_score, 
                       'normalized_score':total_score/count}).reset_index()
result = result.rename(columns={'index':'word'})
print(result[['word', 'total_score', 'count', 'normalized_score']])

yields

       word  total_score  count  normalized_score
0   Another          8.0      1               8.0
1      This          5.0      1               5.0
2         a          5.0      1               5.0
3        is          5.0      1               5.0
4  sentence         13.0      2               6.5