MSalty MSalty - 2 years ago 136
Python Question

Python - Counting Unique Labels in a Date Range

I'm attempting to perform sentiment analysis on a pile of text data I have scraped from the internet. I have reached a point where my Pandas DataFrame has the following columns I wish to analyse: "post_date" (in the format dd-mm-yyyy ie. 01-10-2017) & "Sentiment" (in the format "positive","neutral" or "negative").

I wish to be able to count the number of posts by each day/month/year and also the number of positive/neutral/negative posts for each day.

for example like those produced by:

print pd.value_counts(df.Sentiment)


However I am stuck, i've tried many iterations of the groupby command (below), but keep coming up with errors.

df.groupby(df.post_date.dt.year)


Could anyone please help with how I can achieve this?

Ideally the desired output would be something like:

Date, Postive_Posts, Negative_Posts, Neutral_Posts, Total_Posts
01/10/2017, 10, 5, 8, 23
02/10/2017, 5, 20, 5, 30


Where date is how the information has been grouped (day, month, year, etc.), and the pos/neg/neu columns are the total posts corresponding to the count of the labels in that range, finally total_posts is the total count of posts in that range.

the data is currently:

post_date, Sentiment
19/09/2017, positive
19/09/2017, positive
19/09/2017, positive
20/09/2017, negative
20/09/2017, neutral


Please let me know if you need anymore information.

Answer Source

You can use groupby + size + unstack + add_suffix + sum:

df1 = df.groupby(['post_date','Sentiment']).size().unstack(fill_value=0).add_suffix('_Posts')
df1['Total_Posts'] = df1.sum(axis=1)
print (df1)

Sentiment   negative_Posts  neutral_Posts  positive_Posts  Total_Posts
post_date                                                             
19/09/2017               0              0               3            3
20/09/2017               1              1               0            2

One line solution is very similar - only need assign:

df1 = (df.groupby(['post_date','Sentiment'])
        .size()
        .unstack(fill_value=0)
        .add_suffix('_Posts')
        .assign(Total_Posts=lambda x: x.sum(axis=1)))

print (df1)

Sentiment   negative_Posts  neutral_Posts  positive_Posts  Total_Posts
post_date                                                             
19/09/2017               0              0               3            3
20/09/2017               1              1               0            2

And for column from index:

df1 = (df.groupby(['post_date','Sentiment'])
        .size()
        .unstack(fill_value=0)
        .add_suffix('_Posts')
        .assign(Total_Posts=lambda x: x.sum(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))

print (df1)

    post_date  negative_Posts  neutral_Posts  positive_Posts  Total_Posts
0  19/09/2017               0              0               3            3
1  20/09/2017               1              1               0            2
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download