DDRRpy DDRRpy - 1 month ago 7
Python Question

Pandas: calculating the mean values of duplicate entries in a dataframe

I have been working with a dataframe in python and pandas that contains duplicate entries in the first column. The dataframe looks something like this:

sample_id qual percent
0 sample_1 10 20
1 sample_2 20 30
2 sample_1 50 60
3 sample_2 10 90
4 sample_3 100 20


I want to write something that identifies duplicate entries within the first column and calculates the mean values of the subsequent columns. An ideal output would be something similar to the following:

sample_id qual percent
0 sample_1 30 40
1 sample_2 15 60
2 sample_3 100 20


I have been struggling with this problem all afternoon and would appreciate any help.

Answer

groupby the sample_id column and use mean

df.groupby('sample_id').mean().reset_index()
or
df.groupby('sample_id', as_index=False).mean()

get you

enter image description here

Comments