thescoop thescoop - 5 months ago 18
Python Question

Reducing the rows of a dataframe by adding observed values

I have experimental data for a number of microscope slides. For each slide I have taken a number of photographic images, and on each image I have a number of specimens. I would like to know show many specimens I have for each slide:

eg: On slide 0, I have four specimens in total (three in image 1, one in image2):

from pandas import DataFrame, Series
import seaborn as sns

data = {'Slide' : [0,0,0,0,0,0,0,0,0, 0,0,0, 9,9,],
'Image' : [1,1,1,1,1,1,1,1,1, 2,2,2, 1,7],
'Specimen' : [1,1,1,1,2,2,3,3,3, 1,1,1, 1,1],
'Foci' : [8,9,7,9,6,9,7,9,6, 6,7,6, 9,6]}

df = DataFrame(data, columns=['Slide','Image','Specimen','Foci'])
df


enter image description here

I can get part way to an answer, but this is not quite what I need, as I still need to add up the specimen numbers for each slide:

df.groupby(['Slide','Image']).max()


enter image description here

Answer

The groupby feature allows you to essentially "pivot" the results like you would in excel:

df = df.groupby(['Slide','Image']).Specimen.nunique()

The .nunique() function will give you the number of unique values per image per slide. You can then use .reset_index() on this series to convert it back to a dataframe.

df.reset_index(inplace=True)

df
    Slide   Image   Specimen
0   0       1       3
1   0       2       1
2   9       1       1
3   9       7       1

If you then want to find just the sum of values by slide you can use one final groupby:

df =  df.groupby('Slide').sum()['Specimen']
df
Slide
0    4
9    2