Brian Brian - 4 months ago 17
Python Question

get first and last values in a groupby

I have a dataframe

df


df = pd.DataFrame(np.arange(20).reshape(10, -1),
[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
['X', 'Y'])


How do I get the first and last rows, grouped by the first level of the index?

I tried

df.groupby(level=0).agg(['first', 'last']).stack()


and got

X Y
a first 0 1
last 6 7
b first 8 9
last 12 13
c first 14 15
last 16 17
d first 18 19
last 18 19


This is so close to what I want. How can I preserve the level 1 index and get this instead:

X Y
a a 0 1
d 6 7
b e 8 9
g 12 13
c h 14 15
i 16 17
d j 18 19
j 18 19

Answer

Option 1

def first_last(df):
    return df.ix[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)

enter image description here


Option 2 - only works if index is unique

idx = df.index.to_series().groupby(level=0).agg(['first', 'last']).stack()
df.loc[idx]

Option 3 - per notes below, this only makes sense when there are no NAs

I also abused the agg function. The code below works, but is far uglier.

df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
    .set_index('level_1', append=True).reset_index(1, drop=True) \
    .rename_axis([None, None])

Note

per @unutbu: agg(['first', 'last']) take the firs non-na values.

I interpreted this as, it must then be necessary to run this column by column. Further, forcing index level=1 to align may not even make sense.

Let's include another test

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [list('aaaabbbccd'),
                   list('abcdefghij')],
                  list('XY'))

df.loc[tuple('aa'), 'X'] = np.nan

def first_last(df):
    return df.ix[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)

enter image description here

df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
    .set_index('level_1', append=True).reset_index(1, drop=True) \
    .rename_axis([None, None])

enter image description here

Sure enough! This second solution is taking the first valid value in column X. It is now nonsensical to have forced that value to align with the index a.