Ratchainant Thammasudjarit Ratchainant Thammasudjarit - 16 days ago 5x
Python Question

Pandas : Getting unique rows for a given column but conditional on some criteria of other columns

I'm using python 2.7. From a given data as follows:

data = pd.DataFrame({'id':['001','001','001','002','002','003','003','003','004','005'],

The data looks like this:

id status value
001 ground 10
001 unknown -5
001 air 12
002 ground 20
002 unknown -12
003 ground 2
003 unknown -4
003 unknown -1
004 unknown 0
005 ground 6

I would like to get the output in dataframe that has unique id conditional to the following criteria: For a given id

'status': If 'air' does exist, pick 'air'.
If 'air' does not exist, pick 'ground'.
If both 'air' and 'ground' do not exist, pick 'unknown'.

'value': Sum of values for each id
'count': Count the number of rows for each id

Therefore, the expected output is the following.

id status value count
001 air 17 3
002 ground 8 2
003 ground -3 3
004 unknown 0 1
005 ground 6 1

I can do looping for each unique id but it is not elegant enough and computation is also expensive, especially when data becomes large. May i know the better pythonic style and more efficient way to come up with this output? Thank you in advance.


You want to use groupby on id. This is easy for value and count but trickier for the status. We need to write our own function which takes a pandas Series and returns a single attribute.

def group_status(x):
    if (x=='air').any():
        y = 'air'
    elif (x=='ground').any():
        y = 'ground'
        y = 'unknown'
    return y

data = data.groupby(by='id').agg({'value': ['sum', 'count'], 'status': [group_status]})
data.columns = ['status', 'value', 'count']


     status  value  count
001  air      17     3
002  ground   8      2
003  ground   -3     3
004  unknown  0      1
005  ground   6      1

Here we have ensured that the air, ground, unknown order is preserved without the need to change the column type to categorical, as mentioned in ayhan's very elegant answer.

The group_status() function does lay the groundwork should you wish to incorporate more advanced groupby functionality.