Tasos Tasos - 21 days ago
Python Question

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one

import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])

>>> df
   A  B  C
0  1  2  1
1  1  3  2
2  4  6  3
3  4  3  4
4  5  4  5


The original table is more complicated with more columns and rows.

I want to get the first row that fulfils some criteria. Examples:


  1. Get first row where A > 3 (returns row 2)

  2. Get first row where A > 4 AND B > 3 (returns row 4)

  3. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)



But if no row fulfils the specific criteria, I want to get the first row after sorting the dataframe descending by A (or, in other cases, by B, C, etc.).


  1. Get first row where A > 6 (returns row 4: sort by A descending and take the first row)



I was able to do it by iterating over the dataframe (I know that's crap :P), so I'd prefer a more pythonic way to solve it.

Answer

This tutorial is a very good one for pandas slicing. Make sure you check it out. On to some snippets... To slice a dataframe with a condition, you use this format:

>>> df[condition]

This will return a slice of your dataframe which you can index using iloc. Here are your examples:

  1. Get first row where A > 3 (returns row 2)

    >>> df[df.A > 3].iloc[0]
    A    4
    B    6
    C    3
    Name: 2, dtype: int64
    

If what you actually want is the row's index label rather than the row itself, use df[df.A > 3].index[0] instead of iloc.

  2. Get first row where A > 4 AND B > 3 (returns row 4)

    >>> df[(df.A > 4) & (df.B > 3)].iloc[0]
    A    5
    B    4
    C    5
    Name: 4, dtype: int64
    
  3. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

    >>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
    A    4
    B    6
    C    3
    Name: 2, dtype: int64
    

Now, for your last case, we can write a function that handles the default by returning the first row of the frame sorted in descending order:

>>> def series_or_default(X, condition, default_col, ascending=False):
...     sliced = X[condition]
...     if sliced.shape[0] == 0:
...         return X.sort_values(default_col, ascending=ascending).iloc[0]
...     return sliced.iloc[0]
>>> 
>>> series_or_default(df, df.A > 6, 'A')
A    5
B    4
C    5
Name: 4, dtype: int64

As expected, it returns row 4.
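As a possible variation, the same idea can be written without building the filtered frame first. This is only a sketch, and the function name first_match_or_default is made up for illustration. It assumes the index labels are unique, so that .loc with a single label returns one row. For a boolean Series, idxmax() gives the label of the first True value, and any() tells us whether there is a match at all.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=['A', 'B', 'C'],
)

def first_match_or_default(frame, mask, default_col, ascending=False):
    """Return the first row where mask is True, else the first row
    of frame sorted by default_col (hypothetical helper)."""
    if mask.any():
        # idxmax() on a boolean Series returns the label of the first True.
        return frame.loc[mask.idxmax()]
    return frame.sort_values(default_col, ascending=ascending).iloc[0]

matched = first_match_or_default(df, df.A > 3, 'A')   # row labelled 2
fallback = first_match_or_default(df, df.A > 6, 'A')  # row labelled 4
print(matched.name, fallback.name)
```

Both versions take the already-evaluated condition, so the caller builds the mask from the same frame that is passed in.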