Windstorm1981 - asked 1 year ago
Python Question

Pandas Select all Rows with Either of the Top 2 Values in a Particular Column

I have the following sample data:

IND ID value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
26 A 19289 6/30/2017 8/14/2017 15:10
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
25 A 18155 3/31/2017 8/14/2017 15:10
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
24 A 21770 12/31/2016 8/14/2017 15:10
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
23 A 19226 9/30/2016 8/14/2017 15:10
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
22 A 20238 6/30/2016 8/14/2017 15:10
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
21 A 18684 3/31/2016 8/14/2017 15:10
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
20 A 22059 12/31/2015 8/14/2017 15:10
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
19 A 19280 9/30/2015 8/14/2017 15:10
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25
18 A 20813 6/30/2015 8/14/2017 15:10


It is a set of data I collect every weekday (SYSDATE is a timestamp).

I want to generate a df with only the rows stamped with either of the two most recent timestamp values every day.

So I'm looking to get this (from a much larger set with many timestamps) if I were to run the script today:

IND ID value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25


I can't use datetimes because of weekends and holidays.

Suggestions?

Thanks in advance.

Answer Source

First make sure 'SYSDATE' is converted to datetime; I'll convert 'EFFECTIVE DT' as well.

df[['EFFECTIVE DT', 'SYSDATE']] = \
    df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)
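For illustration, the conversion behaves like this on strings in the same month/day/year format as the sample data (the series below is made up):

```python
import pandas as pd

# Hypothetical sample strings in the question's month/day/year format
s = pd.Series(['8/16/2017 10:05', '8/15/2017 14:25'])
out = pd.to_datetime(s)
# The values become proper Timestamps that compare chronologically
print(out.dtype)
```

Once converted, comparisons like `out[0] > out[1]` work on actual timestamps rather than strings.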

Then use groupby.apply with the DataFrame method pd.DataFrame.nlargest, passing columns='SYSDATE' and n=2 to keep the rows with the two largest 'SYSDATE' values in each group.

df.groupby(
    'EFFECTIVE DT', group_keys=False, sort=False
).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')

    IND ID  value EFFECTIVE DT             SYSDATE
0     8  A  19289   2017-06-30 2017-08-16 10:05:00
1    17  A  19289   2017-06-30 2017-08-15 14:25:00
3     7  A  18155   2017-03-31 2017-08-16 10:05:00
4    16  A  18155   2017-03-31 2017-08-15 14:25:00
6     6  A  21770   2016-12-31 2017-08-16 10:05:00
7    15  A  21770   2016-12-31 2017-08-15 14:25:00
9     5  A  19226   2016-09-30 2017-08-16 10:05:00
10   14  A  19226   2016-09-30 2017-08-15 14:25:00
12    4  A  20238   2016-06-30 2017-08-16 10:05:00
13   13  A  20238   2016-06-30 2017-08-15 14:25:00
15    3  A  18684   2016-03-31 2017-08-16 10:05:00
16   12  A  18684   2016-03-31 2017-08-15 14:25:00
18    2  A  22059   2015-12-31 2017-08-16 10:05:00
19   11  A  22059   2015-12-31 2017-08-15 14:25:00
21    1  A  19280   2015-09-30 2017-08-16 10:05:00
22   10  A  19280   2015-09-30 2017-08-15 14:25:00
24    0  A  20813   2015-06-30 2017-08-16 10:05:00
25    9  A  20813   2015-06-30 2017-08-15 14:25:00
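Putting both steps together, here is a minimal self-contained sketch on a small made-up frame (column names as in the question):

```python
import pandas as pd

# Tiny made-up frame mirroring the question's columns
df = pd.DataFrame({
    'ID': ['A'] * 4,
    'value': [19289, 19289, 19289, 18155],
    'EFFECTIVE DT': ['6/30/2017', '6/30/2017', '6/30/2017', '3/31/2017'],
    'SYSDATE': ['8/16/2017 10:05', '8/15/2017 14:25',
                '8/14/2017 15:10', '8/16/2017 10:05'],
})
df[['EFFECTIVE DT', 'SYSDATE']] = \
    df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)

# Keep the rows with the two most recent SYSDATE values per EFFECTIVE DT
top2 = df.groupby('EFFECTIVE DT', group_keys=False, sort=False).apply(
    pd.DataFrame.nlargest, n=2, columns='SYSDATE')
print(top2)
```

The 6/30/2017 group has three rows, so its oldest timestamp is dropped; the 3/31/2017 group has only one row, which is kept as-is.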

How It Works
To start, splitting the data, applying a function to each split, and recombining the results is a core pandas pattern, explained well in the split-apply-combine documentation.

The groupby part should be self-evident: I group the data by day, as defined by the dates in the 'EFFECTIVE DT' column. From there you can do many things with the groupby object; I chose to apply a function that returns the rows corresponding to the two largest values of 'SYSDATE'. The largest values are the most recent ones for that group's day.

It turns out there is a DataFrame method that does exactly this, returning the rows corresponding to the largest values of a column: pd.DataFrame.nlargest.

Two things to note:

  1. When we use groupby.apply, the object being passed to the function being applied is a pd.DataFrame object.
  2. When using a method like pd.DataFrame.nlargest as a function, the first argument that is expected is a pd.DataFrame object.

Well, that's fortunate, because that's exactly what I'm doing.
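To see the equivalence concretely (with a throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({'SYSDATE': pd.to_datetime(
    ['2017-08-14', '2017-08-15', '2017-08-16'])})

# The unbound method called with the frame as its first argument...
a = pd.DataFrame.nlargest(df, n=2, columns='SYSDATE')
# ...is the same as the ordinary bound call
b = df.nlargest(2, 'SYSDATE')
print(a.equals(b))
```

This is why passing pd.DataFrame.nlargest directly to groupby.apply works: apply hands each group frame in as that first argument.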

Also, groupby.apply forwards additional keyword arguments to the applied function, so I can pass n=2 and columns='SYSDATE' directly.


Timing

I wrote another function that leverages numpy's argpartition to find the two largest values without a full sort. It is faster than the nlargest approach, but still much slower than @jezrael's sort-first approach.

import numpy as np
import pandas as pd
from timeit import timeit

def nlrg(d):
    # positions of the two largest HOURS values, without a full sort
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]
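The np.argpartition trick in nlrg can be seen in isolation: it moves the indices of the two largest values into the last two slots without fully sorting the array.

```python
import numpy as np

v = np.array([3, 9, 1, 7, 5])
# idx holds the positions of the two largest values (order not guaranteed)
idx = np.argpartition(v, v.size - 2)[-2:]
print(sorted(v[idx]))
```

This is O(n) on average, versus O(n log n) for a full sort, which is where the speedup over nlargest comes from.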

pir1 = lambda d: d.groupby('DAYS', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='HOURS')
pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
jez1 = lambda d: d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2).sort_index()

def produce_test_df(i):
    hours = pd.date_range('2000-01-01', periods=i, freq='H')
    days = hours.floor('D')
    return pd.DataFrame(dict(HOURS=hours, DAYS=days))

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 jez1'.split()
)

for i in results.index:
    d = produce_test_df(i)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=10)  # set_value was removed in pandas 1.0

results.plot(loglog=True)

[log-log plot of the timings: jez1 is consistently fastest, pir2 faster than pir1]
