I have the following sample data:
IND ID value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
26 A 19289 6/30/2017 8/14/2017 15:10
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
25 A 18155 3/31/2017 8/14/2017 15:10
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
24 A 21770 12/31/2016 8/14/2017 15:10
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
23 A 19226 9/30/2016 8/14/2017 15:10
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
22 A 20238 6/30/2016 8/14/2017 15:10
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
21 A 18684 3/31/2016 8/14/2017 15:10
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
20 A 22059 12/31/2015 8/14/2017 15:10
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
19 A 19280 9/30/2015 8/14/2017 15:10
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25
18 A 20813 6/30/2015 8/14/2017 15:10
Desired result (the two most recent SYSDATE rows per EFFECTIVE DT):
IND ID Value EFFECTIVE DT SYSDATE
8 A 19289 6/30/2017 8/16/2017 10:05
17 A 19289 6/30/2017 8/15/2017 14:25
7 A 18155 3/31/2017 8/16/2017 10:05
16 A 18155 3/31/2017 8/15/2017 14:25
6 A 21770 12/31/2016 8/16/2017 10:05
15 A 21770 12/31/2016 8/15/2017 14:25
5 A 19226 9/30/2016 8/16/2017 10:05
14 A 19226 9/30/2016 8/15/2017 14:25
4 A 20238 6/30/2016 8/16/2017 10:05
13 A 20238 6/30/2016 8/15/2017 14:25
3 A 18684 3/31/2016 8/16/2017 10:05
12 A 18684 3/31/2016 8/15/2017 14:25
2 A 22059 12/31/2015 8/16/2017 10:05
11 A 22059 12/31/2015 8/15/2017 14:25
1 A 19280 9/30/2015 8/16/2017 10:05
10 A 19280 9/30/2015 8/15/2017 14:25
0 A 20813 6/30/2015 8/16/2017 10:05
9 A 20813 6/30/2015 8/15/2017 14:25
You need to first make sure SYSDATE is turned into datetime. I'll do it for EFFECTIVE DT as well.
df[['EFFECTIVE DT', 'SYSDATE']] = \
df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)
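To see why this step matters, here is a minimal sketch on a tiny frame mimicking the sample data (the two rows are illustrative, not the full dataset): after the conversion both columns have a datetime64 dtype, so comparisons and nlargest order chronologically rather than lexicographically.

```python
import pandas as pd

# A small frame mimicking the sample data (values are illustrative).
df = pd.DataFrame({
    'EFFECTIVE DT': ['6/30/2017', '3/31/2017'],
    'SYSDATE': ['8/16/2017 10:05', '8/15/2017 14:25'],
})

# Convert both string columns to proper datetimes in one shot.
df[['EFFECTIVE DT', 'SYSDATE']] = \
    df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)

print(df.dtypes)
```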
Use groupby.apply with the dataframe method pd.DataFrame.nlargest, passing the parameters columns='SYSDATE' and n=2 to get the two largest 'SYSDATE' values per group.
df.groupby(
'EFFECTIVE DT', group_keys=False, sort=False
).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')
IND ID value EFFECTIVE DT SYSDATE
0 8 A 19289 2017-06-30 2017-08-16 10:05:00
1 17 A 19289 2017-06-30 2017-08-15 14:25:00
3 7 A 18155 2017-03-31 2017-08-16 10:05:00
4 16 A 18155 2017-03-31 2017-08-15 14:25:00
6 6 A 21770 2016-12-31 2017-08-16 10:05:00
7 15 A 21770 2016-12-31 2017-08-15 14:25:00
9 5 A 19226 2016-09-30 2017-08-16 10:05:00
10 14 A 19226 2016-09-30 2017-08-15 14:25:00
12 4 A 20238 2016-06-30 2017-08-16 10:05:00
13 13 A 20238 2016-06-30 2017-08-15 14:25:00
15 3 A 18684 2016-03-31 2017-08-16 10:05:00
16 12 A 18684 2016-03-31 2017-08-15 14:25:00
18 2 A 22059 2015-12-31 2017-08-16 10:05:00
19 11 A 22059 2015-12-31 2017-08-15 14:25:00
21 1 A 19280 2015-09-30 2017-08-16 10:05:00
22 10 A 19280 2015-09-30 2017-08-15 14:25:00
24 0 A 20813 2015-06-30 2017-08-16 10:05:00
25 9 A 20813 2015-06-30 2017-08-15 14:25:00
How It Works
To start, splitting the data, applying a function to each split, and recombining the results is a key feature of pandas, and it is explained well in the split-apply-combine documentation.
The groupby element should be self-evident: I want to group the data by each day, as defined by the dates in the 'EFFECTIVE DT' column. After that, you can do many things with this groupby object. I decided to apply a function that returns the two rows corresponding to the largest two values of the 'SYSDATE' column; those largest values equate to the most recent entries for the day of the group.
It turns out that there is a dataframe method that performs exactly this task of returning the rows corresponding to the largest values of a column: pd.DataFrame.nlargest.
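As a minimal illustration of the method on its own (toy numbers, not the sample data):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 30, 20], 'key': ['a', 'b', 'c']})

# Return the rows holding the 2 largest entries in 'value'.
top2 = df.nlargest(2, 'value')
print(top2)
```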
Two things to note:
- With groupby.apply, the object passed to the function being applied is a pd.DataFrame object.
- When pd.DataFrame.nlargest is used as a plain function, the first argument it expects is a pd.DataFrame object.
Well, that's fortunate, because that's exactly what I'm doing.
Also, groupby.apply allows you to pass additional keyword arguments to the applied function via kwargs, so I can pass n=2 and columns='SYSDATE' easily.
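For comparison, the kwargs form can also be written with an explicit lambda; a sketch on toy data (column names 'grp'/'val' are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'grp': ['a', 'a', 'a', 'b', 'b'],
    'val': [1, 3, 2, 9, 7],
})

# Passing kwargs through groupby.apply...
via_kwargs = df.groupby('grp', group_keys=False, sort=False).apply(
    pd.DataFrame.nlargest, n=2, columns='val')

# ...is equivalent to an explicit lambda.
via_lambda = df.groupby('grp', group_keys=False, sort=False).apply(
    lambda g: g.nlargest(2, 'val'))

print(via_kwargs.equals(via_lambda))
```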
Timing
I produced another function that leverages numpy's np.argpartition to mimic nlargest. It runs faster than the groupby.apply version. However, when I compare it to @jezrael's sort-first approach, it is still much slower.
def nlrg(d):
    # Partial sort: the last two positions of argpartition's result are
    # the indices of the two largest 'HOURS' values.
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]
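To see what np.argpartition is doing inside nlrg, a tiny standalone example:

```python
import numpy as np

v = np.array([5, 1, 9, 3, 7])

# Partition so the element at position v.size - 2 lands in its sorted
# place; everything after it is >= it. The last two slots therefore
# hold the indices of the two largest values (in no guaranteed order).
a = np.argpartition(v, v.size - 2)[-2:]

print(sorted(v[a].tolist()))
```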
pir1 = lambda d: d.groupby('DAYS', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='HOURS')
pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
jez1 = lambda d: d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2).sort_index()
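The sort-first approach works because, after sorting by group and then by value, tail(2) takes the last two (largest) rows of each group. A quick equivalence check on toy data, using the same DAYS/HOURS column names as the benchmark:

```python
import pandas as pd

d = pd.DataFrame({
    'DAYS': ['d1', 'd1', 'd1', 'd2', 'd2'],
    'HOURS': [1, 3, 2, 9, 7],
})

# Sort by group, then by value; the last 2 rows per group are the largest.
jez = d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2).sort_index()

pir = d.groupby('DAYS', group_keys=False, sort=False).apply(
    pd.DataFrame.nlargest, n=2, columns='HOURS')

# Same rows either way, possibly in different order.
print(jez.sort_index().equals(pir.sort_index()))
```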
def produce_test_df(i):
    # i hourly timestamps, bucketed into their calendar days.
    hours = pd.date_range('2000-01-01', periods=i, freq='H')
    days = hours.floor('D')
    return pd.DataFrame(dict(HOURS=hours, DAYS=days))
from timeit import timeit

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 jez1'.split()
)

for i in results.index:
    d = produce_test_df(i)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # Store the timing for this size/function pair.
        results.at[i, j] = timeit(stmt, setp, number=10)
results.plot(loglog=True)