elleciel - 1 month ago 4x
Python Question

# Pandas: keeping only first row of data in each 60 second bin

What's the best way to keep only the first row of each 60 second bin of data in pandas? i.e. For every row that occurs at increasing time

t
, I want to delete all rows that occur up to
t+60
seconds.

I know there's some combination of
groupby().first()
that I can probably use, but the code examples I've seen (e.g. using
pandas.Grouper(freq='60s')
) will discard the original datetimes in favor of every 60 seconds offset from midnight rather than my original datetimes.

For example, the following:

time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
2 2016-05-11 13:02:05.760903860 0.155666
3 2016-05-11 13:02:18.325613076 0.157788
4 2016-05-11 13:02:18.486519052 0.157788
5 2016-05-11 13:02:20.243748548 0.157788
6 2016-05-11 13:02:20.533101692 0.157788
7 2016-05-11 13:02:20.646061652 0.157788
8 2016-05-11 13:02:21.121409820 0.157788
9 2016-05-11 13:04:24.660609068 0.211649
10 2016-05-11 13:04:24.660845612 0.211649
11 2016-05-11 13:04:24.660957596 0.211649
12 2016-05-11 13:04:24.661378132 0.211649
13 2016-05-11 13:04:24.661450628 0.211649
14 2016-05-11 13:04:24.661607044 0.211649

should become this:

time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
3 2016-05-11 13:04:24.660609068 0.211649

### Solution

def td60(ta):
d = np.timedelta64(int(6e10))
tp = ta + d
j = 0
yield j
for i, tx in enumerate(ta):
if tx > tp[j]:
yield i
j = i

def pir(df):
slc = list(td60(df.time.values))
return pd.DataFrame(df.values[slc], df.index[slc])

Example usage

pir(df)

### Setup for timing 500,000 rows

pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()

df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))

### Timing

Cythonize
In Jupyter

%%cython
import numpy as np
import pandas as pd

def td60(ta):
d = np.timedelta64(int(6e10))
tp = ta + d
j = 0
yield j
for i, tx in enumerate(ta):
if tx > tp[j]:
yield i
j = i

def pir(df):
slc = list(td60(df.time.values))
return pd.DataFrame(df.values[slc], df.index[slc])

After Cythonizing
Not much different

### reference setup for OP example

from StringIO import StringIO
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""