elleciel elleciel - 2 months ago 8
Python Question

Pandas: keeping only first row of data in each 60 second bin

What's the best way to keep only the first row of each 60 second bin of data in pandas? i.e. For every row that occurs at increasing time

t
, I want to delete all rows that occur up to
t+60
seconds.

I know there's some combination of
groupby().first()
that I can probably use, but the code examples I've seen (e.g. using
pandas.Grouper(freq='60s')
) will discard the original datetimes in favor of every 60 seconds offset from midnight rather than my original datetimes.

For example, the following:

time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
2 2016-05-11 13:02:05.760903860 0.155666
3 2016-05-11 13:02:18.325613076 0.157788
4 2016-05-11 13:02:18.486519052 0.157788
5 2016-05-11 13:02:20.243748548 0.157788
6 2016-05-11 13:02:20.533101692 0.157788
7 2016-05-11 13:02:20.646061652 0.157788
8 2016-05-11 13:02:21.121409820 0.157788
9 2016-05-11 13:04:24.660609068 0.211649
10 2016-05-11 13:04:24.660845612 0.211649
11 2016-05-11 13:04:24.660957596 0.211649
12 2016-05-11 13:04:24.661378132 0.211649
13 2016-05-11 13:04:24.661450628 0.211649
14 2016-05-11 13:04:24.661607044 0.211649


should become this:

time value
0 2016-05-11 13:00:10.841015028 0.215978
1 2016-05-11 13:02:05.760595780 0.155666
3 2016-05-11 13:04:24.660609068 0.211649

Answer

see Path Dependent Slicing

Solution

def td60(ta):
    d = np.timedelta64(int(6e10))
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])

Example usage

pir(df)

enter image description here


Setup for timing 500,000 rows

pop_n, smp_n = 1000000, 500000
np.random.seed([3,1415])
tidx = pd.date_range('2016-09-08', periods=pop_n, freq='5s')
tidx = np.random.choice(tidx, smp_n, False)
tidx = pd.to_datetime(tidx).sort_values()

df = pd.DataFrame(dict(time=tidx, value=np.random.rand(smp_n)))

Timing

enter image description here

Cythonize
In Jupyter

%load_ext Cython

%%cython
import numpy as np
import pandas as pd

def td60(ta):
    d = np.timedelta64(int(6e10))
    tp = ta + d
    j = 0
    yield j
    for i, tx in enumerate(ta):
        if tx > tp[j]:
            yield i
            j = i

def pir(df):
    slc = list(td60(df.time.values))
    return pd.DataFrame(df.values[slc], df.index[slc])

After Cythonizing
Not much different

enter image description here


reference setup for OP example

from StringIO import StringIO
import pandas as pd

text = """time,value
2016-05-11 13:00:10.841015028,0.215978
2016-05-11 13:02:05.760595780,0.155666
2016-05-11 13:02:05.760903860,0.155666
2016-05-11 13:02:18.325613076,0.157788
2016-05-11 13:02:18.486519052,0.157788
2016-05-11 13:02:20.243748548,0.157788
2016-05-11 13:02:20.533101692,0.157788
2016-05-11 13:02:20.646061652,0.157788
2016-05-11 13:02:21.121409820,0.157788
2016-05-11 13:04:24.660609068,0.211649
2016-05-11 13:04:24.660845612,0.211649
2016-05-11 13:04:24.660957596,0.211649
2016-05-11 13:04:24.661378132,0.211649
2016-05-11 13:04:24.661450628,0.211649
2016-05-11 13:04:24.661607044,0.211649"""

df = pd.read_csv(StringIO(text), parse_dates=[0])