Justin - 3 months ago
Python Question

Custom function to identify consecutive dates bottlenecks my program

I am working with a DataFrame where each row observation has an ordinal datetime object attached to it. I've written a function that looks through my DataFrame and identifies consecutively occurring days and the length of each run, with the following code:

def consecutiveCount(df):
    df = df.copy()
    cond1 = df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1
    cond2 = df['DATE_INT'].shift(1) - df['DATE_INT'] == -1

    cond3 = df['DATE_INT'].shift(-2) - df['DATE_INT'] == 2
    cond4 = df['DATE_INT'].shift(2) - df['DATE_INT'] == -2


I continue making these conditions in the same fashion, up to:

    cond55 = df['DATE_INT'].shift(-28) - df['DATE_INT'] == 28
    cond56 = df['DATE_INT'].shift(28) - df['DATE_INT'] == -28

    cond57 = df['DATE_INT'].shift(-29) - df['DATE_INT'] == 29
    cond58 = df['DATE_INT'].shift(29) - df['DATE_INT'] == -29


I then write the length of the 'run' of days to a column with the following code:

    df.loc[cond1 | cond2, 'CONSECUTIVE_COUNT'] = 2
    df.loc[cond3 | cond4, 'CONSECUTIVE_COUNT'] = 3


Again, I continue until I reach 'runs' of days of length 30:

    df.loc[cond55 | cond56, 'CONSECUTIVE_COUNT'] = 29
    df.loc[cond57 | cond58, 'CONSECUTIVE_COUNT'] = 30
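
For reference, this whole family of conditions can be generated in a loop; the sketch below is equivalent to the hand-written version (later assignments overwrite earlier ones, so the longest detected run wins), though it only removes the repetition, not the underlying inefficiency:

def consecutiveCount(df):
    df = df.copy()
    # k = 1..29 reproduces cond1/cond2 through cond57/cond58
    for k in range(1, 30):
        cond_fwd = df['DATE_INT'].shift(-k) - df['DATE_INT'] == k
        cond_bwd = df['DATE_INT'].shift(k) - df['DATE_INT'] == -k
        # a match at offset k means the row belongs to a run of length k + 1
        df.loc[cond_fwd | cond_bwd, 'CONSECUTIVE_COUNT'] = k + 1
    return df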


Finally I apply the function to particular groups of my DataFrame as follows:

df1 = df.groupby(['COUNTY_GEOID_YEAR','TEMPBIN']).apply(consecutiveCount)


I am sure there are much more efficient ways to write this code, but when I tested it on small subsets of my data it worked exactly how I wanted. By printing various strings throughout my script, I've identified that applying the function is the bottleneck.

Any help on writing the function more efficiently, or on speeding up applying it, would be great! Please let me know if I can provide any more info.

DSM
Answer

IIUC, you can use the shift-compare-cumsum pattern after applying your groupby, and then do a transform.
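
To see why this works, here is the pattern in isolation on a small made-up Series: a new run starts wherever a value is not exactly one more than its predecessor, and the cumulative sum of those break points labels the runs.

import pandas as pd

s = pd.Series([1, 2, 3, 7, 8, 12])
run_id = (s != s.shift() + 1).cumsum()
# run_id is [1, 1, 1, 2, 2, 3]: three runs of consecutive integers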

Assuming that your data looks something like this (simplifying a bit)

df = pd.DataFrame({"GEOID_YEAR": [2000]*10 + [2001]*4, "TEMPBIN": [1]*14,
                   "DATE_INT": [1,2,3,4,6,7,9,10,11,14] + list(range(14,18)),
                   "OTHER_COL": [2]*14})

or

    DATE_INT  GEOID_YEAR  OTHER_COL  TEMPBIN
0          1        2000          2        1
1          2        2000          2        1
2          3        2000          2        1
3          4        2000          2        1
4          6        2000          2        1
5          7        2000          2        1
6          9        2000          2        1
7         10        2000          2        1
8         11        2000          2        1
9         14        2000          2        1
10        14        2001          2        1
11        15        2001          2        1
12        16        2001          2        1
13        17        2001          2        1

then

df["cons_id"] = df.groupby(["GEOID_YEAR", "TEMPBIN"])["DATE_INT"].apply(
    lambda x: (x != x.shift() + 1).cumsum())
df["cons_count"] = (df.groupby(["GEOID_YEAR", "TEMPBIN", "cons_id"])
                    ["cons_id"].transform("count"))

gives us

In [78]: df
Out[78]: 
    DATE_INT  GEOID_YEAR  OTHER_COL  TEMPBIN  cons_id  cons_count
0          1        2000          2        1        1           4
1          2        2000          2        1        1           4
2          3        2000          2        1        1           4
3          4        2000          2        1        1           4
4          6        2000          2        1        2           2
5          7        2000          2        1        2           2
6          9        2000          2        1        3           3
7         10        2000          2        1        3           3
8         11        2000          2        1        3           3
9         14        2000          2        1        4           1
10        14        2001          2        1        1           4
11        15        2001          2        1        1           4
12        16        2001          2        1        1           4
13        17        2001          2        1        1           4
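
If you want to reproduce your original CONSECUTIVE_COUNT column, which was only filled for runs of at least two days, one option (using the column names above) is to mask the singletons with Series.where:

# keep the run length only where the run spans two or more days; NaN elsewhere
df["CONSECUTIVE_COUNT"] = df["cons_count"].where(df["cons_count"] >= 2)

Unlike the hand-written conditions, this handles runs of any length rather than stopping at 30.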