swepab swepab - 7 months ago 18
Python Question

Group processing in Pandas Python

My questions revolves around applying logic in a Pandas DataFrame of the appearance below:


ID yyyymm value1 value2
1 201501 0 123
1 201502 1 113
1 201503 3 115
2 201506 0 0
2 201507 0 0
2 201508 1 115
2 201509 0 0
3 201503 0 0
3 201504 0 0
3 201505 0 0


What I want to do is to count the time diff for each of the IDs dependent on the consecutive order of the time value in the yyyymm-variable. I want to start counting at the beginning of each id and when value1 > 0 then save the time stamp. Same with value2, when it goes > 0 I want to save the time stamp, ending up with table 2 below.


ID time_value1 value1 time_value2 value2
1 1 1 0 123
2 2 1 2 115


I have a SAS background and its straightforward doing this with by group-processing.

Code for table below:

data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3,3],
'yyyymm':[201501,201502,201503,201506,201507,201508,201509,201503,201504,201505],
'value1':[0,1,3,0,0,1,0,0,0,0],
'value2':[123,113,115,0,0,115,0,0,0,0]})


Many thanks in advance for your time!

Answer

This will order by date, group by ID and find a row where either value1 or value2 > 0, save that row into another dataframe and move onto the next ID group. If you want more than 1 save per group ID, just remove the break

I didnt know what value you wanted in the "time_value1" or "time_value2" columns of the final dataframe, but you can easily edit that variable assignment to whatever you want

import pandas as pd

data = pd.DataFrame({'ID':[1,1,1,2,2,2,2,3,3,3],
                'yyyymm':[201501,201502,201503,201506,201507,201508,201509,201503,201504,201505],
                'value1':[0,1,3,0,0,1,0,0,0,0],
                'value2':[123,113,115,0,0,115,0,0,0,0]})

final = pd.DataFrame(columns=["ID", "time_value1", "value1", "time_value2", "value2"])

def findTimes(df):
    for index, row in df.iterrows():
        if row["value1"] > 0 or row["value2"] > 0:

            final.loc[index,"ID"] = row["ID"]
            final.loc[index,"time_value1"] = row["value1"]
            final.loc[index,"value1"] = row["value1"]
            final.loc[index,"time_value2"] = row["value2"]
            final.loc[index,"value2"] = row["value2"]

            break

data.sort_values("yyyymm").groupby("ID").apply(lambda x: findTimes(x))