ch36r5s - 3 months ago
Python Question

Python & pandas: Calculation between rows based on certain values in columns of a DataFrame

I have a large DataFrame (called df_NoMissing) with thousands of rows, and I need to run calculations and analysis on it.

   NoDemande    NoUsager  Sens  IdVehiculeUtilise  Fait  HeureArriveeSurSite  HeureEffective       Periods
0  42196000013  000001    +     287Véh             1     11/07/2015 08:02:07  11/07/2015 08:02:13  Matin
1  42196000013  000001    -     287Véh             1     11/07/2015 08:17:09  11/07/2015 08:17:13  Matin
2  42196000002  000314    +     263Véh             1     11/07/2015 09:37:43  11/07/2015 09:53:37  Matin
3  42196000016  002372    +     287Véh             1     11/07/2015 09:46:42  11/07/2015 10:01:39  Matin
4  42196000015  000466    +     287Véh             1     11/07/2015 09:46:42  11/07/2015 10:01:39  Matin
5  42196000002  000314    -     263Véh             1     11/07/2015 10:25:17  11/07/2015 10:38:11  Matin
6  42196000015  000466    -     287Véh             1     11/07/2015 10:48:51  11/07/2015 10:51:30  Matin
7  42196000016  002372    -     287Véh             1     11/07/2015 11:40:56  11/07/2015 11:41:01  Matin
8  42196000004  002641    +     263Véh             1     11/07/2015 13:39:29  11/07/2015 13:52:50  Soir
9  42196000004  002641    -     263Véh             1     11/07/2015 13:59:56  11/07/2015 14:07:41  Soir


What I want is, for each pair of rows that have the same values in the columns NoDemande, NoUsager and Periods but different values in the column Sens, to compute the difference between the columns HeureArriveeSurSite and HeureEffective. Because the result doesn't fit the shape of the current DataFrame, it will be saved in a new DataFrame.




I tried to split the DataFrame on Sens so I could do the subtraction directly, but it doesn't work at all.

df_new = pd.DataFrame(columns=['NoDemande', 'NoUsager', 'Periods', 'DureeTrajet'])
df1 = df_NoMissing[(df_NoMissing['Sens'] == '+') & (df_NoMissing['Periods'] == 'Matin')]
df2 = df_NoMissing[(df_NoMissing['Sens'] == '-') & (df_NoMissing['Periods'] == 'Matin')]

df_new['DureeTrajet'] = df2['HeureArriveeSurSite'].values-df1['HeureEffective'].values


This one returned:
ValueError: operands could not be broadcast together with shapes (1478,) (1479,)


I also tried the long way, spelling out exactly what I want each time:

df1.loc[df1['NoDemande'] == '42196000015','HeureEffective'] - df2.loc[df2['NoDemande'] == '42196000015','HeureArriveeSurSite']


But this one came back with:

4 NaT
6 NaT
dtype: timedelta64[ns]


So what should I do to get what I want?




EDIT

The output will look like:

   NoDemande    NoUsager  Periods  DureeTrajet
0  42196000013  000001    Matin    00:14:54
1  42196000002  000314    Matin    00:31:40
2  42196000016  002372    Matin    00:39:23
3  42196000015  000466    Matin    00:47:12
4  42196000004  002641    Soir     00:07:06


Any help would be really appreciated.

Answer

Okay, starting with your DataFrame as provided (called df here), let's set an index on the grouping columns and pivot the Sens values into columns:

temp = df.set_index(['NoDemande', 'NoUsager', 'Periods']).pivot(columns='Sens')
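To make the reshape concrete, here is a minimal sketch on two requests from the question's sample. It assumes the time columns have been parsed with pd.to_datetime (the day/month/year format string is a guess from the sample), and it uses set_index followed by unstack, which should produce the same wide layout as the pivot call above. The names sample and wide are only for illustration.

```python
import pandas as pd

# Two requests copied from the question's sample (only the columns needed here).
sample = pd.DataFrame(
    {
        "NoDemande": ["42196000013", "42196000013", "42196000004", "42196000004"],
        "NoUsager": ["000001", "000001", "002641", "002641"],
        "Sens": ["+", "-", "+", "-"],
        "HeureArriveeSurSite": [
            "11/07/2015 08:02:07", "11/07/2015 08:17:09",
            "11/07/2015 13:39:29", "11/07/2015 13:59:56",
        ],
        "HeureEffective": [
            "11/07/2015 08:02:13", "11/07/2015 08:17:13",
            "11/07/2015 13:52:50", "11/07/2015 14:07:41",
        ],
        "Periods": ["Matin", "Matin", "Soir", "Soir"],
    }
)

# Assumption: timestamps are day/month/year; adjust the format if that is wrong.
for col in ("HeureArriveeSurSite", "HeureEffective"):
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M:%S")

# Move Sens from the row index into the columns: one row per request,
# and one column per (time column, Sens value) combination.
wide = sample.set_index(["NoDemande", "NoUsager", "Periods", "Sens"]).unstack("Sens")

print(wide.columns.tolist())
```

Each time column now appears once per Sens value, so a single request's "+" and "-" timestamps sit on the same row and can be subtracted directly.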

Then we take the appropriate difference (following your code):

duration = (temp['HeureArriveeSurSite', '-'] - temp['HeureEffective', '+']).to_frame(name='DureeTrajet').reset_index()

That then gives you:

     NoDemande  NoUsager Periods  DureeTrajet
0  42196000002       314   Matin     00:31:40
1  42196000004      2641    Soir     00:07:06
2  42196000013         1   Matin     00:14:56
3  42196000015       466   Matin     00:47:12
4  42196000016      2372   Matin     01:39:17
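
As for why the two attempts in the question failed: the shapes (1478,) and (1479,) suggest that at least one row has no partner with the opposite Sens, and the NaT result comes from pandas aligning the two Series on their index labels (4 and 6) rather than on NoDemande. A hedged alternative that pairs rows by key and tolerates unpaired rows is an inner merge. The sketch below uses two requests from the sample plus one hypothetical unpaired request (42196000099, invented for illustration); the names plus, minus, paired and result are likewise illustrative.

```python
import pandas as pd

cols = ["NoDemande", "NoUsager", "Sens", "HeureArriveeSurSite", "HeureEffective", "Periods"]
rows = [
    ("42196000013", "000001", "+", "11/07/2015 08:02:07", "11/07/2015 08:02:13", "Matin"),
    ("42196000013", "000001", "-", "11/07/2015 08:17:09", "11/07/2015 08:17:13", "Matin"),
    ("42196000015", "000466", "+", "11/07/2015 09:46:42", "11/07/2015 10:01:39", "Matin"),
    ("42196000015", "000466", "-", "11/07/2015 10:48:51", "11/07/2015 10:51:30", "Matin"),
    # Hypothetical row with no matching "-" row (not in the original data).
    ("42196000099", "000999", "+", "11/07/2015 15:00:00", "11/07/2015 15:05:00", "Soir"),
]
df = pd.DataFrame(rows, columns=cols)

# Assumption: timestamps are day/month/year; adjust the format if that is wrong.
for col in ("HeureArriveeSurSite", "HeureEffective"):
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y %H:%M:%S")

keys = ["NoDemande", "NoUsager", "Periods"]
plus = df.loc[df["Sens"] == "+", keys + ["HeureEffective"]]
minus = df.loc[df["Sens"] == "-", keys + ["HeureArriveeSurSite"]]

# The inner merge pairs rows by key, not by position or index label,
# and silently drops requests that lack one of the two directions.
paired = plus.merge(minus, on=keys, how="inner")
paired["DureeTrajet"] = paired["HeureArriveeSurSite"] - paired["HeureEffective"]
result = paired[keys + ["DureeTrajet"]]

print(result)
```

To list the unpaired requests instead of dropping them, the same merge can be run with how="outer" and indicator=True. Note also that the pivot approach needs each (NoDemande, NoUsager, Periods, Sens) combination to be unique, since reshaping raises an error on duplicate entries.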