codingknob codingknob - 3 months ago 4
Python Question

+= operation with non existing dataframes

df_pairs:

city1 city2
0 sfo yyz
1 sfo yvr
2 sfo dfw
3 sfo ewr


output of df_pairs.to_dict('records'):

[{'city1': 'sfo', 'city2': 'yyz'},
{'city1': 'sfo', 'city2': 'yvr'},
{'city1': 'sfo', 'city2': 'dfw'},
{'city1': 'sfo', 'city2': 'ewr'}]


data_df:

city 2016-02-02 00:00:00 2016-02-05 00:00:00 2016-02-01 00:00:00 2016-02-04 00:00:00 2016-02-03 00:00:00
0 sfo -33.63 -62.34 -35.70 -31.84 -33.87
1 yyz -24.31 -51.17 -22.07 -31.00 -23.00
2 yvr -24.31 -51.17 -22.07 -31.00 -23.00
3 dfw -32.17 -43.77 -34.84 0.27 -11.49
4 ewr -28.87 -59.66 -28.40 -32.94 -29.06


output of data_df.to_dict('records')

[{'city': 'sfo',
Timestamp('2016-02-02 00:00:00'): -33.63,
Timestamp('2016-02-05 00:00:00'): -62.34,
Timestamp('2016-02-01 00:00:00'): -35.7,
Timestamp('2016-02-04 00:00:00'): -31.84,
Timestamp('2016-02-03 00:00:00'): -33.87},
{'city': 'yyz',
Timestamp('2016-02-02 00:00:00'): -24.31,
Timestamp('2016-02-05 00:00:00'): -51.17,
Timestamp('2016-02-01 00:00:00'): -22.07,
Timestamp('2016-02-04 00:00:00'): -31.0,
Timestamp('2016-02-03 00:00:00'): -23.0},
{'city': 'yvr',
Timestamp('2016-02-02 00:00:00'): -24.31,
Timestamp('2016-02-05 00:00:00'): -51.17,
Timestamp('2016-02-01 00:00:00'): -22.07,
Timestamp('2016-02-04 00:00:00'): -31.0,
Timestamp('2016-02-03 00:00:00'): -23.0},
{'city': 'dfw',
Timestamp('2016-02-02 00:00:00'): -32.17,
Timestamp('2016-02-05 00:00:00'): -43.77,
Timestamp('2016-02-01 00:00:00'): -34.84,
Timestamp('2016-02-04 00:00:00'): 0.27,
Timestamp('2016-02-03 00:00:00'): -11.49},
{'city': 'ewr',
Timestamp('2016-02-02 00:00:00'): -28.87,
Timestamp('2016-02-05 00:00:00'): -59.66,
Timestamp('2016-02-01 00:00:00'): -28.4,
Timestamp('2016-02-04 00:00:00'): -32.94,
Timestamp('2016-02-03 00:00:00'): -29.06}]


So I have a df named
df_pairs
. For every pair in
df_pairs
, I want to lookup city1 and city2 in
data_df
, subtract one from the other, take the sign of the difference time series, separate positive and negative sign values, separate positive and negative difference values and calculate sums on each across the columns of data_df.

diff_df_sign_pos = diff_df_sign_neg = diff_df_pos = diff_df_neg = 0

for i in range(0,len(data_df.columns)):
a = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city1', right_index=True, how='left').set_index(['city1', 'city2'])
b = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city2', right_index=True, how='left').set_index(['city1', 'city2'])
diff_df = b - a
diff_df_sign = np.sign(diff_df)
diff_df_sign_pos+= diff_df_sign.clip(lower=0)
diff_df_sign_neg+= diff_df_sign.clip(upper=0)
diff_df_pos+= diff_df.clip(lower=0)
diff_df_neg+= diff_df.clip(upper=0)


If you run the above code, you will see that the final values for
diff_df_sign_pos
,
diff_df_sign_neg
,
diff_df_pos
and
diff_df_neg
are NaN's.

For example, the end result for
diff_df_sign_pos
should look like:

2016-02-03 00:00:00
city1 city2
sfo yyz 5.0
yvr 5.0
dfw 5.0
ewr 4.0


This tells us that all 5 of the differences between yyz, yvr, dfw and sfo were positive.

Answer

Why don't you simply do this:

df_city1 = pd.merge(df_pairs['city1'], data_df, left_on='city1', right_on='city', how='left')
df_city2 = pd.merge(df_pairs['city2'], data_df, left_on='city2', right_on='city', how='left')
diff = df_city2.subtract(df_city1, fill_value=0)
pos_sum = diff[diff >= 0].sum(axis=1)
neg_sum = diff[diff <  0].sum(axis=1)

Instead of looping over all your columns, merging 2*(number of columns) times, not to mention indexing, then that complicated bit with np.sign and .clip... Your df_pairs and data_df have a one-to-one correspondence, right?

Comments