Charles Charles - 24 days ago 6
Python Question

Subtract multiple columns based on foreign key in pandas

I'm trying to calculate the difference between an object and it's benchmark. I have a dataset containing daily records for all of the objects and their corresponding values that looks like this:

obj_df
date id value_a value_b value_c value_d benchmark_id
01/21/2015 abc 10 41 19 22 efg
01/22/2015 abc 15 43 11 21 efg
01/21/2015 xyz 16 45 13 26 tuv
01/22/2015 xyz 13 48 12 22 tuv
01/21/2015 tru 10 39 15 21 efg
01/21/2015 tru 11 37 13 20 efg


I also have the data about the benchmarks. The value columns are shared between dataframes. The id in the benchmark set corresponds to a benchmark id in the original object dataframe.

bm_df
date id value_a value_b value_c value_d
01/21/2015 efg 12 40 12 20
01/22/2015 efg 15 41 14 21
01/21/2015 tuv 14 42 11 19
01/22/2015 tuv 13 43 19 17


I'm trying to find a simple way to return a dataframe that gives me the difference between the object values and the corresponding benchmark value to get a dataframe that looks something like this.

diff_df
date id diff_a diff_b diff_c diff_d benchmark_id
01/21/2015 abc -2 1 7 2 efg
01/22/2015 abc 0 2 -3 0 efg
01/21/2015 xyz 2 3 2 7 tuv
01/22/2015 xyz 0 5 -7 5 tuv
01/21/2015 tru -4 -3 4 2 efg
01/21/2015 tru -2 -6 -6 3 efg


A few things to note:

- There are more objects than benchmarks, so the indexes will not be the same size.

- Every object has a benchmark.

- I don't particularly care about the original values. Just the difference.

- Some benchmarks correspond to more than one object. For example both 'abc' and 'tru' use 'efg' as the benchmark.

Answer

Steps:

Perform merge:

df = obj_df.merge(bm_df, left_on=['benchmark_id', 'date'], right_on=['id', 'date'])    \
           .drop(['id_y'], 1).set_index(['date'])

Helper function to find column index locations by inputting the starting and ending column names:

def col_locate(df, start, end):
    start_loc = df.columns.get_loc(start)
    end_loc = df.columns.get_loc(end)
    return list(range(start_loc, end_loc+1))

fir, sec = col_locate(df,'value_a_x','value_d_x'), col_locate(df,'value_a_y','value_d_y')

Subtract the values from the objectDFand the benchmarkDF:

df_diff = pd.DataFrame(df.iloc[:, fir].values - df.iloc[:, sec].values, 
                       columns=list('abcd'), index=df.index).add_prefix('diff_')

Finally, Concatenate them columnwise:

pd.concat([df[['id_x', 'benchmark_id']], df_diff], axis=1)

enter image description here

Note: Updated DF used to arrive at the result.