JonghoKim JonghoKim - 3 months ago 24
Python Question

The fastest way to update (partial sum of elements with complex conditions) the pandas dataframe

I try to update a pandas dataframe which has 3 million rows. At the below, I reduced my problem into a more simple problem. In short, it does add values in a cummulative sense.

But, this function takes too long time for me like more than 10 hours in a real problem. Is there any room for speeds up? Should I update it only at the last?

Can we update the pandas dataframe with a more faster way than a iterrows()?

Can we select multiple rows by their index and then updates?

def set_r(group, i, colname, add):
if colname in group:
prev = group.iloc[i][colname]
if math.isnan(prev):
group.set_value(i, colname, add)
else:
group.set_value(i, colname, prev+add)
else:
group.set_value(i, colname, add)

def set_bl_info(group, i, r, bl_value, timeframe, clorca, bl_criteria):

group.set_value(i, timeframe + '_' + bl_criteria, True)
colname = timeframe + '_' + clorca + '_' + 'bb_count_'+ bl_criteria
set_r(group, i, colname, 1)

def bl_assign(days, bl_key, bl_value, group, bl_p05, bl_p01):
print bl_key
sub_group = group[(group.pledged_date >= bl_value[0]) & (group.pledged_date <= bl_value[1])]
coexisting_icl = sub_group[(sub_group.project_category == bl_value[2]) & (sub_group.cluster == bl_value[3])]

for i, r in coexisting_icl.iterrows():
set_bl_info(group, i, r, bl_value, 'coexisting', 'icl','p1')


# main function
bl_assign(days, bl_key, bl_value, group, bl_p05, bl_p01)


For more simplicity, my problem is something like below:

A B C
0 0 0 False
1 7 0 True
2 8 0 True
3 5 0 True


Update B column if C is true with sum of A column's elements

A B C
0 0 0 False
1 7 20 True
2 8 20 True
3 5 20 True


After then, if D is also true then update B with sum of E in cumulatively

A B C D E
0 0 0 False False 1
1 7 20 True False 1
2 8 20 True True 1
3 5 20 True True 1

A B C D E
0 0 0 False False 1
1 7 20 True False 1
2 8 22 True True 1
3 5 22 True True 1

Answer

Update B column if C is true with sum of A column's elements

import numpy as np

df['B'] = np.where(df.C, df.A.sum(), 0)

After then, if D is also tru then update B with the sum of E (using the comment to the question above)

df.B = df.B + np.where(df.D, (df.E * df.D.astype(int)).sum(), 0)

So, at the end you have

>>> df
   A      C   B  E      D
0  0  False   0  1  False
1  7   True  20  1  False
2  8   True  22  1   True
3  5   True  22  1   True