Luke Nisbet Luke Nisbet - 1 year ago 95
Python Question

Looping IF statements for each row in a dataframe in pandas

Hi I am new to using pandas coming from a SAS background and I am trying to segment a continuous variable into bands using the following code.

var_range = df['BILL_AMT1'].max() - df['BILL_AMT1'].min()
a= 10
for i in range(1,a):
inc = var_range/a
lower_bound = df['BILL_AMT1'].min() + (i-1)*inc
print('Lower bound is '+str(lower_bound))
upper_bound = df['BILL_AMT1'].max() + (i)*inc
print('Upper bound is '+str(upper_bound))
if (lower_bound <= df['BILL_AMT1'] < upper_bound):
df['bill_class'] = i

I am expecting the code to check if the value of
is within the current loops boundings and set a

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think that the if condition is evaluating correctly but the error is due to assigning a new column a value of the for loops counter.

Can anyone explain what is going wrong or suggest an alternative.

Thanks for any help.

Answer Source

To avoid the ValueError, change

if (lower_bound <= df['BILL_AMT1'] < upper_bound):
    df['bill_class'] = i


mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)
df.loc[mask, 'bill_class'] = i

The chained comparison (lower_bound <= df['BILL_AMT1'] < upper_bound) is equivalent to

(lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound)

The and operator causes the two boolean Series (lower_bound <= df['BILL_AMT1']), (df['BILL_AMT1'] < upper_bound) to be evaluated in a boolean context -- i.e. reduced to a single boolean value. Pandas refuses to reduce Series to a single boolean value.

Instead, to return a boolean Series, use the & operator instead of and:

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

and then to assign values to the bill_class column where mask is True, use df.loc:

df.loc[mask, 'bill_class'] = i

To bin the data in df['BILL_AMT1'], you could remove the Python for-loop entirely, and as DSM suggests, use pd.cut:

df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False)+1
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download