kit kit - 2 months ago 8
Python Question

How to compare decimal numbers available in columns of pandas dataframe?

I want to compare decimal values which are available in two columns of pandas dataframe.

I have a dataframe:

data = {'AA' :{0:'-14.35',1:'632.0',2:'619.5',3:'352.35',4:'347.7',5:'100'},
'BB' :{0:'-14.3500',1:'632.0000',2:'619.5000',3:'352.3500',4:'347.7000',5:'200'}
}
df1 = pd.DataFrame(data)
print df1


dataframe look like this :

AA BB
0 -14.35 -14.3500
1 632.0 632.0000
2 619.5 619.5000
3 352.35 352.3500
4 347.7 347.7000
5 100 200


I want to compare
AA
and
BB
columns. As shown in above dataframe, values of both columns are same except 5th row. only issue is the trailing zeros.

If both
AA
and
BB
columns are same then I want result of these comparison in third column as a
Result
i.e.
True
or
False
.

Expected Result :

AA BB Result
0 -14.35 -14.35 True
1 632.0 632.0 True
2 619.5 619.5 True
3 352.35 352.35 True
4 347.7 347.7 True
5 100 200 False


How I can compare these decimal values?

Answer

You need cast column to float by astype and then compare columns, because type of values in columns is string. Then use mask and as condition use boolean column Result:

print (type(df1.ix[0,'AA']))
<class 'str'>

print (type(df1.ix[0,'BB']))
<class 'str'>

df1['Result'] = df1.AA.astype(float) == df1.BB.astype(float)
df1.BB = df1.BB.mask(df1.Result,df1.AA)
print (df1)
       AA      BB Result
0  -14.35  -14.35   True
1   632.0   632.0   True
2   619.5   619.5   True
3  352.35  352.35   True
4   347.7   347.7   True
5     100     200  False

Another solution with ix:

df1['Result'] = df1.AA.astype(float) == df1.BB.astype(float)
df1.ix[df1.Result, 'BB'] = df1.AA
print (df1)
       AA      BB Result
0  -14.35  -14.35   True
1   632.0   632.0   True
2   619.5   619.5   True
3  352.35  352.35   True
4   347.7   347.7   True
5     100     200  False

Timings:

#len(df) = 6k
df1 = pd.concat([df1]*1000).reset_index(drop=True)

In [31]: %timeit df1.ix[df1.Result, 'BB'] = df1.AA
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.19 ms per loop

In [33]: %timeit df1.BB = df1.BB.mask(df1.Result,df1.AA)
1000 loops, best of 3: 900 ┬Ás per loop