chinmaykelkar chinmaykelkar - 4 months ago 7
Python Question

Sum a column of values including letters in Python

I have an input CSV file and need to add all the values in one of the columns, but the values are not plain integers and I'm not sure how to go about it.

The total output should be around 15k, which is the sum of the entire column. I am using pandas dataframe to store .csv file.

Here is the one of the columns in my input

.csv
file:

DAMAGE_PROPERTY
0K
0K
2.5K
2.5K
.25K
.25K
2.5K
25K
2.5K
.25K
25K
25K
250K
2.5K
25K
2.5K
2.5K
2.5K
0K
2.5K
.25K
2.5K
25K

Answer

I think you need first remove K by str.replace, then cast to float by astype and last sum:

print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum())
401.0

Then can multiple by 1000:

print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum() * 1000)
401000.0

If need add K:

print (str(df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum()) + 'K')
401.0K

EDIT by comment:

If need output in K:

print (df)
  DAMAGE_PROPERTY
0            2.5K
1            2.5K
2             25M

#create mask where values `M`
mask = df.DAMAGE_PROPERTY.str.contains('M')
print (mask)
0    False
1    False
2     True
Name: DAMAGE_PROPERTY, dtype: bool

#multiple by 1000 where is mask
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]','').astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000)
print (df)
   DAMAGE_PROPERTY
0              2.5
1              2.5
2          25000.0

print (df['DAMAGE_PROPERTY'].sum())
25005.0

print (str(df['DAMAGE_PROPERTY'].sum()) + 'K' )
25005.0K

If need output as number:

df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]','').astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000) * 1000
print (df)
   DAMAGE_PROPERTY
0           2500.0
1           2500.0
2       25000000.0

print (df['DAMAGE_PROPERTY'].sum())
25005000.0

EDIT1:

If there are values with B:

print (df)
  DAMAGE_PROPERTY
0            2.5K
1            2.5B
2             25M

maskM = df.DAMAGE_PROPERTY.str.contains('M')
print (maskM)
0    False
1    False
2     True
Name: DAMAGE_PROPERTY, dtype: bool

maskB = df.DAMAGE_PROPERTY.str.contains('B')
print (maskB)
0    False
1     True
2    False
Name: DAMAGE_PROPERTY, dtype: bool

df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KMB]','').astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskM, df.DAMAGE_PROPERTY*1000)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskB, df.DAMAGE_PROPERTY*1000000)
print (df)
   DAMAGE_PROPERTY
0              2.5
1        2500000.0
2          25000.0

print (df['DAMAGE_PROPERTY'])
0          2.5
1    2500000.0
2      25000.0
Name: DAMAGE_PROPERTY, dtype: float64