SerialDev - 3 months ago
Python Question

Unexpected behaviour when grouping outliers in pandas [Python]

My DataFrame is in this format:

df

            Count
DateTime
2015-01-16     10
2015-01-17     28
2015-01-18     26
2015-01-19     10
2015-01-20     24
2015-01-21     25


I'm experimenting with this function, meant to be used with groupby, which replaces outliers with the group mean:

def replaceit(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3 * std
    group[outliers] = mean  # or "group[~outliers].mean()"
    return group


Creating a copy of that dataframe as I want to use it elsewhere:

df2 = df


Let's see the output of df2:

df2

            Count
DateTime
2015-01-16     10
2015-01-17     28
2015-01-18     26
2015-01-19     10
2015-01-20     24
2015-01-21     25


Let's use the function:

df2 = replaceit(df2)

df2

DateTime
2015-01-16 10.000000
2015-01-17 28.000000
2015-01-18 26.000000
2015-01-19 10.000000
2015-01-20 24.000000
2015-01-21 25.000000


But now let's see the output of df:

df

                Count
DateTime
2015-01-16  10.000000
2015-01-17  28.000000
2015-01-18  26.000000
2015-01-19  10.000000
2015-01-20  24.000000
2015-01-21  25.000000


My question is, why is this happening?
How can I solve this issue?

Answer

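The problem is that df2 = df does not create a copy. It only binds a second name, df2, to the same DataFrame object that df refers to. Because replaceit modifies its argument in place, any change made through df2 also shows up in df.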
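
A minimal sketch of this aliasing. The way df is built here is an assumption (the question does not show how it was created), and the value 999 is only an illustrative in-place change:

```python
import pandas as pd

# Rebuild a frame shaped like the one in the question (construction is assumed).
df = pd.DataFrame(
    {"Count": [10, 28, 26, 10, 24, 25]},
    index=pd.date_range("2015-01-16", periods=6, name="DateTime"),
)

df2 = df  # plain assignment: df2 is a second name for the same object

# Both names point at one DataFrame.
same_object = df2 is df

# An in-place change made through df2 is therefore visible through df too.
df2.iloc[0, 0] = 999
value_seen_via_df = df.iloc[0, 0]

print(same_object, value_seen_via_df)
```

Checking `df2 is df` is a quick way to tell whether two names share one object.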

You need to make an explicit copy:

df2 = df.copy()
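
A short sketch of the effect of the copy, under the same assumptions as the question's data (the frame is rebuilt by hand here, and 999 is just an illustrative in-place change):

```python
import pandas as pd

# Rebuild a frame shaped like the one in the question (construction is assumed).
df = pd.DataFrame(
    {"Count": [10, 28, 26, 10, 24, 25]},
    index=pd.date_range("2015-01-16", periods=6, name="DateTime"),
)

df2 = df.copy()  # independent object with its own data (deep copy by default)

# Modify the copy in place.
df2.iloc[0, 0] = 999

# The original is left untouched.
is_separate_object = df2 is not df
original_value = df.iloc[0, 0]
copied_value = df2.iloc[0, 0]

print(is_separate_object, original_value, copied_value)
```

With the copy in place, a function that modifies its argument, such as replaceit, can be applied to df2 while df stays available unchanged for use elsewhere.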