Luke Ackerknecht - 1 year ago 117

Python Question

I need to winsorize two columns in my dataframe of 12 columns.

Say, I have columns 'A', 'B', 'C', and 'D', each with a series of values. Given that I cleaned some NaN columns, the number of columns was reduced from 100 to 80, but they are still indexed to 100 with gaps (e.g. row 5 is missing).

I want to transform only columns 'A' and 'B' via winsorize method. To do this, I must convert my columns to a np.array.

`import scipy.stats`

df['A','B','C','D'] = #some values per each column

ab_df = df['A','B']

X = scipy.stats.mstats.winsorize(ab_df.values, limits=0.01)

new_ab_df = pd.DataFrame(X, columns = ['A','B'])

df = pd.concat([df['C','D'], new_ab_df], axis=1, join='inner', join_axes=[df.index])

When I convert to a np.array, then back to a pd.DataFrame, it's len() is correct at 80 but my indexes have been reset to be 0->80. How can I ensure that my transform 'A' and 'B' columns are indexed correctly? I don't think I can use the apply(), which would preserve index order and simply swap out the values instead of my approach, which creates a transformed copy of my df with only 2 columns, then concats them to the rest of my non-transformed columns.

Answer Source

You can do this inplace to the original dataframe.

From the description of your question, it sounds like you are confusing rows and columns (i.e. you first say your dataframe has 12 columns, and then say the number of columns was reduced from 100 to 80).

It is always best to provide a minimal example of data in your question. Lacking this, here is some data based on my assumptions:

```
import numpy as np
import scipy.stats
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(7, 5), columns=list('ABCDE'))
df.iat[1, 0] = np.nan
df.iat[3, 1] = np.nan
df.iat[5, 2] = np.nan
>>> df
A B C D E
0 1.764052 0.400157 0.978738 2.240893 1.867558
1 NaN 0.950088 -0.151357 -0.103219 0.410599
2 0.144044 1.454274 0.761038 0.121675 0.443863
3 0.333674 NaN -0.205158 0.313068 -0.854096
4 -2.552990 0.653619 0.864436 -0.742165 2.269755
5 -1.454366 0.045759 NaN 1.532779 1.469359
6 0.154947 0.378163 -0.887786 -1.980796 -0.347912
```

My assumption is to drop any row with a NaN, and then winsorize.

```
mask = df.notnull().all(axis=1), ['A', 'B']
df.loc[mask] = scipy.stats.mstats.winsorize(df.loc[mask].values, limits=0.4)
```

I applied a high limit to the winsorize function so that the results are more obvious on this small dataset.

```
>>> df
A B C D E
0 0.400157 0.400157 0.978738 2.240893 1.867558
1 NaN 0.950088 -0.151357 -0.103219 0.410599
2 0.378163 0.400157 0.761038 0.121675 0.443863
3 0.333674 NaN -0.205158 0.313068 -0.854096
4 0.378163 0.400157 0.864436 -0.742165 2.269755
5 -1.454366 0.045759 NaN 1.532779 1.469359
6 0.378163 0.378163 -0.887786 -1.980796 -0.347912
```