FranGoitia - 8 months ago 54

Python Question

I need to replace nan values in a column considering the distribution. The normalized frequency of the values are 0.5879336282383848, 0.37367433995536975 and 0.03839203180624546.

I was going to do something like:

`for idx in nan_cols:`

if random() < 0.03839203180624546:

df[idx][col] = 0

elif random() < 0.03839203180624546 + 0.37367433995536975:

df[idx][col] = 1

else:

df[idx][col] = 2

I'm thinking there must be some pandas's method or something that I'm missing.

Answer Source

If I've understood the question, you have a DataFrame containing some null data which you want to replace with values distributed in some known way.

The following approach, which utilises the `DataFrame.stack`

and `DataFrame.unstack`

methods, together with `pd.cut`

, does what you want.

First, generate some dummy data representing your problem (here we have a 10x3 DataFrame populated by random data with `np.nan`

values inserted at random locations):

```
_nr, _nc = 10, 3
_rnd = np.random.random((_nr, _nc))
_data = pd.DataFrame(columns=['col{}'.format(x) for x in range(_nc)], data=_rnd)
_stacked = _data.stack()
_nullidx = np.random.randint(0, _stacked.size-1, int(.80*_stacked.size))
_stacked.iloc[_nullidx] = np.nan
data = _stacked.unstack()
```

This generates something like the following:

```
print(data)
col0 col1 col2
0 0.415485 NaN NaN
1 NaN NaN 0.799004
2 NaN 0.359693 NaN
3 NaN NaN 0.536442
4 0.733635 NaN NaN
5 NaN NaN NaN
6 0.574215 NaN NaN
7 NaN 0.949404 NaN
8 NaN 0.384289 0.633768
9 0.607698 0.266648 0.639140
```

What we want to do is populate these `np.nan`

values with either 0, 1 or 2, depending on draws from a uniform distribution on [0,1] with respect to the following frequencies:

```
freqs = 0.03839203180624546, 0.37367433995536975, 0.5879336282383848
```

To achieve this we stack, cut and then unstack:

```
stacked = data.copy().stack(dropna=False)
stacked[stacked.isnull()] = \
pd.cut(np.random.random(stacked.isnull().sum()),
np.cumsum(np.insert(freqs,0,0.)), labels=(0,1,2))
result = stacked.unstack()
```

This gives:

```
print(result)
col0 col1 col2
0 0.415485 2.000000 2.000000
1 2.000000 2.000000 0.799004
2 1.000000 0.359693 2.000000
3 1.000000 2.000000 0.536442
4 0.733635 0.000000 1.000000
5 2.000000 2.000000 2.000000
6 0.574215 2.000000 2.000000
7 1.000000 0.949404 2.000000
8 2.000000 0.384289 0.633768
9 0.607698 0.266648 0.639140
```