FranGoitia FranGoitia - 13 days ago 6
Python Question

Pandas. Replace NAN values according to variable distribution

I need to replace nan values in a column considering the distribution. The normalized frequency of the values are 0.5879336282383848, 0.37367433995536975 and 0.03839203180624546.
I was going to do something like:

for idx in nan_cols:
if random() < 0.03839203180624546:
df[idx][col] = 0
elif random() < 0.03839203180624546 + 0.37367433995536975:
df[idx][col] = 1
else:
df[idx][col] = 2


I'm thinking there must be some pandas's method or something that I'm missing.

Answer

If I've understood the question, you have a DataFrame containing some null data which you want to replace with values distributed in some known way.

The following approach, which utilises the DataFrame.stack and DataFrame.unstack methods, together with pd.cut, does what you want.

First, generate some dummy data representing your problem (here we have a 10x3 DataFrame populated by random data with np.nan values inserted at random locations):

_nr, _nc = 10, 3
_rnd = np.random.random((_nr, _nc))
_data = pd.DataFrame(columns=['col{}'.format(x) for x in range(_nc)], data=_rnd)
_stacked = _data.stack()
_nullidx = np.random.randint(0, _stacked.size-1, int(.80*_stacked.size))
_stacked.iloc[_nullidx] = np.nan
data = _stacked.unstack()

This generates something like the following:

print(data)

       col0      col1      col2
0  0.415485       NaN       NaN
1       NaN       NaN  0.799004
2       NaN  0.359693       NaN
3       NaN       NaN  0.536442
4  0.733635       NaN       NaN
5       NaN       NaN       NaN
6  0.574215       NaN       NaN
7       NaN  0.949404       NaN
8       NaN  0.384289  0.633768
9  0.607698  0.266648  0.639140

What we want to do is populate these np.nan values with either 0, 1 or 2, depending on draws from a uniform distribution on [0,1] with respect to the following frequencies:

freqs = 0.03839203180624546, 0.37367433995536975, 0.5879336282383848

To achieve this we stack, cut and then unstack:

stacked = data.copy().stack(dropna=False)
stacked[stacked.isnull()] = \
            pd.cut(np.random.random(stacked.isnull().sum()), 
                   np.cumsum(np.insert(freqs,0,0.)), labels=(0,1,2))
result = stacked.unstack()

This gives:

print(result)

       col0      col1      col2
0  0.415485  2.000000  2.000000
1  2.000000  2.000000  0.799004
2  1.000000  0.359693  2.000000
3  1.000000  2.000000  0.536442
4  0.733635  0.000000  1.000000
5  2.000000  2.000000  2.000000
6  0.574215  2.000000  2.000000
7  1.000000  0.949404  2.000000
8  2.000000  0.384289  0.633768
9  0.607698  0.266648  0.639140
Comments