piRSquared - 1 month ago
Python Question

Most efficient way to randomly null out values in a DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))
df



How can I nullify ~20% of these values at random?


Answer

You could use numpy.random.choice to generate a boolean mask and pass it to df.mask, which replaces the values where the mask is True with NaN:

import numpy as np

mask = np.random.choice([True, False], size=(10,10), p=[.2,.8])

df.mask(mask)
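
As a quick sanity check, here is a minimal self-contained sketch of the same approach. The seed value and the names masked and frac_null are my own additions for illustration, not part of the answer. It shows that df.mask returns a new DataFrame (df itself is unchanged) and that the nulled fraction is only approximately 20%, because each cell is drawn independently.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed (an assumption), only so the run is repeatable

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

# Each cell is True (meaning "null this out") with probability 0.2, independently.
mask = np.random.choice([True, False], size=df.shape, p=[.2, .8])

# DataFrame.mask replaces values where the condition is True with NaN
# and returns a new DataFrame; df is left untouched.
masked = df.mask(mask)

# Fraction of nulled cells: close to 0.2 on average, but not exactly 0.2.
frac_null = masked.isna().values.mean()
print(frac_null)
```

To keep the result, assign it (df = df.mask(mask)) rather than relying on the call alone.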

In one line (and with size based on the df as @root suggests):

df.mask(np.random.choice([True, False], size=df.shape, p=[.2,.8]))
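
If you need exactly 20% of the cells nulled rather than roughly 20%, one possible variation (not from the answer above, just a sketch) is to build a mask with a fixed number of True values and shuffle it. The names n_null, flat and exact are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

# Number of cells to null out: 20 of the 100 cells in this frame.
n_null = int(round(0.2 * df.size))

# Flat mask holding exactly n_null True values, shuffled in place.
flat = np.zeros(df.size, dtype=bool)
flat[:n_null] = True
np.random.shuffle(flat)

# Reshape to the frame's shape and apply it the same way as before.
exact = df.mask(flat.reshape(df.shape))
print(exact.isna().values.sum())  # 20
```

The trade-off is an extra shuffle over df.size elements; for the ~20% case in the question the np.random.choice version is simpler.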

Speed tested using timeit at ~770 μs per loop:

$ python -m timeit -n 10000 \
        -s "import pandas as pd;import numpy as np;df=pd.DataFrame(np.ones((10,10))*2)" \
        "df.mask(np.random.choice([True,False], size=df.shape, p=[.2,.8]))"
10000 loops, best of 3: 770 usec per loop
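
If you would rather time it from inside Python, something like the following should be roughly equivalent. This is only a sketch: the loop count is reduced to keep it quick (my choice), and the numbers will vary with machine and pandas/numpy versions, so don't expect to reproduce 770 usec exactly.

```python
import timeit

# Same setup and statement as the command-line run above.
setup = (
    "import pandas as pd; import numpy as np; "
    "df = pd.DataFrame(np.ones((10, 10)) * 2)"
)
stmt = "df.mask(np.random.choice([True, False], size=df.shape, p=[.2, .8]))"

n_loops = 200  # fewer loops than the 10000 used above, to keep the run short

# Three repeats of n_loops executions each; report the best (minimum) repeat
# converted to microseconds per loop.
times = timeit.repeat(stmt, setup=setup, repeat=3, number=n_loops)
best_usec = min(times) / n_loops * 1e6
print("%d loops, best of 3: %.0f usec per loop" % (n_loops, best_usec))
```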