xvtk xvtk - 1 year ago 143
Python Question

Random Number from Histogram

Suppose I create a histogram using scipy/numpy, so I have two arrays: one for the bin counts, and one for the bin edges. If I use the histogram to represent a probability distribution function, how can I efficiently generate random numbers from that distribution?


It's probably what np.random.choice does in @Ophion's answer, but you can construct a normalized cumulative density function, then choose based on a uniform random number:

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)
hist, bins = np.histogram(data, bins=50)

bin_midpoints = bins[:-1] + np.diff(bins)/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]

plt.hist(data, 50)
plt.hist(random_from_cdf, 50)

enter image description here

A 2D case can be done as follows:

data = np.column_stack((np.random.normal(scale=10, size=1000),
                        np.random.normal(scale=20, size=1000)))
x, y = data.T                        
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(50, 50))
x_bin_midpoints = x_bins[:-1] + np.diff(x_bins)/2
y_bin_midpoints = y_bins[:-1] + np.diff(y_bins)/2
cdf = np.cumsum(hist.ravel())
cdf = cdf / cdf[-1]

values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
new_x, new_y = random_from_cdf.T

plt.subplot(121, aspect='equal')
plt.hist2d(x, y, bins=(50, 50))
plt.subplot(122, aspect='equal')
plt.hist2d(new_x, new_y, bins=(50, 50))

enter image description here