qazplok11 - 1 year ago 67

Python Question

Not sure if this belongs in statistics (I may be missing some background knowledge on this topic), but I am trying to use python to achieve this. I essentially just have a list of integers:

`data = [300,244,543,1011,300,125,300 ... ]`

And I would like to know the probability of a value occurring given this data.

I graphed histograms of the data using matplotlib and obtained these:

I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need be able to say how likely each value is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose from or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems like different probability density function will fit the data differently. Given the shape of the histograms I'm not sure how to decide which to use.

Sorry again if this belongs in math or statistics, I can move it if necessary.

Edit: Adding info from earlier comment, thanks @Kevin J. Chase.

In the first graph, the numbers represent the amount of characters in a sequence. In the second graph, it's a measured amount of time in miliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In graph1, I have a few million sequences of different lengths. Would like to know probability of a 200 length, for example.

Answer Source

Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is of the discrete kind. Here, kernel density estimation can also be useful and can be seen as a smoothing technique for the situations where you don't have a sufficient amount of samples for all values of the discrete variable.

The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:

```
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from sklearn.neighbors import KernelDensity
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
10 + np.random.randn(30, 1)))
# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = mlab.normpdf(x, 5, 1) * 0.25 + mlab.normpdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)
# Plot the data using a normalized histogram
plt.hist(data, 50, normed=True)
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Plot the estimated densty
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)
# Show the plots
plt.show()
```

This will produce the following plot, where the true distribution is shown in blue, the histogram is shown in green, and the PDF estimated using KDE is shown in red:

As you can see, in this situation, the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, histogram might produce a good estimate as well.

The parameters you can tune in case of KDE are the *kernel* and the *bandwidth*. You can think about the kernel as the building block for the estimated PDF, and several kernel functions are available in Scikit Learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth allows you to adjust the bias-variance trade-off. Larger bandwidth will result in increased bias, which is good if you have less data samples. Smaller bandwidth will increase variance (fewer samples are included into the estimation), but will give a better estimate when more samples are available.

To get probability from the estimated PDF:

```
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
N = 100
step = (end - start) / (N - 1)
x = np.linspace(start, end, N)[:, np.newaxis]
kd_vals = np.exp(kd.score_samples(x))
probability = np.sum(kd_vals * step)
print(probability)
```