I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK.
Unfortunately, the whole documentation is rather sparse.
What I'm trying to do is this: I parse a text into a list of tri-gram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.
I'm pretty sure though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example:
ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!")
freq_dist = nltk.FreqDist(ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
prob_sum = 0
for i in kneser_ney.samples():
prob_sum += kneser_ney.prob(i)
The Kneser-Ney (also have a look at Goodman and Chen for a great survey on different smoothing techniques) is a quite complicated smoothing which only a few package that I am aware of got it right. Not aware of any python implementation, but you can definitely try SRILM if you just need probabilities, etc.