Kurt Peek - 3 months ago
Python Question

In NLTK, get the number of occurrences of a trigram

I'd like to get the "commonly used phrases" from a text, defined as the trigrams that occur more than once. So far I have this:

import nltk

def get_words(string):
    # Split on runs of word characters, dropping punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(string)

string = "Hello, world. This is a dog. This is a cat."

words = get_words(string)

finder = nltk.collocations.TrigramCollocationFinder.from_words(words)
scored = finder.score_ngrams(nltk.collocations.TrigramAssocMeasures().raw_freq)

The resulting scored is:

[(('This', 'is', 'a'), 0.2), (('Hello', 'world', 'This'), 0.1), (('a', 'dog', 'This'), 0.1), (('dog', 'This', 'is'), 0.1), (('is', 'a', 'cat'), 0.1), (('is', 'a', 'dog'), 0.1), (('world', 'This', 'is'), 0.1)]

I've noticed that the number in each element of scored is the number of occurrences of the trigram divided by the total word count (in this case, 10). Is there a way to get the number of occurrences directly, without 'post-multiplying' by the word count?


You can get the number of occurrences using finder.ngram_fd.items():

# To get trigrams with their occurrence counts
# (list() is needed on Python 3, where items() returns a view)
trigrams = list(finder.ngram_fd.items())
print(trigrams)

# To get trigrams sorted by count, descending (ties broken alphabetically)
trigrams = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))
print(trigrams)
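
If you want to sanity-check those counts, or only keep the trigrams that occur more than once as the question asks, here is a minimal sketch in plain Python without NLTK. The helper name count_trigrams and the variable common are made up for illustration, and the sketch assumes the same \w+ tokenization as in the question. It should give the same counts as finder.ngram_fd for contiguous trigrams, but treat it as a cross-check rather than as what NLTK does internally.

```python
import re
from collections import Counter

def count_trigrams(text):
    # Hypothetical helper: tokenize on \w+ (same pattern as the question)
    words = re.findall(r"\w+", text)
    # Zip three staggered views of the word list to form contiguous trigrams
    return Counter(zip(words, words[1:], words[2:]))

counts = count_trigrams("Hello, world. This is a dog. This is a cat.")

# Keep only the "commonly used phrases": trigrams seen more than once
common = {trigram: n for trigram, n in counts.items() if n > 1}
print(common)
```

Since finder.ngram_fd is a FreqDist, which behaves like a Counter, the same dict comprehension should work on finder.ngram_fd.items() as well.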

You can find more related examples in the NLTK Collocations documentation.