gran_profaci gran_profaci - 1 year ago 96
Python Question

Computing N Grams using Python

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone
Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste.
That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

#!/usr/bin/env python
# File:
def N_Gram(N,text):
NList = [] # start with an empty list
if N> 1:
space = " " * (N-1) # add N - 1 spaces
text = space + text + space # add both in front and back
# append the slices [i:i+N] to NList
for i in range( len(text) - (N - 1) ):
return NList # return the list
# test code
for i in range(5):
print N_Gram(i+1,"text")
# more test code
nList = N_Gram(7,"Here is a lot of text to print")
for ngram in iter(nList):
print '"' + ngram + '"'

But it works for all the n-grams within a word, when I want it from between words as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me out as to how I can get this done?

Answer Source

Assuming input is a string contains space separated words, like x = "a b c d" you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
  input = input.split(' ')
  output = []
  for i in range(len(input)-n+1):
  return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

for g in (' '.join(x) for x in ngrams(input, 2)):
   grams.setdefault(g, 0)
   grams[g] += 1

Putting that all together into one final function gives:

def ngrams(input, n):
  input = input.split(' ')
  output = {}
  for i in range(len(input)-n+1):
    g = ' '.join(input[i:i+n])
    output.setdefault(g, 0)
    output[g] += 1
  return output

ngrams('a a a a', 2) # {'a a': 3}