Nil Nil - 11 months ago
Python Question

Fastest way to count a list of words in an article using python and/or NLTK

I want to count how many times any of the words in a bag of words appear in an article. I am not interested in the frequency of each individual word, only in the total number of matches across all of them. I have to analyse hundreds of articles retrieved from the internet, and my algorithm is slow since each article is about 800 words.

Here is what I do (where amount is the number of times the words were found in a single article, article is a string containing the article content, and I use NLTK to tokenize):

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)

occurrences = [word for word in tokenized_article
               if word in bag_of_words]

amount = len(occurrences)

Where the tokenized_article looks like:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]

And so does the bag_of_words.

I was wondering whether there is a more efficient/faster way of doing this, for instance using NLTK or lambda functions.

Many thanks for your help!

Answer

I suggest using a set for the words you are counting: a set has constant-time membership tests, so it is faster than a list (whose membership test is linear-time). Make sure to build the set once, before the comprehension, so it is not rebuilt on every iteration.

For example:

word_set = set(bag_of_words)  # built once, O(1) lookups afterwards
occurrences = [word for word in tokenized_article
               if word in word_set]

amount = len(occurrences)
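If you only need the total, you can also skip building the intermediate list and feed a generator expression to sum. A minimal sketch with made-up tokens (the sample words below are hypothetical stand-ins for the real tokenized data):

```python
# Hypothetical stand-ins for the tokenized bag of words and article.
bag_of_words = [u'war', u'bosnia', u'herzegovi']
tokenized_article = [u'sarajevo', u'bosnia', u'war', u'war', u'peace']

word_set = set(bag_of_words)  # built once, O(1) membership tests afterwards
amount = sum(1 for word in tokenized_article if word in word_set)
print(amount)  # 3: 'bosnia' once plus 'war' twice
```

This avoids materialising the occurrences list when only its length is needed.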

Some timing tests (with an artificially created list, repeated ten times):

In [4]: words = s.split(' ') * 10

In [5]: len(words)
Out[5]: 1060

In [6]: to_match = ['NTLK', 'all', 'long', 'I']

In [9]: def f():
   ...:     return len([word for word in words if word in to_match])

In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525

In [14]: set_match = set(to_match)

In [15]: def g():
    ...:     return len([word for word in words if word in set_match])

In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688

Some other tests:

In [22]: p = re.compile('|'.join(set_match))

In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')

In [28]: def h():
    ...:     return len(filter(p.match, words))

In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385

(Note that this was run on Python 2; on Python 3, filter returns an iterator, so h would need len(list(filter(p.match, words))). Also, p.match only anchors at the start of the string, so e.g. 'long' would also match 'longer'.)
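Another option, not in the original answer, is collections.Counter: count every token in the article once, then sum only the counts of tokens that are in the bag. A minimal sketch with hypothetical sample tokens:

```python
from collections import Counter

# Hypothetical sample data standing in for the tokenized article and bag of words.
tokenized_article = [u'sarajevo', u'bosnia', u'war', u'war', u'peace']
bag_of_words = {u'war', u'bosnia'}

# Count every token once, then sum only the counts of tokens in the bag.
counts = Counter(tokenized_article)
amount = sum(count for word, count in counts.items() if word in bag_of_words)
print(amount)  # 3: 'bosnia' once plus 'war' twice
```

This is handy when the same article is matched against several bags of words, since the Counter is built only once per article.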