Nil - 1 year ago
Python Question

# Fastest way to count a list of words in an article using python

I want to count how many times the words in a bag of words appear in an article. I am not interested in the frequency of each individual word, but in the total number of times any of them appears in the article. I have to analyse hundreds of articles as I retrieve them from the internet, and my algorithm is slow, since each article is about 800 words long.

Here is what I do (`amount` is the number of times the words were found in a single article, `article` is a string containing the article content, and I use NLTK to tokenize):

```
bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)

occurrences = [word for word in tokenized_article
               if word in bag_of_words]

amount = len(occurrences)
```

Where `tokenized_article` looks like:

```
[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]
```

And so does `bag_of_words`.

I was wondering if there's any more efficient/faster way of doing it using NLTK or lambda functions, for instance.

I suggest using a `set` for the words you are counting: a `set` has a constant-time membership test, so it is faster than a list, whose membership test takes time linear in its length. Build the set once, before the loop, rather than on every iteration.

For example:

```
bag_of_words = set(bag_of_words)  # built once, not once per word

occurrences = [word for word in tokenized_article
               if word in bag_of_words]

amount = len(occurrences)
```
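If you only need the total, `sum` over a generator avoids building the intermediate list as well. A minimal sketch (the sample tokens here are illustrative, not taken from the question):

```
# Illustrative tokens; in the real code these come from tokenize().
tokenized_article = [u'sarajevo', u'bosnia', u'war', u'war', u'peace']
bag_of_words = [u'war', u'bosnia']

# Build the set once, outside the loop, so each membership test is O(1).
bag_set = set(bag_of_words)

# sum() over a generator counts matches without materializing a list.
amount = sum(1 for word in tokenized_article if word in bag_set)
# amount == 3 (u'bosnia' once, u'war' twice)
```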

Some timing tests (with an artificially created list, repeated ten times):

```
In [4]: words = s.split(' ') * 10

In [5]: len(words)
Out[5]: 1060

In [6]: to_match = ['NTLK', 'all', 'long', 'I']

In [9]: def f():
...:     return len([word for word in words if word in to_match])

In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525

In [14]: set_match = set(to_match)

In [15]: def g():
...:     return len([word for word in words if word in set_match])

In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688
```

Some other tests:

```
In [22]: p = re.compile('|'.join(set_match))

In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')

In [28]: def h():
...:     return len(filter(p.match, words))

In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385
```
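Another option, not timed above, is `collections.Counter`: count every word in the article in one pass, then total only the counts of the bagged words. A sketch with illustrative data standing in for the tokenized article and bag:

```
from collections import Counter

# Illustrative tokens; in the real code these come from tokenize().
tokenized_article = [u'sarajevo', u'bosnia', u'war', u'war', u'peace']
bag_set = set([u'war', u'bosnia'])

counts = Counter(tokenized_article)             # one pass over the article
amount = sum(counts[word] for word in bag_set)  # total only bagged words
# amount == 3
```

This also gives the per-word frequencies for free, should they be needed later.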