I am looking for how many times all the words in a bag of words are found in an article. I am not interested in the frequency of each word, but in the total number of times all of them are found in the article. I have to analyse hundreds of articles as I retrieve them from the internet, and my algorithm takes a long time since each article is about 800 words.
Here is what I do (where amount is the number of times the words were found in a single article, article is a string holding the article content, and I use NLTK to tokenize):
from nltk.tokenize import word_tokenize as tokenize  # NLTK tokenizer, as mentioned above

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)
occurrences = [word for word in tokenized_article
               if word in bag_of_words]
amount = len(occurrences)
The tokenized bag of words looks like this:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]
I suggest using a set for the words you are counting: a set has constant-time membership tests, so it is faster than using a list (which has linear-time membership tests).
bag_of_words = set(bag_of_words)  # build the set once, outside the comprehension
occurrences = [word for word in tokenized_article if word in bag_of_words]
amount = len(occurrences)

(Note that the set must be built once, up front: writing if word in set(bag_of_words) inside the comprehension would rebuild the set for every token and lose the speedup.)
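Since hundreds of articles are analysed, it also pays to build the set once and reuse it across articles, and to count with a generator instead of materializing the list of occurrences. A minimal sketch, assuming articles is an iterable of article strings and tokenize is the NLTK tokenizer from the question (count_bag_words is a hypothetical helper name):

from nltk.tokenize import word_tokenize as tokenize

def count_bag_words(article, word_set):
    # Stream over the tokens and count matches without
    # building an intermediate list of occurrences.
    return sum(1 for word in tokenize(article) if word in word_set)

word_set = set(bag_of_words)  # built once, reused for every article
amounts = [count_bag_words(article, word_set) for article in articles]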
Some timing tests (with an artificially created list, repeated ten times):
In : words = s.split(' ') * 10
In : len(words)
Out: 1060
In : to_match = ['NTLK', 'all', 'long', 'I']
In : def f():
...:     return len([word for word in words if word in to_match])
In : timeit(f, number = 10000)
Out: 1.0613768100738525
In : set_match = set(to_match)
In : def g():
...:     return len([word for word in words if word in set_match])
In : timeit(g, number = 10000)
Out: 0.6921310424804688
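The exact numbers depend on the machine and on s (assumed here to be some sample paragraph of text); a self-contained version of the same comparison, for anyone who wants to reproduce it:

from timeit import timeit

s = 'some sample text to search through repeated to make a longer list'  # placeholder text
words = s.split(' ') * 10
to_match = ['NTLK', 'all', 'long', 'I']
set_match = set(to_match)

list_time = timeit(lambda: len([w for w in words if w in to_match]), number=10000)
set_time = timeit(lambda: len([w for w in words if w in set_match]), number=10000)
print(list_time, set_time)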
Some other tests, using a compiled regular expression instead:
In : p = re.compile('|'.join(set_match))
In : p
Out: re.compile(r'I|all|NTLK|long')
In : def h():
...:     return len(filter(p.match, words))
In : timeit(h, number = 10000)
Out: 2.2606470584869385
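So the compiled regex is the slowest of the three: each p.match call carries function-call and matching overhead that a plain set lookup avoids. Two caveats about the session above: p.match is unanchored, so a word like 'allocate' would also count as a match for 'all'; and it is Python 2 code: on Python 3, filter returns a lazy iterator and len(filter(...)) raises a TypeError. A rough Python 3 equivalent (h is the same hypothetical test function):

import re

set_match = {'NTLK', 'all', 'long', 'I'}
p = re.compile('|'.join(set_match))

def h(words):
    # filter() is lazy on Python 3, so count with sum() instead of len()
    return sum(1 for word in words if p.match(word))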