user36476 - 1 year ago
Python Question

Speed up millions of regex replacements in Python 3

I'm using Python 3.5.2

I have two lists

  • a list of about 750,000 "sentences" (long strings)

  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences"

import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.

Answer Source

One thing you can try is to compile a single pattern such as r"\b(word1|word2|word3)\b".

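Because re relies on C code to do the actual matching, the savings can be dramatic: all of the alternatives are tried inside the regex engine in one call per sentence, instead of through 20,000 separate re.sub calls from a Python loop.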
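A minimal sketch of that idea, under a few assumptions: the names build_word_remover, banned_words and sentences are made up for illustration, and the words are passed through re.escape on the assumption that they are literal text rather than regex fragments. The group is non-capturing, since nothing needs to be captured for a deletion.

```python
import re


def build_word_remover(words):
    """Compile one pattern that matches any of `words` as a whole word.

    Hypothetical helper, not from the original post. re.escape is applied
    on the assumption that each word is literal text, not a regex.
    """
    alternation = "|".join(re.escape(word) for word in words)
    # (?:...) groups the alternatives without capturing them;
    # \b on both sides restricts matches to word boundaries.
    return re.compile(r"\b(?:" + alternation + r")\b")


# Tiny stand-ins for the real 20,000 words and 750,000 sentences.
banned_words = ["foo", "bar", "baz"]
sentences = ["foo is not food", "a bar, a barn and baz"]

pattern = build_word_remover(banned_words)

# One sub() call per sentence instead of one per (sentence, word) pair.
cleaned = [pattern.sub("", sentence) for sentence in sentences]
```

Here "food" and "barn" are left alone because the \b after the alternation fails inside a longer word. Note that deleting a word leaves its surrounding whitespace in place, just as the original nested loop does.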
