
Speed up millions of regex replacements in Python 3

I'm using Python 3.5.2

I have two lists


  • a list of about 750,000 "sentences" (long strings)

  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences



So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]


Then I loop through my "sentences"

import re

cleaned = []
for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    cleaned.append(sentence)  # put sentence into a growing list


This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.


  • Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements happen only at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when my word is longer than my sentence, but it's not much of an improvement.



Thank you for any suggestions.

Answer

One thing you can try is to compile one single pattern like r"\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.
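A minimal sketch of that idea, reusing the my20000words and sentences lists from the question (the big_pattern name and the non-capturing (?:...) group are my own choices here): escape each word with re.escape in case any contain regex metacharacters, join them with |, and compile one pattern that is applied once per sentence.

import re

# One pattern covering all 20,000 words; re.escape guards against
# words containing regex metacharacters such as '.' or '+'.
big_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, my20000words)) + r')\b')

# One sub() call per sentence instead of 20,000.
cleaned = [big_pattern.sub("", sentence) for sentence in sentences]

This replaces the 20,000-iteration inner loop with a single pass through re's C matching engine per sentence, which is where the speedup comes from.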