Vikash Singh - 1 month ago
Python Question

Regex replace is taking time for millions of documents, how to make it faster?

I have documents like:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]


I have a synonym file like this:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}


I want to replace all synonyms, for which I wrote this code:

import re

# Pre-compile every regex once to save compilation time (credit: alec_djinn).
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    # Apply each compiled pattern in turn to the current document.
    for value in compiled_dict:
        pattern = compiled_dict[value]
        document = pattern.sub(synonyms[value], document)
    print(document)


output:

I work on c programing.
I work on c programing.


But the number of documents is a few million and the number of synonym terms is in the tens of thousands, so the expected time for this code to finish is approximately 10 days.

Is there a faster way to do this?

PS: I want to use the output to train a word2vec model.

Any help is greatly appreciated. I was thinking of writing some cpython code and putting it in parallel threads.

Answer Source

I have done string replacement jobs like this before, also for training word2vec models on very large text corpora. When the number of terms to replace (your "synonym terms") is very large, it can make sense to do string replacement using the Aho-Corasick algorithm instead of looping over many single string replacements. You can take a look at my fsed utility (written in Python), which might be useful to you.
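
To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick replacement. It is not how fsed is implemented; the function names (build_automaton, find_matches, replace_all) are made up for illustration. It assumes every term starts and ends with a word character, approximates the regex \b check with str.isalnum() plus underscore, and resolves overlaps by keeping the leftmost, longest match. Unlike the loop in the question, it makes one pass per document, so a replacement is never re-scanned by a later term.

```python
from collections import deque


def build_automaton(terms):
    """Build Aho-Corasick tables: trie transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> terms ending at this state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(term)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start, end, term) for every occurrence of every term."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            matches.append((i - len(term) + 1, i + 1, term))
    return matches


def is_word_char(ch):
    # Approximation of the regex \w class (an assumption of this sketch).
    return ch.isalnum() or ch == "_"


def replace_all(text, synonyms, automaton):
    """Replace whole-word term occurrences in a single pass over the text."""
    matches = [
        (start, end, term)
        for start, end, term in find_matches(text, automaton)
        if (start == 0 or not is_word_char(text[start - 1]))
        and (end == len(text) or not is_word_char(text[end]))
    ]
    # Leftmost match first; at the same start, prefer the longest term.
    matches.sort(key=lambda m: (m[0], m[0] - m[1]))
    pieces = []
    pos = 0
    for start, end, term in matches:
        if start < pos:
            continue  # overlaps a span that was already replaced
        pieces.append(text[pos:start])
        pieces.append(synonyms[term])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}
documents = [
    "I work on c programing.",
    "I work on c coding.",
]

automaton = build_automaton(synonyms)  # built once, reused for every document
for doc in documents:
    print(replace_all(doc, synonyms, automaton))
```

The automaton is built once, and each document is then scanned in time roughly proportional to its length plus the number of matches, regardless of how many terms there are. For millions of documents, a C-backed Aho-Corasick library would likely be much faster than this pure-Python version.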