Ryan Saxe Ryan Saxe - 1 year ago 81
Python Question

Optimizing regular expression techniques

I am wondering about optimization techniques for regex

So I am trying to parse every instance of money out of a 400k line corpus. I needed to also include lines such as

as well as
"one billion and six hundred twenty five thousand dollars"
and everything in between. This required a very lengthy regular expression with multiple instances of groups like

MONEYEXPRESSION = '(?:\d\d?\d?(?:,?\d{3})*(?:\.\d+)?)'
(one|two|...|ninety[\s-]?nine|hundred|a hundred|MONEYEXPRESSION)((\s*and\s*|\s*-\s*|\s*)(one|two|...|ninety[\s-]?nine|hundred|a hundred|MONEYEXPRESSION))*

Even more than that, In order to require it to be an instance of money and avoid matching lines such as
"five hundred people were at the event"
I have 4 OR'd options that require
"$", "dollars?", or "cents?"
to be in specific places in the sentence at least once.

the regular expression is almost 20k characters! :(

You can imagine with an expression this extensive, with any bad practices, it really adds to the time. I have been running this on the corpus for the past 2 hours and it has still not completed the matching. I was wondering what some of the best practices for optimizing and trimming unnecessary regex is. What operations I am using are expensive and can be supplemented for better ones. And if maybe there is just a better way to solve this problem?

Answer Source

You're asking about optimizing performance, so let's focus on that. What makes the regexp engine really slow is backtracking, and what causes backtracking is parts that might succeed in different places in the string, without clear ways to decide. So try these rules of thumb:

  1. From the backtracking link above: "When nesting repetition operators, make absolutely sure that there is only one way to match the same match."

  2. Avoid large optional components. Instead of something like (<number>? <number>)? <number> to match a sequence with space-separated elements, write (<number> ?)+.

  3. Avoid components that can be satisfied by the empty string-- the engine will be trying to satisfy them at every position.

  4. Ensure that unconstrained parts in your regexps are bounded in length, especially if the later part cannot be reliably recognized. Things like A.*B? are asking for trouble-- this can match anything that starts with A.

  5. Don't use lookahead/lookbehind. Almost always there's a simpler way.

More generally, keep it simple. I'm not sure how you managed to get to 20K characters for this task, but I'll bet there are ways to simplify it. One consideration is that it's ok to match things that you will not see anyway.

For example, why match all numbers from one to ninety nine, and not just their components? Yeah, you'll match nonsense like "nine ninety and dollars", but that does no harm. You are searching for money amounts, not validating input. E.g, this should match all written out dollar amounts less than a million dollars:

((one|two|three|...|twenty|thirty|...|ninety|hundred|thousand|and) ?)+ (dollars?|euros?)\b

Since this is tagged "python", here are two more suggestions:

  1. If the task (or assignment) allows you to break up the search in steps, do so. A regexp that does everything often has to be so complex, that it is slower than simply running several searches in sequence.

  2. Even if you are constrained to use one monster regexp, write it in pieces and use python to assemble it into one string. It won't make any difference when executing, but will be a lot easier to work with.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download