brittenb - 6 months ago
Python Question

Need a Goldilocks regex pattern - not too greedy and not too selfish

I've got a set of strings that might look something like this:

lines_ = ["04/04 1,000.00 Some word132:11bdkljas 14235262634235",
"04/04 500.00 A simpler phrase 19058453049854",
"04/04 1,000,000.00 Apply//erklj//1324:123"]


I'm trying to write a regex that will pull out the first three "elements" of each string. I realize that, based on this example, I could simply use re.split(r"\s{2,}", line) and then grab the first three elements, but I can't guarantee that the input will always have two or more spaces separating the pieces I want. So I'd rather have a more robust regex to grab them.

I tried using this:

r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+)(\s+\d+)"


This works for the first two strings, but not the third, since it has no trailing set of digits. So then I tweaked it to this:

r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+)(\s+\d+)?"


This works for the third string, but for the first two it includes the fourth element (the trailing digits) as part of the third element. So then I tweaked it further to look like this:

r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+?)(\s+\d+)?"


I thought that the ? inside (.+?) would make it less greedy so that it wouldn't gobble up the last element. Instead, it gives me only the first letter of the first word in the third element.
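The one-letter result follows from how a lazy quantifier interacts with an optional trailing group: .+? takes as few characters as it can, and because (\s+\d+)? is allowed to match nothing and nothing after it pins the pattern to the end of the string, the overall match succeeds after a single character. A minimal sketch reproducing both behaviours on one of the sample strings (single-space separators are an assumption here):

```python
import re

line = "04/04 500.00 A simpler phrase 19058453049854"

# Greedy third group with an optional fourth group: .+ runs to the end of
# the string and the optional group is satisfied by matching nothing.
greedy = r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+)(\s+\d+)?"

# Lazy third group: .+? stops after one character, because the optional
# group can match nothing and the pattern has no end-of-string anchor.
lazy = r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+?)(\s+\d+)?"

greedy_third = re.match(greedy, line).group(3)
lazy_third = re.match(lazy, line).group(3)

print(greedy_third)  # A simpler phrase 19058453049854
print(lazy_third)    # A
```

The lazy group only expands when something after it forces it to, such as a required trailing group or a $ anchor.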

What I would like to end up with is an output like the following:

groups_ = [("04/04", "1,000.00", "Some word132:11bdkljas"),
("04/04", "500.00", "A simpler phrase"),
("04/04", "1,000,000.00", "Apply//erklj//1324:123")]


Any advice on what I'm missing in my regex would be appreciated.

Answer

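Use this pattern with the m (multiline) and g (global) options. In Python, these correspond roughly to the re.MULTILINE flag and to re.findall or re.finditer: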

(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+?)(?:\s+(\d+)|,|$)  
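As a sketch of how this pattern might be applied in Python to the strings from the question: each string is matched on its own, so $ already means end of input and neither the m nor the g option is needed. The second pattern, anchored, and the "Payment 2 of 3" string are illustrative assumptions rather than part of the answer; they show one way to keep a description that itself contains a standalone number from being cut short.

```python
import re

lines_ = ["04/04 1,000.00 Some word132:11bdkljas 14235262634235",
          "04/04 500.00 A simpler phrase 19058453049854",
          "04/04 1,000,000.00 Apply//erklj//1324:123"]

# The pattern from the answer, unchanged.
pattern = re.compile(r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+?)(?:\s+(\d+)|,|$)")

# Keep only the first three groups of each match.
groups_ = [pattern.match(line).groups()[:3] for line in lines_]
for row in groups_:
    print(row)

# Assumed variant (not from the answer): the trailing digits are optional,
# but the whole pattern must reach the end of the string.
anchored = re.compile(r"(\d{2}/\d{2})\s+([\d,]+\.\d\d)\s+(.+?)(?:\s+\d+)?\s*$")

# Hypothetical input whose description contains standalone numbers.
tricky = "04/04 10.00 Payment 2 of 3 12345"
answer_third = pattern.match(tricky).group(3)     # stops at the first " 2"
anchored_third = anchored.match(tricky).group(3)  # keeps the full description
print(answer_third, "|", anchored_third)
```

On the three sample strings both patterns yield the same first three groups; they only differ on inputs like the hypothetical one above.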
