user521990 - 2 months ago
Python Question

python regex matching between multiple lines and every other match

So I've been playing around with this for a few days; here is what I am looking for and the regex I have so far. I have a file in the format below (there are some other fields, but I have omitted those).

I just want to match the bold text, i.e. the XXXXXX after the second ADDR 1 on each ADDR 1 line:

ADDR 1 - XXXXXX ADDR 1 - XXXXXX

ADDR 2 - XXXXXX ADDR 2 - XXXXXX

ADDR 1 - XXXXXX ADDR 1 - XXXXXX

ADDR 2 - XXXXXX ADDR 2 - XXXXXX

The regex I have written only matches the first ADDR 1 - XXXXXX, but I need to match all instances of the bolded XXXXXX.

re.findall(r'ADDR 1- .*? ADDR 1-(.*?)(?=ADDR 2-)', lines, re.DOTALL)


Any suggestions? I feel like I might be missing something simple, but not sure.

Answer

Code:

import re

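# Sample data; the values to extract follow the second "ADDR 1" on each line.
# (Named "text" rather than "str" so the built-in str is not shadowed.)
text = """
ADDR 1 - XXXXXX ADDR 1 - ABCDEF

ADDR 2 - XXXXXX ADDR 2 - XXXXXX

ADDR 1 - XXXXXX ADDR 1 - UVWXYZ

ADDR 2 - XXXXXX ADDR 2 - XXXXXX
"""

# No re.DOTALL here, so "." stops at newlines and each match stays on one line.
m = re.findall(r".*ADDR\s+1\s+-\s+(.*)", text)
print(m)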

Output:

C:\Users\dinesh_pundkar\Desktop>python c.py
['ABCDEF', 'UVWXYZ']

How it works:

.*ADDR\s+1\s+-\s+(.*)

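Let's take one line: ADDR 1 - XXXXXX ADDR 1 - ABCDEF

  • .*ADDR matches "ADDR 1 - XXXXXX ADDR". Because .* is greedy, it first consumes the whole line and then backtracks to the last ADDR from which the rest of the pattern can still match. That is why the match skips past the first ADDR 1 and lands on the second one. Without re.DOTALL, . does not match a newline, so each match stays on its own line.
  • \s+1\s+-\s+(.*) matches the rest, "1 - ABCDEF". The \s+1\s+-\s+ part is required because we want ADDR 1 and not ADDR 2. (.*) captures ABCDEF, and that captured group is what findall returns.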
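
To see the greedy behaviour in isolation, here is a small sketch. It is not part of the script above, and the variable names are only for illustration. It compares greedy .* with lazy .*? on a single line, and then shows what changes if re.DOTALL is added, as in the regex from the question.

```python
import re

line = "ADDR 1 - XXXXXX ADDR 1 - ABCDEF"

# Greedy .* runs to the end of the line, then backtracks to the LAST "ADDR 1 - ".
greedy = re.search(r".*ADDR\s+1\s+-\s+(.*)", line).group(1)
print(greedy)  # ABCDEF

# Lazy .*? stops at the FIRST "ADDR 1 - ", so the group captures everything after it.
lazy = re.search(r".*?ADDR\s+1\s+-\s+(.*)", line).group(1)
print(lazy)    # XXXXXX ADDR 1 - ABCDEF

text = (
    "ADDR 1 - XXXXXX ADDR 1 - ABCDEF\n\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX\n\n"
    "ADDR 1 - XXXXXX ADDR 1 - UVWXYZ\n\n"
    "ADDR 2 - XXXXXX ADDR 2 - XXXXXX\n"
)

# Default flags: "." cannot cross a newline, so there is one match per ADDR 1 line.
per_line = re.findall(r".*ADDR\s+1\s+-\s+(.*)", text)
print(per_line)  # ['ABCDEF', 'UVWXYZ']

# With re.DOTALL the greedy .* swallows the whole text, so only the last
# "ADDR 1 - " is found and the group runs on to the end of the text.
dotall = re.findall(r".*ADDR\s+1\s+-\s+(.*)", text, re.DOTALL)
print(len(dotall))  # 1
```

So for this pattern it matters that re.DOTALL is left out: the line-by-line behaviour depends on . not matching newlines.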