user1419579 - 1 year ago
Python Question

Substitute regex match groups where match groups may overlap

I am working in Python. I have a string that matches my regex and would like to substitute all of the match groups (the end goal is to wrap each group in an HTML span).

I know there are good ways to do this with the re module; however, I don't know whether they can handle my case, since some of my match groups overlap.

I've looked at the re module and string templates, but I don't think either helps me in this situation. I've also tried implementing a solution myself, but I've had no luck so far, and it feels like there should be a better solution.

E.g., let's say I have the string:

"This is my cat her name is Alice"

and I'm using the pattern:

"This is my cat (her name is (\w+))"

In this case I should have:

match 0: "This is my cat her name is Alice"
match 1: "her name is Alice"
match 2: "Alice"

I want to end up with something that looks like this:

"This is my cat <span class=\"class1\">her name is <span class=\"class2\">Alice</span></span>"

Answer Source
  1. Create a list of indices where groups begin and end. You can use the match object's .start([group]) and .end([group]) methods for this. (Make sure you have some way of distinguishing group starts from group ends.)
  2. Sort the list by descending index.
  3. For each index in the list, insert </span> if it's an end index or <span class="whatever"> if it's a start index.
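
To make step 1 concrete, here is a small sketch that prints the start and end offsets of each group for the string and pattern from the question. The variable names are only for illustration.

```python
import re

s = "This is my cat her name is Alice"
p = r"This is my cat (her name is (\w+))"
match = re.match(p, s)

# match.re.groups is the number of capturing groups in the pattern
for group in range(1, match.re.groups + 1):
    print(group, match.start(group), match.end(group), repr(match.group(group)))
# 1 15 32 'her name is Alice'
# 2 27 32 'Alice'
```

Both groups end at offset 32, which is why the end tags have to be told apart from the start tags rather than identified by position alone.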


import re

s = "This is my cat her name is Alice"
p = r"This is my cat (her name is (\w+))"
match = re.match(p, s)
# (position, is_start) for every group; working from the end of the
# string backwards keeps the positions still to be processed valid
groups = range(1, match.re.groups + 1)
indices = sorted([(match.start(g), True) for g in groups] +
                 [(match.end(g), False) for g in groups], reverse=True)
for index, is_start in indices:
    if is_start:
        s = s[:index] + '<span class="class1">' + s[index:]
    else:
        s = s[:index] + '</span>' + s[index:]
print(s)
# output: This is my cat <span class="class1">her name is <span class="class1">Alice</span></span>
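
The snippet above gives every span the same class, whereas the question asks for class1, class2, and so on. One possible variation, sketched below, carries the group number along with each index and uses it to build the class name. The function name wrap_groups and the class_prefix parameter are made up for this example, and skipping unmatched or empty groups is an assumption about what you would want in those cases.

```python
import re

def wrap_groups(pattern, text, class_prefix="class"):
    """Wrap each matched group in a <span> whose class is prefix + group number."""
    match = re.match(pattern, text)
    if match is None:
        return text
    events = []
    for group in range(1, match.re.groups + 1):
        start, end = match.span(group)
        # Assumption: ignore groups that did not participate (-1) or matched nothing
        if start == -1 or start == end:
            continue
        events.append((start, True, group))
        events.append((end, False, group))
    # Descending order: inserting from the back leaves earlier offsets untouched.
    # At equal positions a start tag is inserted before an end tag, so adjacent
    # groups come out as </span><span ...>, and an inner group's start tag is
    # inserted before the outer one's, so the outer tag ends up on the left.
    for pos, is_start, group in sorted(events, reverse=True):
        if is_start:
            tag = '<span class="%s%d">' % (class_prefix, group)
        else:
            tag = '</span>'
        text = text[:pos] + tag + text[pos:]
    return text

print(wrap_groups(r"This is my cat (her name is (\w+))",
                  "This is my cat her name is Alice"))
# This is my cat <span class="class1">her name is <span class="class2">Alice</span></span>
```

This relies on the fact that regex groups can nest or sit side by side but never partially overlap, so the inserted tags always come out balanced.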