jimo337 - 1 year ago
Python Question

Using regular expressions to match a block of text with random tags inside of it

I'm attempting to match a block of text that may or may not contain tags in it already.

I am working with a large dataset in which I need to tag specific parts of the dataset, and am given the specific strings that need to be tagged. However, when I tag something in one of the blocks of text, I can no longer use my regexes to find it.

Essentially, what I need is to be able to match a string regardless of whether or not tags have been inserted anywhere inside it.

An example would be this:

I need to search for the data block

in a file containing


, and I need to tag
, and
. But clearly, when I initially tag
and create the string
, I can no longer find the string
, nor the string

I need to find
first because there may be multiple
s in the data file.

This is an example string from the dataset:

CDATA[Bvhhg Iebhe:<br /> <br />8/15/73 dc eqedhethv dy tgjp teyzuvj aggmc ej jpdc jdmv.<br /> <br />Ujjeopvf nhvecv xdyf gua 1673 kvffdyr neoivj kdjp gua mvyuc, nadodyr, Lqvyj Fghdodvc, eyf xavzuvyjhb ecivf zuvcjdgyc. <br /> <br />Uxjva avqdvkdyr jpv dyxgamejdgy, dx bgu peqv eyb zuvcjdgyc, nhvecv ogyjeoj mv.  Thank you.<br /> <br />Ovtgaep Jerygy<br />Kehvc eyf Mejvadyr Yeyerva<br />339 922-1323 vlj. 1576<br />vqvyjc@vfrvkggfjepgv.ogm<br /> <br /> <br /> <br /> <br /> <br /> <br />

As you can see, it is pretty ugly and contains raw formatting tags as well. In this example I may need to have "Thank you." tagged by itself, but also contained in a larger tag that excludes only the data found after "Thank you."

I am really at a loss for how to do this. I may just be thinking in the wrong direction, but I have not even gotten close to a solution.

I am working in Python 2.7, but as this is just a regex issue I do not believe that is particularly relevant.

Answer Source

As best I can understand the requirements from the comment thread above, I believe this code does what is expected. The idea is that each piece of text being searched for becomes a regular expression that ignores XML-style tags wherever they occur within the search term. E.g. 'abc' becomes a regular expression like ((<[^>]*>)*a(<[^>]*>)*b(<[^>]*>)*c(<[^>]*>)*).
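
To make that expansion concrete, here is a minimal sketch that builds the pattern for 'abc' and checks it against a string that already has tags between the letters. The helper name `expand` and the sample string are made up for illustration and are not part of the answer's code.

```python
import re

# Zero or more consecutive XML-style tags.
TAG = '(<[^>]*>)*'

def expand(text):
    # Hypothetical helper: put the tag pattern before the first character
    # and after every (escaped) character of the search text.
    return '(' + TAG + ''.join(re.escape(c) + TAG for c in text) + ')'

pattern = expand('abc')
print(pattern)

# The letters a, b, c are still found even though tags sit between them.
match = re.search(pattern, 'x<t1>a</t1>b<t2>c</t2>y')
print(match.group(1))
```

Group 1 spans everything from the first tag touching the match to the last one, so any tags directly adjacent to the text are pulled into the match as well.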

import itertools
import re

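def tag(document, text, tagname):
    # Matches zero or more consecutive XML-style tags.
    tagre = '(<[^>]*>)*'

    # Interleave the tag pattern with each escaped character of the search
    # text, so tags may appear before, between, or after any character.
    regex = '(' + tagre + ''.join(
        itertools.chain(*zip(
            [re.escape(c) for c in text],
            itertools.cycle([tagre])))) + ')'

    # Group 1 is the whole match, including any tags inside it.
    return re.sub(regex, '<{0}>\\1</{0}>'.format(tagname), document)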

document = 'abc123xyz'
document = tag(document, 'abc', 'tag1')
document = tag(document, 'abc12', 'tag2')
document = tag(document, '123', 'tag3')
document = tag(document, 'abc123xyz', 'tag4')

print(document)

# Output:
# <tag4><tag2><tag1>abc<tag3></tag1>12</tag2>3</tag3>xyz</tag4>
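
The same approach can be tried on the kind of text from the question, where "Thank you." must be tagged on its own and also sit inside a larger tag. The following is only a sketch under assumptions: the sample string is a shortened, invented stand-in for the dataset, the tag names `thanks` and `body` are hypothetical, and the pattern uses non-capturing groups and a replacement function instead of the `\1` backreference.

```python
import re

def tag(document, text, tagname):
    # Restated here so the snippet runs on its own. Non-capturing groups
    # keep group 1 as the only capture (an assumption of this sketch).
    tagre = '(?:<[^>]*>)*'
    body = tagre.join(re.escape(c) for c in text)
    regex = '(' + tagre + body + tagre + ')'
    # A replacement function avoids backslash handling in the tag name.
    return re.sub(
        regex,
        lambda m: '<{0}>{1}</{0}>'.format(tagname, m.group(1)),
        document)

# Invented sample, loosely modelled on the question's data.
sample = 'contact me.  Thank you.<br /> <br />Name'

step1 = tag(sample, 'Thank you.', 'thanks')
step2 = tag(step1, 'contact me.  Thank you.', 'body')

print(step1)
print(step2)
```

Because the pattern also absorbs tags directly before and after the text, the first `<br />` ends up inside the `thanks` element. The second call still finds the longer string even though `<thanks>` now sits in the middle of it. As the output of the answer's own example shows, the inserted tags can overlap rather than nest, so the result is not guaranteed to be well-formed XML.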