Andy Watkins - 3 years ago
Python Question

Python re.sub non-greedy substitution fails with a newline in the string

I've struck a problem with a regular expression in Python (2.7.9).

I'm trying to strip out HTML <span> tags using a regex like so:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)


(The regex reads thusly: <span, anything that's not a >, then a >, then a non-greedy match of anything, followed by </span>; re.S (re.DOTALL) is used so that . matches newline characters.)

This seems to work unless there is a newline in the text. It looks like re.S (DOTALL) doesn't apply within a non-greedy match.

Here's the test code; remove the newline from text1 and the re.sub works. Put it back in, and the re.sub fails. Put the newline char outside the <span> tag, and the re.sub works.

#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)


For comparison, I wrote a Perl script to do the same thing; the regex works as I expect here.

#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s;
print "$text1\n";


Any ideas?

Tested in Python 2.6.6 and Python 2.7.9

Answer Source

The fourth parameter of re.sub is count, not flags:

re.sub(pattern, repl, string, count=0, flags=0)

You need to use a keyword argument to specify the flags explicitly:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑

Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) rather than as the S (DOTALL) flag, so . still does not match the newline and nothing is substituted:

>>> import re
>>> re.S
16

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
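
As a further sketch (not part of the original answer), the snippet below illustrates two points: that re.S really does behave as a plain count of 16 when passed as count, and two ways to avoid mixing up count and flags altogether, namely compiling the pattern with the flag or embedding the inline (?s) flag in the pattern. The variable names (pattern, capped, span_re, compiled_result, inline_result) are only illustrative.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
pattern = r'<span[^>]*>(.*?)</span>'

# re.S has the integer value 16, so used as count it merely caps
# the number of replacements at 16; it sets no flag.
capped = re.sub(r'a', 'b', 'a' * 20, count=int(re.S))

# Alternative 1: compile the pattern with the flag. The compiled
# pattern's sub() has no flags argument, so nothing can be misplaced.
span_re = re.compile(pattern, re.S)
compiled_result = span_re.sub(r'\1', text1)

# Alternative 2: put the inline (?s) flag at the start of the pattern,
# which turns on DOTALL without any extra argument to re.sub.
inline_result = re.sub(r'(?s)' + pattern, r'\1', text1)

print(repr(capped))
print(repr(compiled_result))
print(repr(inline_result))
```

Both alternatives should strip the span tags across the newline, just like the flags=re.S call above; compiling is probably the tidier choice if the same pattern is reused several times.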