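I've struck a problem with a regular expression in Python (2.7.9).
I'm trying to strip out HTML <span> tags with:
re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)
The <span> is not removed when there is a newline between the opening <span ...> and the closing </span>, even though re.S should let . match newlines.
The Python script below leaves the <span> in place; the Perl script after it removes it: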
#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)
#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/$1/s;
print "$text1\n";
The fourth positional parameter of re.sub is count, not flags:
re.sub(pattern, repl, string, count=0, flags=0)
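To make the role of count concrete, here is a minimal sketch. The sample string is hypothetical (two single-line spans, not taken from the question), and count is passed by keyword so the intent is unambiguous:

```python
import re

# Hypothetical sample with two single-line spans (not from the question).
sample = '<span>a</span> and <span>b</span>'
pattern = r'<span[^>]*>(.*?)</span>'

# count=1 stops after the first replacement.
first_only = re.sub(pattern, r'\1', sample, count=1)

# count=0 (the default) replaces every match.
all_spans = re.sub(pattern, r'\1', sample, count=0)

print(first_only)  # a and <span>b</span>
print(all_spans)   # a and b
```

So a value passed in the fourth position only limits how many matches are replaced; it has no effect on how . matches.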
You need to pass flags as a keyword argument:
re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑
Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) instead of as the S (DOTALL) flag:
>>> import re
>>> re.S
16
>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'
>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
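If you would rather not rely on the keyword argument, there are two other standard ways to get the same DOTALL behaviour: compile the pattern with the flag, or put the inline (?s) flag at the start of the pattern. A short sketch using the same text1 as above (the variable names are just for illustration):

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
pattern = r'<span[^>]*>(.*?)</span>'

# 1. Pass the flag by keyword, as shown above.
by_keyword = re.sub(pattern, r'\1', text1, flags=re.S)

# 2. Compile once with the flag; the compiled pattern's sub() takes no flags,
#    so there is no positional slot to confuse with count.
span_re = re.compile(pattern, re.S)
by_compiled = span_re.sub(r'\1', text1)

# 3. Put the inline (?s) flag at the very start of the pattern.
by_inline = re.sub(r'(?s)' + pattern, r'\1', text1)

print(repr(by_keyword))
print(by_keyword == by_compiled == by_inline)
```

All three produce the same string as the flags=re.S call in the session above. The compiled form is handy when the same pattern is reused, since the flag is attached to the pattern once.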