Chris Nielsen - 7 months ago
Python Question

Python regex to find multiple consecutive punctuations

I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are:

-/\()!"+,'&.

I have tried translating this punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']


I can find individual characters with nested for loops, for example:

import re

for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found a match!', p, t)
        else:
            print('no match')


However, the single backslash character is not found when I test this, and I don't know how to restrict the results to 2 or more consecutive occurrences. I've read that I need to use the + symbol, but I don't know the correct syntax for it.

Here are some test cases:

The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox


Translated into a Python list, the test cases look like this:

test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]


How do I use Python regex to identify and report all matches where the punctuation symbol appears 2 or more times in a row?

Answer

The punctuation characters can be put into a character class in square brackets. What follows depends on whether a run of two or more punctuation characters may mix any of the characters or must repeat the same character.

In the first case, curly braces are appended to specify the minimum (2) and maximum number of repetitions. The maximum is unbounded and is left empty:

[...]{2,} # 2 or more repetitions
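
As a minimal sketch of this first case, assuming the twelve symbols listed in the question (the names ANY_RUN and find_any_runs and the sample strings are only illustrative):

```python
import re

# Character class with the twelve symbols from the question, followed by {2,}.
# The hyphen comes first so it is taken literally. The backslash is written
# four times and the single quote is escaped because this is an ordinary
# (non-raw) string literal.
ANY_RUN = re.compile('[-/\\\\()!"+,&\'.]{2,}')

def find_any_runs(text):
    """Return every run of two or more punctuation symbols, mixed or not."""
    return ANY_RUN.findall(text)

print(find_any_runs('The (quick brown) fox,,,'))
print(find_any_runs('The "quick brown", fox'))
print(find_any_runs('The quick-brown fox'))
```

Note that a mixed run such as a double quote followed by a comma counts as a match here, which may or may not be what the records call for.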

If only repetitions of the same character need to be found, the first matched punctuation character is put into a group. A back reference to that group (that is, the same character) then has to follow one or more times:

([...])\1+

The back reference \1 refers to the first group in the expression. Groups are numbered from left to right by their opening parentheses.
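
The numbering matters as soon as groups are nested. A small sketch, assuming the twelve symbols from the question (the name NESTED is only illustrative): wrapping the whole run in an outer group makes that group number 1, so the single captured symbol becomes group 2 and the back reference has to be \2.

```python
import re

# Outer parentheses open first, so they are group 1 (the whole run).
# Inner parentheses open second, so they are group 2 (the first symbol),
# and the back reference must therefore be \2 rather than \1.
NESTED = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')

match = NESTED.search('The quick brown fox!!!')
whole_run = match.group(1)      # everything matched by the outer group
first_symbol = match.group(2)   # the single symbol that was repeated
print(whole_run)
print(first_symbol)
```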

The next issue is escaping. There are escaping rules for Python string literals, and additional escaping is needed inside the regular expression. A character class does not require much escaping, but the backslash must be doubled. The following example therefore quadruples the backslash: one doubling for the string literal, the second for the regular expression.

Raw strings r'...' are normally useful for patterns, but this pattern contains both single and double quotation marks, so whichever quote delimits the literal still has to be escaped. The example uses an ordinary string instead.
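
One possible alternative, shown only as a sketch: a triple-quoted raw string can hold both kinds of quotation mark without escaping them, so the backslash only has to be doubled once. The variable names and the sample string are illustrative.

```python
import re

# Ordinary string literal: each regex-level backslash is written twice,
# so the literal backslash in the class needs four, and the single quote
# is escaped because it delimits the string.
plain = '[-/\\\\()!"+,&\'.]{2,}'

# Triple-quoted raw string: backslashes reach the regex engine unchanged,
# and both kinds of quotation mark can appear unescaped inside it.
raw = r'''[-/\\()!"+,&'.]{2,}'''

# Both literals produce the same pattern text.
print(plain == raw)
print(plain)

sample = 'The quick\\\\brown fox'   # the text: The quick\\brown fox
found = re.search(raw, sample).group(0)
print(found)
```

Printing the pattern string is a quick way to check what the regex engine actually receives after Python has processed the literal.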

>>> import re
>>> test_cases = [
    "The quick '''brown fox",
    'The &&quick brown fox',
    'The quick\\brown fox',
    'The quick\\\\brown fox',
    'The -quick brown// fox',
    'The quick--brown fox',
    'The (quick brown) fox,,,',
    'The quick ++brown fox',
    'The "quick brown" fox',
    'The quick/brown fox',
    'The quick&brown fox',
    'The ""quick"" brown fox',
    'The quick,, brown fox',
    'The quick brown fox...',
    'The quick-brown fox',
    'The ((quick brown fox',
    'The quick brown)) fox',
    'The quick brown fox!!!',
    "The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
    match = pattern_same_punctuation.search(t)
    if match:
        print("{:24} => {}".format(t, match.group(1)))
    else:
        print(t)

The quick '''brown fox   => '''
The &&quick brown fox    => &&
The quick\brown fox
The quick\\brown fox     => \\
The -quick brown// fox   => //
The quick--brown fox     => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox    => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox  => ""
The quick,, brown fox    => ,,
The quick brown fox...   => ...
The quick-brown fox
The ((quick brown fox    => ((
The quick brown)) fox    => ))
The quick brown fox!!!   => !!!
The 'quick' brown fox
>>> 
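
The loop above uses search(), which reports only the first run in each record. Since the question asks for all matches, here is a hedged sketch using finditer() instead. It assumes the twelve symbols from the question, and the names SAME_RUN and report_runs are made up for illustration.

```python
import re

# Assumed pattern: one symbol from the question's list of twelve, captured
# in group 1, followed by one or more repetitions of that same symbol.
SAME_RUN = re.compile('([-/\\\\()!"+,&\'.])\\1+')

def report_runs(record):
    """Return (start_index, run) for every repeated-symbol run in one record."""
    return [(m.start(), m.group(0)) for m in SAME_RUN.finditer(record)]

for record in ['The ""quick"" brown fox',
               'The -quick brown// fox',
               'The quick-brown fox']:
    print(record, report_runs(record))
```

In a streaming job, a function like report_runs could be called once per input record, and an empty list means the record has no repeated punctuation.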