Sundeep - 3 years ago
Python Question

Replacing repetitive words, case-insensitively


>>> import re
>>> line = 'the the, To to'
>>> re.findall(r'\b(\w+) \1', line)
['the']
>>> re.findall(r'\b(\w+) \1', line, re.I)
['the', 'To']

>>> re.sub(r'\b(\w+) \1', r'\1', line, re.I)
'the, To to'


But the expected output is:

'the, To'

The same regex works as expected in other tools, for example:

  • Vim:
    s/\v<(\w+) \1/\1/gi

  • Perl:
    s/\b(\w+) \1/$1/gi

  • sed:
    -r 's/\b(\w+) \1/\1/gi'

Is this a known behavior? What is a workaround? My Python version is
if that makes a difference.

Answer

Read the definition of re.sub:

re.sub(pattern, repl, string, count=0, flags=0)

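You are passing re.I as count, not as flags. re.I has the integer value 2, so the call allows at most 2 replacements, while flags stays at its default of 0 and matching remains case-sensitive. That is why only 'the the' is collapsed. Instead, try: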

>>> re.sub(r'\b(\w+) \1', r'\1', line, flags=re.I)
                                     # ^ note
'the, To'
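
To make the mix-up concrete, here is a minimal sketch that reuses the string and regex from the question. The variable names (pattern, as_count, dedupe, and so on) are only illustrative and not part of the original post. It shows that passing re.I in the fourth position behaves like count=2, and it lists two other ways to get case-insensitive matching besides the flags= keyword.

```python
import re

line = 'the the, To to'
pattern = r'\b(\w+) \1'

# re.I (re.IGNORECASE) has the integer value 2, so putting it in the
# fourth positional slot of re.sub is equivalent to count=2, flags=0.
assert int(re.I) == 2
as_count = re.sub(pattern, r'\1', line, count=2)

# Keyword argument: the flag actually reaches the flags parameter.
with_keyword = re.sub(pattern, r'\1', line, flags=re.I)

# Alternative: compile the pattern once; the second argument of
# re.compile is flags, so there is no count parameter to collide with.
dedupe = re.compile(pattern, re.I)
with_compiled = dedupe.sub(r'\1', line)

# Alternative: an inline flag at the start of the pattern itself.
with_inline = re.sub(r'(?i)\b(\w+) \1', r'\1', line)

print(as_count)       # the, To to
print(with_keyword)   # the, To
print(with_compiled)  # the, To
print(with_inline)    # the, To
```

The compiled form may be the most convenient when the same substitution runs over many lines, since the flag is bound to the pattern object once.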