stevenjackson121 stevenjackson121 - 1 year ago 63
Python Question

Regular Expression Matching First Non-Repeated Character

TL;DR"(.)(?!.*\1)", text).group()
doesn't match the first non-repeating character contained in text (it always returns a character at or before the first non-repeated character, or before the end of the string if there are no non-repeated characters. My understanding is that should return None if there were no matches).
I'm only interested in understanding why this regex is not working as intended using the Python
module, not in any other method of solving the problem

Full Background

The problem description comes from I've already solved this problem using a non-regex method, but revisited it to expand my understanding of Python's
The regular expressions i thought would work (named vs unnamed backreferences) are:

(same results in python2 and python3)

My entire program looks like this

import re
import sys
with open(sys.argv[1], 'r') as test_cases:
for test in test_cases:

and some input/output pairs are:

refurbished | f
substantially | u
cardiff | c
kangaroo | k
rain | r
teetthing | e
god | g
newtown | e
taxation | x

According to what I've read at

  • (.)
    creates a named group that matches any character and allows later backreferences to it as

  • (?!...)
    is a negative lookahead which restricts matches to cases where
    does not match.

  • .*\1
    means any number (including zero) of characters followed by whatever was matched by

  •, string)
    returns only the first location where the regex pattern produces a match (and would return None if no match could be found)

  • .group()
    is equivalent to
    which returns the entire match

I think these pieces together should solve the stated problem, and it does work like I think it should for most inputs, but failed on
. Throwing similar problems at it reveals that it seems to ignore repeated characters if they are consecutive:

tooth | o # fails on consecutive repeated characters
aardvark | d # but does ok if it sees them later
aah | a # verified last one didn't work just because it was at start
heh | e # but it works for this one
hehe | h # What? It thinks h matches (lookahead maybe doesn't find "heh"?)
heho | e # but it definitely finds "heh" and stops "h" from matching here
hahah | a # so now it won't match h but will match a
hahxyz | a # but it realizes there are 2 h characters here...
hahxyza | h # ... Ok time for StackOverflow

I know lookbehind and negative lookbehind are limited to 3-character-max fixed length strings, and cannot contain backreferences even if they evaluate to a fixed length string, but I didn't see the documentation specify any restrictions on negative lookahead.

Answer Source

Well let's take your tooth example - here is what the regex-engine does (a lot simplified for better understanding)

Start with t then look ahead in the string - and fail the lookahead, as there is another t.

^  °

Next take o, look ahead in the string - and fail, as there is another o.


Next take the second o, look ahead in the string - no other o present - match it, return it, work done.


So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.