JCode JCode - 3 months ago 11
Python Question

Python: "unexpected end of regular expression" during re.compile, empty brackets

To summarize, I have a re.compile statement like so:

markers = ['x'] # some list
re.compile(r" *[{}].*(?=\n|$)".format('\\'.join([''] + markers)))


In most cases it works fine, unless markers is empty, in which case the regex pattern looks like this:

pattern = ' *[].*(?=\\n|$)'


Why does it have a problem with an empty character set? What is the workaround to make it work for an empty markers list?

SOLUTION



Credits to: Martijn Pieters, Wiktor Stribiżew and Amadan.

To summarize:


  • an empty character set doesn't exist in regex: [] is parsed like [a, i.e. the ] is taken as the first literal character of the set, so the parser still expects a closing ] and raises the error when it is missing,

  • the check for an empty markers list must be done before compiling the pattern, to avoid the invalid empty brackets [],

  • in .*(?=\n|$) the lookahead (?=\n|$) is redundant, so the expression can be simplified to .*,

  • to escape special characters inside brackets [] efficiently, it's better to use re.escape().
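The first point can be checked directly. The following is a minimal sketch, assuming Python 3, where the error is worded differently from the Python 2 message in the title; the helper name compiles is made up for illustration:

```python
import re

def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The pattern from the question, as produced by an empty markers list.
# The "]" right after "[" is read as a literal, so the set is never closed.
empty_class_ok = compiles(r" *[].*(?=\n|$)")

# "[]]" is a valid set: the first "]" is a literal member, the second closes it.
literal_bracket = re.compile(r"[]]")
matches_bracket = literal_bracket.fullmatch("]") is not None
```

Here empty_class_ok ends up False while matches_bracket is True, which is why the parser complains about a missing closing bracket rather than about an empty set.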



Adding things up, the solution to my problem is:

if markers:
    re.compile(r" *[{}].*".format(re.escape(''.join(markers))))
else:
    pass  # something

Answer

You can check whether the markers list is non-empty at the very beginning, and then escape only the characters that must be escaped inside a character class: ^, \, ], [, -.

Note that if the markers list is empty, the pattern effectively becomes " *.*", basically accepting any line. You can match that with "^.*$".

Here is my suggestion:

import re
markers = ['x', ']', '[', '-', '^', '\\'] # some list
#markers = [] # uncomment to test the empty case
if markers:
    escaped = [re.sub(r"[][^\\-]", r"\\\g<0>", x) for x in markers]
    pat = r" *[{}].*".format("".join(escaped))
    p = re.compile(pat)
else:
    p = re.compile("^.*$")

print(p.pattern)
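Wrapped in a function, the same approach is easier to reuse and test. This is only a sketch: the name build_marker_pattern and the sample input are hypothetical and not part of the original answer.

```python
import re

def build_marker_pattern(markers):
    """Compile a pattern for: optional spaces, one marker character, rest of the line."""
    if not markers:
        # No markers: fall back to matching any whole line, as in the answer above.
        return re.compile("^.*$")
    # Backslash-escape only the characters that are special inside [...].
    escaped = [re.sub(r"[][^\\-]", r"\\\g<0>", m) for m in markers]
    return re.compile(r" *[{}].*".format("".join(escaped)))

p = build_marker_pattern(['#', ']', '-', '^', '\\'])

# Strip everything from the first marker (and the spaces before it) onwards.
stripped = p.sub("", "value = 1  # trailing comment")
```

With this input, stripped holds "value = 1", and build_marker_pattern([]) returns the catch-all pattern instead of raising.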


Also, .*(?=\n|$) can actually be reduced to .*: without the DOTALL flag, . matches any character except a newline (so it can still match a CR), and .* therefore always matches everything up to the next \n or the end of the string.
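A quick way to convince yourself of this is to run both variants over the same multi-line text. The sketch below assumes the default flags (no re.DOTALL) and uses a made-up sample string with # as the only marker:

```python
import re

# Two lines, each ending in a marker; the second one ends at end of string.
text = "a # one\nb # two"

with_lookahead = re.compile(r" *[#].*(?=\n|$)")
without_lookahead = re.compile(r" *[#].*")

# Both variants should find exactly the same spans.
same_matches = with_lookahead.findall(text) == without_lookahead.findall(text)

# Removing the matches leaves only the text before each marker.
cleaned = without_lookahead.sub("", text)
```

Because the greedy .* already stops right before a newline or at the end of the string, the lookahead never rejects or shortens a match.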
