Trevor Thackston - 5 months ago
Python Question

Backslashes in Python Regex

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:

searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)


However - say pojo is 'MyObject' - the regex is not matching it to this line:

<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">


If I print the string I'm searching with (while stopped in Pdb), I see this:

'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'


I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:

searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)


And the string searched becomes

'<class name="(.*\\.|)MyObject".*table="(.*?)"'


It still doesn't return a match. The regular expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area). Any idea what is going wrong here?

Answer

Given this:

re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)

The first part of the pattern is interpreted like this:

1. <class name="   a literal string beginning with < and ending with "
2. (               the beginning of a group
3.   .*                zero or more of any characters
4.   \\                a single literal backslash
5.   .                 any single character
6. OR
7.                     nothing
8. )               end of the group

Since the string you're searching does not contain a literal backslash, the pattern won't match.
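A minimal sketch of that behaviour, using a shortened version of the pattern. The variable names and the second test string (which contains a real backslash) are made up for illustration:

```python
import re

# Raw string: both backslashes reach the regex engine unchanged,
# where together they mean "one literal backslash". The dot after
# them is unescaped, so it matches any single character.
pattern = r'name="(.*\\.|)MyObject"'

no_backslash = 'name="com.place.package.MyObject"'
# Hypothetical input that really contains a backslash before the last dot.
with_backslash = 'name="com.place.package\\.MyObject"'

print(re.search(pattern, no_backslash))                 # no match object
print(re.search(pattern, with_backslash) is not None)   # a match is found
```

The first search fails because neither alternative in the group can succeed: there is no backslash for the first branch, and the empty branch would need MyObject to follow the opening quote directly.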

If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
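The doubled backslashes seen in Pdb are most likely just how Python displays a string's repr, not extra characters in the pattern. A small sketch comparing the two spellings (variable names are illustrative):

```python
# Both are raw strings, so every backslash typed is kept as-is.
doubled = r'(.*\\.|)'   # 8 characters, two of them backslashes
single = r'(.*\.|)'     # 7 characters, one backslash

print(len(doubled), len(single))

print(single)        # print shows the actual characters
print(repr(single))  # repr shows each backslash doubled, as Pdb does
print(repr(doubled)) # so two real backslashes are displayed as four
```

So a displayed `\\.` corresponds to the single-backslash pattern, and a displayed `\\\\.` corresponds to the pattern with two real backslashes.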

Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding group".
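As a rough comparison of the two spellings, the sketch below runs both against an unqualified and a fully qualified class name. Both match in each case; the visible difference is what the first group holds when there is no package prefix. The sample lines and variable names are assumptions for illustration:

```python
import re

bare = '<class name="MyObject" table="my_cool_object">'
qualified = '<class name="com.place.package.MyObject" table="my_cool_object">'

pipe_pattern = r'<class name="(.*\.|)MyObject"'
optional_pattern = r'<class name="(.*\.)?MyObject"'

# With the trailing pipe, the empty alternative takes part in the match,
# so the group captures an empty string.
print(repr(re.search(pipe_pattern, bare).group(1)))

# With ?, the group is skipped entirely, so it is reported as None.
print(re.search(optional_pattern, bare).group(1))

# For a qualified name both forms capture the package prefix.
print(re.search(pipe_pattern, qualified).group(1))
print(re.search(optional_pattern, qualified).group(1))
```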

This seems to work for me:

import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo = "MyObject"

# Optional package prefix ending in a dot, then the class name and its closing quote
pattern = r'<class name="(.*\.)?' + pojo + r'".*table="(.*?)"'

assert re.search(pattern, contents1)
assert re.search(pattern, contents2)
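To actually pull out the table name, something along these lines might work. It is only a sketch: find_table_name is a hypothetical helper, the use of re.escape and a non-capturing group are additions not in the pattern above, and a real mapping file may need an XML parser instead of a regex.

```python
import re

def find_table_name(pojo, contents):
    """Return the table attribute of the <class> element for pojo, or None."""
    # re.escape guards against class names containing regex metacharacters.
    # (?:...) keeps the package prefix out of the numbered groups, so the
    # table name is group 1.
    pattern = r'<class name="(?:.*\.)?' + re.escape(pojo) + r'".*?table="(.*?)"'
    match = re.search(pattern, contents)
    return match.group(1) if match else None

sample_qualified = '<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true">'
sample_bare = '<class name="MyObject" table="my_cool_object" dynamic-insert="true">'

print(find_table_name("MyObject", sample_qualified))
print(find_table_name("MyObject", sample_bare))
print(find_table_name("Object", sample_qualified))  # partial class name: no match
```

Including the closing quote after the class name is what stops a partial name such as Object from matching MyObject.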