Trevor Thackston - 1 year ago
Python Question

Backslashes in Python Regex

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:

searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)

However - say pojo is 'MyObject' - the regex is not matching it to this line:

<class name="" table="my_cool_object" dynamic-insert="true" dynamic-update="true">

If I print the string (while stopped in Pdb) I'm searching with, I see this:

'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'

I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:

searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)

And the string searched becomes

'<class name="(.*\\.|)MyObject".*table="(.*?)"'

It still doesn't return a match. The regex I intend to use works when I test it outside the script (with just one backslash in the apparently problematic area). Any idea what is going wrong here?


Given this: re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)

The first part of the pattern is interpreted like this:

1. <class name="   a literal string beginning with < and ending with "
2. (               the beginning of a group
3.   .*                zero or more of any character
4.   \\                a single literal backslash
5.   .                 any single character
6. OR
7.                     nothing
8. )               end of the group

Since the string you're searching for does not have a literal backslash, it won't match.
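A quick way to check this reading is to run the doubled-backslash pattern against two sample strings. This is only a minimal sketch: both sample strings, including the package names, are made up for illustration.

```python
import re

# In a raw string, \\ reaches the regex engine as two characters,
# which the engine reads as "one literal backslash".
pattern = r'name="(.*\\.|)MyObject"'

dotted = 'name="com.example.MyObject"'   # hypothetical package name, no backslash
backslashed = 'name="pkg\\.MyObject"'    # the text really contains a backslash

no_match = re.search(pattern, dotted)     # None: there is no backslash to match
match = re.search(pattern, backslashed)   # matches: backslash, then any character

print(no_match)
print(match.group(1))
```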

If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
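The extra backslashes seen in Pdb come from how Python displays a string, not from extra characters in the pattern. The sketch below, which uses only built-ins, shows the difference between what a raw string contains and what its repr looks like.

```python
raw_two = r'\\.'   # raw string: backslash, backslash, dot
raw_one = r'\.'    # raw string: backslash, dot

# len() counts the characters actually stored in the string.
lengths = (len(raw_two), len(raw_one))

# repr() is what Pdb's `p` command and the interactive prompt display.
# It escapes every backslash, so each one appears doubled.
shown_two = repr(raw_two)
shown_one = repr(raw_one)

print(lengths)     # (3, 2)
print(shown_two)   # '\\\\.'
print(shown_one)   # '\\.'
print(raw_one)     # \.
```

So a pattern written as r'\.' shows up as '\\.' when inspected, yet it still holds a single backslash.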

Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
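For comparison, here is a small sketch that runs the pipe form and the ? form side by side. The sample strings and the com.example package are hypothetical. Both forms match the same inputs. The difference shows up only in what group 1 holds when the prefix is absent.

```python
import re

with_pipe = r'name="(.*\.|)MyObject"'
with_optional = r'name="(.*\.)?MyObject"'

bare = 'name="MyObject"'
qualified = 'name="com.example.MyObject"'  # hypothetical package name

# Both forms match both inputs.
results = [
    bool(re.search(p, s))
    for p in (with_pipe, with_optional)
    for s in (bare, qualified)
]

# With no prefix, the pipe form captures an empty string,
# while the optional group does not participate and yields None.
empty_prefix_pipe = re.search(with_pipe, bare).group(1)
empty_prefix_optional = re.search(with_optional, bare).group(1)

print(results)
print(repr(empty_prefix_pipe), repr(empty_prefix_optional))
```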

This seems to work for me:

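import re

pojo = 'MyObject'
# The package in contents1 is a placeholder standing in for any fully qualified class name.
contents1 = '''<class name="com.example.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''

pattern = r'<class name="(.*\.)?' + pojo + '.*table="(.*?)"'

assert re.search(pattern, contents1)
assert re.search(pattern, contents2)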
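To go from "it matches" to the table name itself, one possible approach is to wrap the pattern in a helper and read the captured group. The function name table_name_for and the sample line are hypothetical. The sketch also departs from the pattern above in two ways: it uses [^"]* instead of .* so the prefix cannot run past the closing quote of the name attribute, and it passes the class name through re.escape.

```python
import re

def table_name_for(pojo, contents):
    """Return the table mapped to `pojo`, or None if no <class> line matches."""
    # (?:...)? makes the package prefix optional without capturing it;
    # re.escape guards against regex metacharacters in the class name.
    pattern = (
        r'<class name="(?:[^"]*\.)?' + re.escape(pojo) + r'".*?table="(.*?)"'
    )
    match = re.search(pattern, contents)
    return match.group(1) if match else None

# Hypothetical mapping line; the package name is a placeholder.
sample = '<class name="com.example.MyObject" table="my_cool_object" dynamic-insert="true">'

print(table_name_for('MyObject', sample))
print(table_name_for('Other', sample))
```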