José José - 6 months ago
Python Question

Convert string searched with regexp into lowercase with Python

I have some broken XML (this excerpt is well-formed, but the full document is not), and part of it looks like this:

<sp who="#FERN">
<speaker>FERNANDO</speaker>
<p>Un instante. Soy un hombre. Huir sería cobarde... ¡Sin defenderse! ¡Sin salvarte!... Va a venir... con la vara del guardia. ¡Ay, que ya la conoces! ¡Ah, maldito!... ¡Y me dices que ese hombre es bueno!...</p>
</sp>


I want to convert the value of the attribute who into lowercase:

<sp who="#fern">


Normally I use the \U and \L case-conversion escapes in regular expression replacements, but I think Python does not support them. I tried this regexp:

text = re.sub(r'(who="#.*?")', r'\L\1', text)


But the output is:

<sp \Lwho="#FERN">


Which is not what I want... Any help, my dear stackoverflowers? Thanks in advance!

Answer

You can pass an anonymous function (a lambda) as the replacement argument of re.sub:

>>> import re
>>> s = '''<sp who="#FERN">
<speaker>FERNANDO</speaker>
<p>Un instante. Soy un hombre. Huir sería cobarde... ¡Sin defenderse! ¡Sin salvarte!... Va a venir... con la vara del guardia. ¡Ay, que ya la conoces! ¡Ah, maldito!... ¡Y me dices que ese hombre es bueno!...</p>
</sp>'''
>>> print(re.sub(r'\b(who="#)([^"]*)', lambda m: m.group(1) + m.group(2).lower(), s))
<sp who="#fern">
<speaker>FERNANDO</speaker>
<p>Un instante. Soy un hombre. Huir sería cobarde... ¡Sin defenderse! ¡Sin salvarte!... Va a venir... con la vara del guardia. ¡Ay, que ya la conoces! ¡Ah, maldito!... ¡Y me dices que ese hombre es bueno!...</p>
</sp>
>>> 

Explanation:

To transform captured text before it is substituted, pass a function as the replacement argument of re.sub; a lambda is a convenient way to do that. The function receives the match object and returns the replacement string.

  • \b(who="#) matches the exact string who="#. The \b is a word boundary: it matches the position between a word character and a non-word character (or vice versa). The parentheses capture whatever the enclosed pattern matches, so the first group contains who="#.

  • ([^"]*) matches any character except ", zero or more times, so the second group captures FERN.

  • The replacement function returns the first group unchanged, followed by the second group converted to lowercase.
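
The same idea can also be written with a named function instead of a lambda, which may read more clearly if the replacement logic grows. This is only a sketch: the names WHO_ATTR, lowercase_who, and sample are made up for illustration, and the sample string is a shortened version of the XML from the question.

```python
import re

# Same pattern as in the answer: group 1 is the literal prefix who="#,
# group 2 is everything up to (not including) the closing quote.
WHO_ATTR = re.compile(r'\b(who="#)([^"]*)')


def lowercase_who(match):
    """Return the matched text with the attribute value lowercased."""
    # Hypothetical helper: keep the prefix as-is, lowercase only the value.
    return match.group(1) + match.group(2).lower()


# Shortened sample (an assumption for illustration, not the full document).
sample = '<sp who="#FERN"><speaker>FERNANDO</speaker></sp>'

result = WHO_ATTR.sub(lowercase_who, sample)
print(result)
# <sp who="#fern"><speaker>FERNANDO</speaker></sp>
```

Only the attribute value changes; the text inside <speaker> is left alone because it is not preceded by who="#.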