arvindpdmn arvindpdmn - 2 months ago 9
Python Question

Using regex, extract quoted strings that may contain nested quotes

I have the following string:

'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'


Now, I wish to extract the following quotes:

1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.


I tried the following code but I'm not getting what I want. The
[^\1]*
is not working as expected. Or is the problem elsewhere?

import re

s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"

for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
print("\nGroup {:d}: ".format(i+1))
for g in m.groups():
print(' '+g)

Answer

If you really need to return all the results from a single regular expression applied only once, it will be necessary to use lookahead ((?=findme)) so the finding position goes back to the start after each match - see this answer for a more detailed explanation.

To prevent false matches, some clauses are also needed regarding the quotes that add complexity, e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this but the rules I've gone for are:

  1. An opening quote must not be immediately preceeded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
  2. A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.

Applying the above rules leads to the following regular expression:

(?=(?:(?<!\w)'(\w.*?)'(?!\w)|"(\w.*?)"(?!\w)))

Regular expression visualization

Debuggex Demo

A good quick sanity check test on any possible candidate regular expression is to reverse the quotes. This has been done in the demo here: https://regex101.com/r/vX4cL9/1

Comments