Anton Melnikov - 1 month ago
Python Question

Python: optional group in regular expression

I'm trying to parse HTML img tags in a certain document. In particular, I want to find all src, alt and title attributes of an image. Attributes are always in the same order; however, title and alt are optional and could be absent.

I've tried to make groups optional with (?:title="(.*?)")? in my regular expression, but it doesn't work. Any help would be appreciated.

>>> import re
>>> example = '<img class="alignnone wp-image-4170 size-full" title="example_title" src="http://www.example.com/wp-content/uploads/2016/07/example.jpg" alt="example_alt" width="300" height="430" />'
>>> re.search(r'(?:title="(.*?)")?.*?src="(.*?)".*?(?:alt="(.*?)")?', example).groups()
(None, 'http://www.example.com/wp-content/uploads/2016/07/example.jpg', None)


Expected result would be:

('example_title', 'http://www.example.com/wp-content/uploads/2016/07/example.jpg', 'example_alt')

Answer

You can get the title to match by moving your first .*? inside your first non-capturing group:

>>> re.search(r'(?:title="(.*?)".*?)?src="(.*?)".*?(?:alt="(.*?)")?', example).groups()
('example_title',
 'http://www.example.com/wp-content/uploads/2016/07/example.jpg',
 None)

The problem with your regex is that it puts .*? right after an optional group. At the beginning of the string, the regex is "allowed" to skip the optional group (since it's optional) and move on to what comes after it. What comes after it is .*?, which will match anything, so this always succeeds and the regex has no need to match your title group. It just uses the .*? to consume everything from the beginning of the string up to the "src", and then matches the "src". Moving the .*? inside the non-capturing group means the "anything" can only be matched after a title has been matched. The match then begins at the title if one appears before the "src", and begins directly at the "src" only if the search position advances all the way there without finding a title first.
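Note that alt still comes back as None with the pattern above, for a similar reason: the lazy .*? in front of the optional alt group is happy to match nothing, and the optional group is then skipped because alt=" does not start right at that position. A possible fix, sketched below, is to move that filler inside the alt group as well. This is only an illustration: the name IMG_ATTRS is made up here, and the pattern assumes double-quoted attribute values in title, src, alt order. It also swaps .*? for [^>]*? so the filler cannot run past the end of the tag.

```python
import re

# Hypothetical pattern name. Assumes double-quoted values and the
# attribute order title, src, alt (title and alt optional).
IMG_ATTRS = re.compile(
    r'(?:title="([^"]*)"[^>]*?)?'  # optional title, plus the filler up to src
    r'src="([^"]*)"'               # required src
    r'(?:[^>]*?alt="([^"]*)")?'    # optional alt, with its leading filler inside the group
)

example = '<img class="alignnone wp-image-4170 size-full" title="example_title" src="http://www.example.com/wp-content/uploads/2016/07/example.jpg" alt="example_alt" width="300" height="430" />'

print(IMG_ATTRS.search(example).groups())
# ('example_title', 'http://www.example.com/wp-content/uploads/2016/07/example.jpg', 'example_alt')
```

Because each optional group now carries its own filler, the engine has to try matching the attribute before it is allowed to give up on it.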

As was mentioned in a comment, parsing HTML this way is not a great idea. Your question is actually an illustration of why. When you wrote (?:title="(.*?)")?.*? you probably were thinking in terms of "an optional title followed by anything", but the problem is that the "anything" can also include a title, so what it actually means is "either a title right at the beginning of the string and followed by anything, or just anything (including a title that we will ignore)". When you try to combine specific matches like title= with generic matches like .*, what you are trying to capture may be slurped up by a .* instead of captured with your more specific group. In addition, your code assumes that title, src, and alt will always occur in that order, but they may occur in any order, in which case your regex will fail to capture them correctly.
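If the attributes may come in any order, one alternative is to let an HTML parser split the tag into attributes and then look them up by name. The sketch below uses the standard library's html.parser module; the class name ImgAttrCollector and the sample markup are invented for illustration, not taken from the question.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collects (title, src, alt) for every img tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so the order in which
        # the attributes appear in the tag does not matter.
        if tag == "img":
            found = dict(attrs)
            self.images.append(
                (found.get("title"), found.get("src"), found.get("alt"))
            )


# Sample input: alt comes before src here, and title is absent.
html = '<p><img alt="example_alt" src="http://www.example.com/example.jpg" width="300" /></p>'

collector = ImgAttrCollector()
collector.feed(html)
print(collector.images)
# [(None, 'http://www.example.com/example.jpg', 'example_alt')]
```

Missing attributes simply come back as None, which mirrors what the optional regex groups were meant to do.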
