nickpick nickpick - 3 months ago 13
Python Question

How to capture two substrings from html text?

I have the following string:

data-event-title="Yuichi Sugita* vs Adrian Mannarino">
<span class="odds-container">
<b class="odds">1/12</b>
</a>


And I would like to capture Yuichi Sugita and 1/12. For that I created the following regex:
ata-event-title="(.+)".+ class="odds">(.+)<

which has two capture groups in parentheses (when I use them separately they work fine), but the problem is that the .+ in between them does not work as expected.

Any suggestions are appreciated.

Answer Source

Your use of dots is "greedy", so they capture as much as they possibly can (and you don't actually want that in this case).

You can change the capture group quantifiers to be "lazy", but it will be more efficient to use negated character classes (syntax [^character]) for your capture groups.

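The dot between your two capture groups is fine to be "greedy": it first consumes the rest of the input and then backtracks until class="odds"> can match, so it ends up in the right place anyhow (at the last such occurrence, if the input contains several).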
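To make the greedy / lazy / negated-class difference concrete, here is a small sketch. It assumes the sample input has been flattened onto a single line (so the newline issue does not get in the way yet), and the variable names are just for illustration:

```python
import re

# Sample from the question, flattened to one line for this comparison (an assumption).
html = 'data-event-title="Yuichi Sugita* vs Adrian Mannarino"><b class="odds">1/12</b></a>'

# Greedy: (.+) runs on to the LAST double quote it can find.
greedy = re.search(r'data-event-title="(.+)"', html).group(1)

# Lazy: (.+?) stops at the first double quote that lets the match succeed.
lazy = re.search(r'data-event-title="(.+?)"', html).group(1)

# Negated character class: [^"]+ can never step over a double quote at all.
negated = re.search(r'data-event-title="([^"]+)"', html).group(1)

print(repr(greedy))   # overshoots into the markup that follows the attribute
print(repr(lazy))     # just the attribute value
print(repr(negated))  # just the attribute value
```

The lazy and negated versions return the same text here; the negated class simply gets there without the step-by-step expansion that a lazy quantifier does.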

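Assuming you have line breaks as your sample input shows, your dot will stop at newline characters unless you enable the s (DOTALL) flag. In Python, pass re.DOTALL (or re.S) as the flags argument, or put the inline (?s) at the start of the pattern. Use this:

r"(?s)data-event-title=\"([^*]+).*class=\"odds\">([^<]+)"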

This will capture:

  1. the substring that follows data-event-title=" ending just before the first occurrence of *.
  2. the substring that follows class="odds"> ending just before the first < is found.
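As a rough sketch of how this could be used from Python, assuming the multi-line input from the question is held in a string (the names html, pattern, player and odds are mine, not from the question):

```python
import re

# The sample input from the question, including its line breaks.
html = '''data-event-title="Yuichi Sugita* vs Adrian Mannarino">
<span class="odds-container">
<b class="odds">1/12</b>
</a>'''

# re.DOTALL lets "." match newlines, so ".*" can bridge the lines
# between the attribute and the odds element.
pattern = re.compile(r'data-event-title="([^*]+).*class="odds">([^<]+)', re.DOTALL)

match = pattern.search(html)
if match:
    player, odds = match.groups()
    print(player)  # Yuichi Sugita
    print(odds)    # 1/12

# The same pattern without DOTALL finds nothing, because "." stops at the first newline.
no_flag = re.search(r'data-event-title="([^*]+).*class="odds">([^<]+)', html)
print(no_flag)  # None
```

Note that class="odds-container" is not mistaken for the target, because the pattern requires the closing quote and > immediately after odds.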

Here is the Python regex pattern demo.


If you want the full data-event-title attribute value, this will capture Yuichi Sugita* vs Adrian Mannarino:

r"(?s)data-event-title=\"([^\"]+).*class=\"odds\">([^<]+)"
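A minimal sketch of the full-title variant, wrapped in a helper so a failed match is handled. The function name extract_title_and_odds is hypothetical, and the input is assumed to look like the sample in the question:

```python
import re

# Full attribute value in group 1, odds text in group 2.
# re.DOTALL is passed explicitly so "." also matches newlines.
TITLE_ODDS = re.compile(r'data-event-title="([^"]+).*class="odds">([^<]+)', re.DOTALL)


def extract_title_and_odds(html):
    """Return (title, odds) from the first match, or None if nothing matches."""
    match = TITLE_ODDS.search(html)
    return match.groups() if match else None


sample = '''data-event-title="Yuichi Sugita* vs Adrian Mannarino">
<span class="odds-container">
<b class="odds">1/12</b>
</a>'''

result = extract_title_and_odds(sample)
print(result)  # ('Yuichi Sugita* vs Adrian Mannarino', '1/12')

missing = extract_title_and_odds("<p>no odds here</p>")
print(missing)  # None
```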