Maurizio - 4 months ago
Python Question

RegEx - Match optional groups

I know RegEx is not the best way to scrape HTML, but that is what I have to work with.
I have something like:

<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a> Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>


I need to match the Writing and Art parts. However, they are not guaranteed to be present, and there could be other parts such as Ink and Pencils.

How do I do this? I need to use pure RegEx, no additional Python libs.

PP.
Answer

Maybe there are two patterns to recognise.

  1. your keywords exist within a <td>...</td>
  2. your keywords are followed by an <a>...</a> section

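So, first extract everything within each <td> (a Python sketch, assuming html holds the page source as a string):

import re

# re.S lets "." also match newlines, so a cell can span several lines
for m in re.finditer(r"<td[^>]*>(.*?)</td[^>]*>", html, re.S):
    inner = m.group(1)  # the text between <td> and </td>
    ...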

The (.*?) means match non-greedily, i.e. match the minimum possible. Otherwise you would match everything from the first <td> to the last </td> (instead of the next </td>).
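
To make the greedy versus non-greedy difference concrete, here is a small Python comparison. The two-cell `row` string is a made-up example, not taken from the question.

```python
import re

# Hypothetical two-cell row, used only for illustration.
row = "<td>first</td><td>second</td>"

# Greedy: (.*) runs to the LAST </td>, swallowing the cell boundary.
greedy = re.findall(r"<td[^>]*>(.*)</td[^>]*>", row)

# Non-greedy: (.*?) stops at the NEXT </td>, giving one match per cell.
lazy = re.findall(r"<td[^>]*>(.*?)</td[^>]*>", row)

print(greedy)  # ['first</td><td>second']
print(lazy)    # ['first', 'second']
```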

Then you can move on to processing the inner portion!
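
One possible way to process the inner portion, sketched in Python with only the standard `re` module: treat each credit as a label, a colon, and a single `<a>...</a>` element, and collect whichever pairs happen to be present. The names `TD_RE`, `CREDIT_RE`, and `extract_credits` are made up for this sketch, and it assumes each label is one word followed by exactly one link.

```python
import re

# Sample markup from the question.
html = ('<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a> '
        'Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>')

# Step 1: the contents of each <td>...</td> cell (non-greedy).
TD_RE = re.compile(r"<td[^>]*>(.*?)</td[^>]*>", re.S)

# Step 2: a one-word label, a colon, then one <a>...</a> element.
# Assumption: every label is followed by exactly one link.
CREDIT_RE = re.compile(r"(\w+):\s*<a[^>]*>(.*?)</a>", re.S)

def extract_credits(page):
    """Return one {label: name} dict per <td> cell found in page."""
    return [dict(CREDIT_RE.findall(inner)) for inner in TD_RE.findall(page)]

print(extract_credits(html))
# [{'Writing': 'Carlo Chendi', 'Art': 'Luciano Bottaro'}]
```

Because the label is captured rather than hard-coded, a cell that only has Ink or Pencils yields just those keys, and missing parts are simply absent from the dict. If a label can be followed by several links, this sketch would only pick up the first one.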
