Mr Squer - 5 months ago
Python Question

Parsing HTML with str.split in Python

I'm fetching a website with the requests module and trying to extract specific URLs from inside tags without using BeautifulSoup. Because the same tags are used more than once on the page, I need a list of URLs rather than a single value. Here's part of the HTML I'm trying to parse:

<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
<div class="thread-link-container notranslate">
Forum Rule: Don&#39;t Spam in Any Way

I'm trying to get the URL inside the href attribute of the <a> tag (here, /Forum/ShowPost.aspx?PostID=80631954).

The thing is, because I'm parsing a forum site, those <div> tags appear many times. I'd like to retrieve a list of post URLs using str.split, with code similar to this:

htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]

There is nothing in the HTML code to indicate a post number on the page, just links.


In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished with str.split, as you want. Here's one way, using a list comprehension:

results = [chunk.split('">')[0] for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]

I kept it as close to your original code as possible, but I simplified the second split argument to '">' so that whitespace or line breaks between the <a> and <div> tags don't matter. The [1:] slice skips everything before the first link.
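To make the split approach easy to try, here is a small self-contained sketch. The sample markup is modeled on the snippet in the question; the second link (PostID=12345) and the helper name extract_post_urls are made up for illustration, so treat this as a rough demonstration rather than something tuned to the real page.

```python
# Sample markup modeled on the question's snippet.
# The second link (PostID=12345) is hypothetical, added only to show multiple matches.
htmltext = '''<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
<div class="thread-link-outer-wrapper">
</td>
<td class="notranslate" style="height:25px;">
<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=12345">
<div class="thread-link-outer-wrapper">
</td>'''

MARKER = '<a class="post-list-subject" href="'


def extract_post_urls(html):
    """Return every href that directly follows MARKER, in page order."""
    # html.split(MARKER)[0] is the text before the first link, so skip it.
    # Each remaining chunk starts with the URL, which ends at the first '">'.
    return [chunk.split('">')[0] for chunk in html.split(MARKER)[1:]]


urls = extract_post_urls(htmltext)
print(urls)
```

If the marker never appears, the split yields a single chunk and the slice leaves an empty list, so no special case is needed for pages without posts.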

In case regular expressions are fair game, here's how you could do it:

import re
# [^"]* stops at the closing quote, so several links on one line are not merged into one match
target = r'<a class="post-list-subject" href="([^"]*)"'
results = re.findall(target, htmltext)
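One thing to watch with this kind of pattern is how the capture group is written. A greedy (.*) runs to the last '">' on the line, so it only behaves when each link sits on its own line, whereas [^"]* stops at the closing quote of the href. The sketch below compares the two on a made-up line holding two links; the markup and variable names are assumptions for illustration only.

```python
import re

# Hypothetical markup: two links on a single line, which is where greedy matching goes wrong.
line = ('<a class="post-list-subject" href="/a?PostID=1"><b>x</b></a>'
        '<a class="post-list-subject" href="/a?PostID=2"><b>y</b></a>')

# Greedy: .* extends to the last '">' on the line, swallowing the second link.
greedy = re.findall(r'<a class="post-list-subject" href="(.*)">', line)

# Bounded: [^"]* cannot cross a double quote, so each href is captured separately.
bounded = re.findall(r'<a class="post-list-subject" href="([^"]*)"', line)

print(len(greedy))
print(bounded)
```

A non-greedy (.*?) followed by the closing quote would also work here; the negated character class is just a bit more explicit about where the URL ends.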