Alexandros - 12 days ago
Python Question

Parsing a website with Python

I managed to get the page source as a string, but now I need to parse it, e.g. find each instance of a word and save the next few lines in an array.

The text I have looks something like this:

<div class="searchResult">
<table id="ctl00_lp_ctl01_lst" class="searchResultList" cellspacing="0" border="0" style="border-collapse:collapse;">
<tr>
<td class="searchResultI">
<div class="date">
13:07
&nbsp;&nbsp;
17 July
</div>
<div class="sTitle">
<a href="www.example1.com/result1">
Link Description</a></div>
<div class="sSubTitle">
</div>
</td>
</tr><tr>
<td class="searchResultAI">
<div class="date">
20:07
&nbsp;&nbsp;
16 July
</div>
<div class="sTitle">
<a href="www.example2.com/result2">
Link Description</a></div>
<div class="sSubTitle">
</div>
</td>
</tr><tr>

and so on


I would like to get each href link and its link description and put them in an array. I don't know why this is proving so difficult for me, since I have done several parsing projects in other languages. I have already searched the web but found nothing helpful.

sgp
Answer

You should not use regular expressions to parse HTML. Python has many libraries for parsing HTML, and a good choice here is Beautiful Soup (a third-party package, installed with pip install beautifulsoup4). This is how easy it is to get the href links with it:

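# Python 3: urllib2 was merged into urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page source
html = urlopen("http://www.example.com/").read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# Print the href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))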
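
To collect both the href and the link description into a list, as the question asks, something like the following might work. It is only a sketch: the html string is a trimmed copy of the markup from the question, extract_links is a hypothetical helper name, and it assumes every result link sits inside a div with class "sTitle" as in that sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to be representative
# of the real page; the real source string would be passed in instead).
html = """
<div class="searchResult">
<table class="searchResultList">
<tr>
<td class="searchResultI">
<div class="date">13:07 &nbsp;&nbsp; 17 July</div>
<div class="sTitle">
<a href="www.example1.com/result1">
Link Description</a></div>
</td>
</tr><tr>
<td class="searchResultAI">
<div class="sTitle">
<a href="www.example2.com/result2">
Link Description</a></div>
</td>
</tr>
</table>
</div>
"""


def extract_links(page_source):
    """Return a list of (href, description) tuples, one per result link.

    Hypothetical helper: assumes result links live in <div class="sTitle">.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    results = []
    # The CSS selector keeps only anchors inside div.sTitle, so navigation
    # links elsewhere on the page are ignored.
    for anchor in soup.select("div.sTitle a"):
        href = anchor.get("href")              # None if the attribute is missing
        text = anchor.get_text(strip=True)     # description without surrounding whitespace
        results.append((href, text))
    return results


links = extract_links(html)
for href, text in links:
    print(href, text)
```

If the real page uses different class names, only the selector string passed to soup.select should need to change.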