Hugh Spry - 6 months ago
Python Question

Need to extract data from a website and store it in a list using regex

I have a task that requires me to extract data from a website to form a 'top 10 list'. I have chosen the IMDb Top 250 page, http://www.imdb.com/chart/top.

Specifically, I need a little help using regex to isolate the names of the films and then store them in a list. I already have the HTML stored in a variable as a string (if this is the wrong way of approaching it, let me know).

Also, I am limited to using the urlopen, re and HTMLParser modules.

import HTMLParser
from urllib import urlopen
import re

site = urlopen("http://www.imdb.com/chart/top?tt0468569")
content = site.read()

print content

Answer

You really shouldn't use regex to parse HTML, but you stated in your comment that you have to, so here it is with regex:

import re
import requests

# Fetch the chart page as text
respText = requests.get("http://www.imdb.com/chart/top").text

# Capture the link text inside each titleColumn cell
for title in re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL):
    print(title)
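
Since the question asks for the titles in a list and only allows urlopen, re and HTMLParser, the same pattern can be wrapped in a small helper that returns the first ten matches. This is only a minimal sketch: `extract_titles`, `SAMPLE_HTML` and the film names are hypothetical, and the sample markup is an assumption about how the page wraps each title, so the pattern may need adjusting against the real HTML.

```python
import re

# Hypothetical sample imitating the assumed structure of the chart page.
# With the question's code, the real input would be the `content` string
# returned by urlopen(...).read().
SAMPLE_HTML = """
<td class="titleColumn">
  1.
  <a href="/title/example1/" title="Some Director (dir.)">First Film</a>
  <span class="secondaryInfo">(1994)</span>
</td>
<td class="titleColumn">
  2.
  <a href="/title/example2/">Second Film</a>
  <span class="secondaryInfo">(1972)</span>
</td>
<td class="titleColumn">
  3.
  <a href="/title/example3/">Third Film</a>
</td>
"""

# Same pattern as in the answer: skip to the end of the first tag inside
# the cell (the <a ...> tag), then capture text up to the next "<".
TITLE_RE = re.compile(r'<td class="titleColumn">.+?>(.+?)<', re.DOTALL)


def extract_titles(html, limit=10):
    """Return up to `limit` film titles found in `html`, in page order."""
    return TITLE_RE.findall(html)[:limit]


top_titles = extract_titles(SAMPLE_HTML, limit=2)
print(top_titles)
```

Slicing the result of `re.findall` keeps the page order, so `limit=10` gives the top ten entries as long as the page lists them by rank.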


With BeautifulSoup (which you can't use):

from bs4 import BeautifulSoup

# respText is the page text fetched above
soup = BeautifulSoup(respText, "html.parser")
for item in soup.find_all("td", {"class": "titleColumn"}):
    print(item.find("a").text)
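
HTMLParser is on the list of allowed modules, so a parser-based version without BeautifulSoup is also possible. The following is a rough sketch rather than a tested scraper: `TitleParser` and the sample markup are hypothetical, and it assumes each title is the text of an `<a>` inside a `<td class="titleColumn">` cell, as the snippets above suggest.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question


class TitleParser(HTMLParser):
    """Collect the link text found inside <td class="titleColumn"> cells."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title_cell = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "titleColumn") in attrs:
            self.in_title_cell = True
        elif tag == "a" and self.in_title_cell:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_title_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current title
        if self.in_link:
            self.titles[-1] += data


# Hypothetical sample markup; the real input would be the downloaded page.
sample = (
    '<table><tr><td class="titleColumn">1. '
    '<a href="/title/example1/">First Film</a> <span>(1994)</span></td></tr>'
    '<tr><td class="titleColumn">2. '
    '<a href="/title/example2/">Second Film</a></td></tr></table>'
)

parser = TitleParser()
parser.feed(sample)
parser.close()
top_ten = parser.titles[:10]
print(top_ten)
```

Tracking state with two flags keeps the parser from picking up links outside the title cells, which is the main thing the regex version cannot guarantee.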