i.h4d35 - 6 months ago
Python Question

regex pattern in python for parsing HTML title tags

I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1


This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']


This is because, as I found, the title tag on Facebook's page is as follows: <title id="pageTitle">. To accommodate the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]


How would I combine both so that I can take into account any additional parameters passed within the title tag?

Answer

You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

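import urllib2
from bs4 import BeautifulSoup

url = "https://facebook.com"  # any URL from the list in the question
response = urllib2.urlopen(url)
# Pass the charset from the HTTP headers so the page bytes are decoded correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text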
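For comparison, the same lookup can be sketched with the standard library alone. This is only an illustration and it assumes Python 3, where the parser lives in html.parser (the question's code is Python 2). TitleParser and extract_title are hypothetical names made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are ignored here.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Only keep text that appears between <title ...> and </title>.
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if no title was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip() or None


print(extract_title('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
print(extract_title('<head><title>Google</title></head>'))
```

Because the parser hands the tag name and its attributes to the handler separately, the plain tag and the tag with an id are treated the same way without any special casing.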

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
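To make the nested-tag problem concrete, here is a small sketch (Python 3 assumed) run against a made-up HTML snippet. Neither the non-greedy nor the greedy pattern recovers the actual structure of the nested div elements.

```python
import re

# Made-up snippet with one div nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# Non-greedy: stops at the first </div>, which belongs to the inner div.
non_greedy = re.findall(r"<div>(.+?)</div>", html)

# Greedy: runs to the last </div> and swallows the inner tags as text.
greedy = re.findall(r"<div>(.+)</div>", html)

print(non_greedy)  # ['outer <div>inner']
print(greedy)      # ['outer <div>inner</div> tail']
```

A regular expression has no notion of which closing tag pairs with which opening tag, and that pairing is exactly what a parser keeps track of.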

Your specific problem can be solved by matching additional characters within the title tag, optionally:

r'<title[^>]*>([^<]+)</title>'

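Here [^>]* matches 0 or more characters that are not the closing > bracket. The '0 or more' is what lets the pattern match both a tag with extra attributes and the plain <title> tag. Your <title.+?> version requires at least one character between <title and the closing >, which is why it stopped matching the plain tag.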
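A quick way to check the pattern above is to run it against both forms of the tag. The sketch below assumes Python 3, and the two sample strings are made up for illustration rather than taken from the live pages.

```python
import re

# The pattern from the answer, compiled once for reuse.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Made-up samples: one plain tag, one with an extra attribute.
samples = [
    '<head><title>Google</title></head>',
    '<head><title id="pageTitle">Welcome to Facebook</title></head>',
]

titles = [TITLE_RE.findall(sample) for sample in samples]
print(titles)  # [['Google'], ['Welcome to Facebook']]
```

Note that the pattern is case-sensitive as written. If a page might use <TITLE>, passing re.IGNORECASE to re.compile is one option.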