Kitchi - 5 months ago
Python Question

Handling Indian Languages in BeautifulSoup

I'm trying to scrape the NDTV website for news titles. This is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to handle the HTML, and everything works except that my code breaks when it encounters the Hindi titles on the page I linked to.

My code so far is:

import urllib2
from bs4 import BeautifulSoup

htmlUrl = ""
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll('li')
for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)

The error I get is:

Traceback (most recent call last):
  File "./", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

I got the same error even when I didn't include the from_encoding parameter. I initially passed it as fromEncoding, but Python warned me that this usage was deprecated.

How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!


What you see is a NavigableString instance (which is derived from the Python unicode type):

(Pdb) hypref.encode('utf-8')
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)

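You need to convert to UTF-8 explicitly using hypref.encode('utf-8'), as in the Pdb session above, rather than str(hypref), which on Python 2 tries to encode with the default ASCII codec.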
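As a rough illustration of the explicit-encoding approach, here is a minimal sketch that does not touch the network or BeautifulSoup. It assumes a plain unicode literal can stand in for the NavigableString (reasonable, since NavigableString derives from unicode), and the `title` value and `titles.txt` output path are made up for the example. It should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for a scraped title ("Hindi" written in Devanagari).
# A real NavigableString is a unicode subclass, so it encodes the same way.
title = u"\u0939\u093f\u0928\u094d\u0926\u0940"

# On Python 2, str(title) implicitly attempts this ASCII encode and fails.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields a byte string.
encoded = title.encode("utf-8")

# Alternative: open the output file with an encoding and write unicode
# directly, so no manual encode step is needed at each write.
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")  # hypothetical path
with io.open(out_path, "w", encoding="utf-8") as fptr:
    fptr.write(title + u"\n")

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, "r", encoding="utf-8") as fptr:
    round_trip = fptr.read().strip()
```

In the scraping loop from the question, this would presumably amount to replacing str(hypref) with hypref.encode('utf-8') before writing to a file opened in "w" mode, or to opening the file through io.open with encoding="utf-8" and writing hypref unchanged.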