Om Prakash - 1 year ago
Python Question

BeautifulSoup unable to parse Goa University site

I am working on a parsing project that requires me to parse educational websites. My code is unable to parse the Goa University site: it does not return the expected result.
My code:

from bs4 import BeautifulSoup
import requests

hdrs = {'User-Agent': 'Mozilla / 5.0 (X11 Linux x86_64) AppleWebKit / 537.36 (\
KHTML, like Gecko) Chrome / 52.0.2743.116 Safari / 537.36'}

# url holds the university site's address (not shown in the question)
r = requests.get(url, verify=True, headers=hdrs)
result = BeautifulSoup(r.content, "html.parser")
print(result)

It prints:

<html><head><script type="text/javascript">

instead of the parsed HTML tree. I tried passing an explicit parser
to BeautifulSoup, but that does not work as expected either. Kindly help me.
Thanks in advance.

Answer Source

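The first response appears to be only a small JavaScript stub that redirects the browser, which is why you see a lone script tag. You need to create a session, parse the redirect URL out of that script, and then request that URL: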

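from bs4 import BeautifulSoup
import requests

with requests.Session() as s:
    r = s.get("")  # the university site's URL goes here
    result = BeautifulSoup(r.content, "html.parser")
    # The first <script> assigns the redirect target; take the text after "="
    redirect = result.find("script").text.split("=")[1].strip('";\r\n')
    r2 = s.get(redirect)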

r2.text will then contain the HTML you see on the home page.
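
If the redirect URL itself contains an "=" (for example in a query string), splitting the script text on "=" would cut it short. A somewhat more tolerant way to pull the target out is a regular expression. This is only a sketch: the stub markup, the example.org URL, the function name extract_js_redirect, and the assumption that the script assigns to location or location.href are all hypothetical, since the question shows only the opening script tag.

```python
import re
from urllib.parse import urljoin

# Hypothetical stub page; the real site's markup is not shown in the question.
SAMPLE_STUB = """<html><head><script type="text/javascript">
window.location="https://example.org/home.php?lang=en";
</script></head></html>"""

# Assumed pattern: location = "..." or location.href = '...'
REDIRECT_RE = re.compile(r"""location(?:\.href)?\s*=\s*["']([^"']+)["']""")


def extract_js_redirect(html, base_url=""):
    """Return the URL a script-based redirect points to, or None if absent."""
    match = REDIRECT_RE.search(html)
    if match is None:
        return None
    # Resolve relative targets against the page that served the stub.
    return urljoin(base_url, match.group(1))


print(extract_js_redirect(SAMPLE_STUB))
# → https://example.org/home.php?lang=en
```

Inside the session above, this could be used as s.get(extract_js_redirect(r.text, r.url)), after checking that the result is not None.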
