Om Prakash - 1 month ago
Python Question

BeautifulSoup unable to parse Goa University site

I am working on a scraping project that requires me to parse educational websites. My code fails on the Goa University site: the response is not the page I expect.
My code:

from bs4 import BeautifulSoup
import requests

hdrs = {'User-Agent': 'Mozilla / 5.0 (X11 Linux x86_64) AppleWebKit / 537.36 (\
KHTML, like Gecko) Chrome / 52.0.2743.116 Safari / 537.36'}

url = "https://www.unigoa.ac.in"  # the Goa University home page
r = requests.get(url, verify=True, headers=hdrs)
result = BeautifulSoup(r.content, "html.parser")
print(result)


It prints:

<html><head><script type="text/javascript">
document.location="https://www.unigoa.ac.in/result_redirect.php";
</script>
</head></html>


instead of the parsed HTML tree of the actual page. I tried explicitly passing the lxml and html5lib parsers to BeautifulSoup, but that does not work as expected either. Kindly help me.
Thanks in advance.

Answer

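The server first returns a small page whose script redirects the browser by setting document.location. requests does not execute JavaScript, so the redirect is never followed. You need to create a session, parse the redirect URL out of the script, and request it yourself: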

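with requests.Session() as s:
    s.headers.update(hdrs)  # reuse the User-Agent header from the question
    r = s.get("https://www.unigoa.ac.in")
    result = BeautifulSoup(r.content, "html.parser")
    # The script body is: document.location="https://.../result_redirect.php";
    # Take the part after "=" and strip the quotes, semicolon and newlines.
    redirect = result.find("script").text.split("=")[1].strip('";\r\n')
    # Same session, so any cookies set by the first response are sent again.
    r2 = s.get(redirect)
    print(r2.text)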

r2.text will give you the HTML you see on the home page, which you can then pass to BeautifulSoup as usual.
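The split("=") approach above works for this exact script, but it would cut the URL short if it contained a query string with "=" in it. A slightly more defensive sketch, assuming the redirect is always written as a quoted assignment to document.location or window.location, could use a regular expression instead. The helper name extract_js_redirect and the pattern are illustrative assumptions, not part of requests or BeautifulSoup:

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed redirect forms: document.location="..." or window.location.href = '...'
_REDIRECT_RE = re.compile(
    r"""(?:document|window)\.location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def extract_js_redirect(html, base_url):
    """Hypothetical helper: return the absolute redirect URL found in a
    <script> tag, or None if no matching assignment is present."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # .string is None for empty or external scripts, so fall back to "".
        match = _REDIRECT_RE.search(script.string or "")
        if match:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            return urljoin(base_url, match.group(1))
    return None


# The page body printed in the question, used here as offline sample input.
page = """<html><head><script type="text/javascript">
document.location="https://www.unigoa.ac.in/result_redirect.php";
</script>
</head></html>"""

print(extract_js_redirect(page, "https://www.unigoa.ac.in"))
# prints: https://www.unigoa.ac.in/result_redirect.php
```

In the session code you would call it as redirect = extract_js_redirect(r.text, r.url) and only issue the second s.get(redirect) when the result is not None.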
