ssokhey - 1 month ago
HTML Question

Why is BeautifulSoup not extracting all of HTML from a webpage?

I am trying to extract text from this website: searchgurbani. The site presents old scripture translated line by line into English and Punjabi (an Indian language), which makes it a very good parallel corpus. I have successfully extracted all the English translations into a separate text file, but when I try the same for Punjabi, it returns nothing.

This is the Inspect Element screenshot (the highlighted text is the Punjabi translation):

Screenshot 1

In Screenshot 1, the highlighted text, which belongs to class=lang_16, does not appear in the soup object beautiful, which should contain all of the HTML. Here is the Python code:

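from urllib.request import urlopen
from bs4 import BeautifulSoup

# The page URL is left blank here, as in the original post.
r = urlopen("")
beautiful = BeautifulSoup(r.read().decode('utf-8'), "html5lib")
#beautiful = BeautifulSoup(r.read().decode('utf-8'),"lxml")
punjabi_text = beautiful.find_all(class_="lang_16")
# Write one translated line per row; the file is closed automatically.
with open("1.txt", "w", newline="", encoding="utf-16") as outputFilePunjabi:
    for i in punjabi_text:
        outputFilePunjabi.write(i.get_text())
        outputFilePunjabi.write('\n')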


If I run the same code with class_="lang_4", it works.

To see lang_16 in Inspect Element, please do the following on that web page: go to preferences --> tick "translation of Sri Guru Granth Sahib ji (by S. Manmohan Singh) - Punjabi" under Additional Translations available on Guru Granth Shahib: --> scroll down --> submit changes --> reopen the page.

Please point out where I am going wrong.

(Python version = 3.5)

PS: I have very little experience with web scraping.

Answer

Remember that you suggested doing the following:

Please do the following on that web page: Go to preferences -> Tick "translation of Sri Guru Granth Sahib ji (by S. Manmohan Singh) - Punjabi" under Additional Translations available on Guru Granth Shahib: -> scroll down - submit changes

The same preference has to be applied when you download the page in Python. The site appears to remember it in a cookie, and a plain urlopen call sends no cookies, so the response comes back without the lang_16 elements; BeautifulSoup is parsing everything it receives. In other words, use requests and set the lang_16="yes" cookie to enable the Punjabi translation:

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    response = session.get("https://www.searchgurbani.com/guru_granth_sahib/ang_by_ang", cookies={
        "lang_16": "yes"
    })
    soup = BeautifulSoup(response.content, "html5lib")
    for item in soup.select(".lang_16"):
        print(item.get_text())

Prints:

ਵਾਹਿਗੁਰੂ ਕੇਵਲ ਇਕ ਹੈ। ਸੱਚਾ ਹੈ ਉਸ ਦਾ ਨਾਮ, ਰਚਨਹਾਰ ਉਸ ਦੀ ਵਿਅਕਤੀ ਅਤੇ ਅਮਰ ਉਸ ਦਾ ਸਰੂਪ। ਉਹ ਨਿਡਰ, ਕੀਨਾ-ਰਹਿਤ, ਅਜਨਮਾ ਤੇ ਸਵੈ-ਪ੍ਰਕਾਸ਼ਵਾਨ ਹੈ। ਗੁਰਾਂ ਦੀ ਦਯਾ ਦੁਆਰਾ ਉਹ ਪਰਾਪਤ ਹੁੰਦਾ ਹੈ।
ਉਸ ਦਾ ਸਿਮਰਨ ਕਰ।
ਪਰਾਰੰਭ ਵਿੱਚ ਸੱਚਾ, ਯੁਗਾਂ ਦੇ ਸ਼ੁਰੂ ਵਿੱਚ ਸੱਚਾ,
ਅਤੇ ਸੱਚਾ ਉਹ ਹੁਣ ਭੀ ਹੈ, ਹੇ ਨਾਨਕ! ਨਿਸਚਿਤ ਹੀ, ਉਹ ਸੱਚਾ ਹੋਵੇਗਾ।
...
ਕਈ ਇਕ ਗਾਇਨ ਕਰਦੇ ਹਨ ਕਿ ਵਾਹਿਗੁਰੂ ਪ੍ਰਾਣ ਲੈ ਲੈਂਦਾ ਹੈ ਤੇ ਮੁੜ ਵਾਪਸ ਦੇ ਦਿੰਦਾ ਹੈ।
ਕਈ ਗਾਇਨ ਕਰਦੇ ਹਨ ਕਿ ਹਰੀ ਦੁਰੇਡੇ ਮਲੂਮ ਹੁੰਦਾ ਅਤੇ ਸੁੱਝਦਾ ਹੈ।
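
If you would rather stay with urlopen, as in your original code, the same cookie can probably be sent by hand in a Cookie header. The following is only a minimal sketch under that assumption: build_request is a hypothetical helper name (not part of urllib), and it has not been checked against the live site. The cookie name and value are the ones used above.

```python
from urllib.request import Request


def build_request(url, cookies):
    """Return a Request carrying the given cookies in a single Cookie header.

    build_request is a hypothetical helper for illustration only.
    """
    # A Cookie header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(
        "{}={}".format(name, value) for name, value in cookies.items()
    )
    return Request(url, headers={"Cookie": cookie_header})


# Same URL and cookie as in the requests example above.
request = build_request(
    "https://www.searchgurbani.com/guru_granth_sahib/ang_by_ang",
    {"lang_16": "yes"},
)

# Inspect the header that would be sent; no network access happens here.
print(request.get_header("Cookie"))
```

Passing request to urlopen in place of the bare URL should then fetch the page with the preference applied, and the rest of your code (find_all(class_="lang_16") and the file writing) can stay as it is.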