CL. L - 3 months ago
Python Question

Python 3.5: Stripping HTML tags when web scraping

I am scraping web content but am stuck on a problem. After several processing steps to narrow the results down to the part I want, I cannot strip the HTML tags so that the items in the list become plain text. I have tried replace, re.compile and join (to turn the list into text before stripping). None of them work: they are designed for strings, or they raise errors when run.

Could anyone give me a hint on how to do that? For example, I want the output of the following code to change from

<p class="course-d-title">Instructor</p>

to

Instructor

import tkinter as tk
import re

def test():
    from bs4 import BeautifulSoup
    import urllib.request
    from urllib.parse import urljoin

    '''for layer 0'''
    url_text = 'http://www.scs.cuhk.edu.hk/en/part-time/accounting-and-finance/accounting-and-finance/fundamental-accounting/162-610441-01'
    resp = urllib.request.urlopen(url_text)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
    a = soup.find_all('p')

    # Drop everything before the 'Instructor' paragraph.
    k = 0
    for item in a[:]:
        if 'Instructor' in item:
            a = a[k:]
            break
        k += 1

    # Cut the list off before 'Enquiries' (the j-1 bound also drops the paragraph just before it).
    j = 0
    for item in a[:]:
        if 'Enquiries' in item:
            a = a[:j-1]
            break
        j += 1

    # Each item is still a bs4 Tag, so this prints the full <p>...</p> markup.
    for item in a:
        print(item)

if __name__ == '__main__':
    test()

Answer

Use .text to extract the text from a bs4 element:

>>> a = soup.find_all('p')
>>> data = [item for item in a if 'Instructor' in item]
>>> data
[<p class="course-d-title">Instructor</p>]

>>> data[0].text
'Instructor'
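
To get plain text for the whole slice rather than a single element, something along these lines may help. It is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the course page so that it runs offline, section_texts is a hypothetical helper name, and it assumes you want every paragraph from 'Instructor' up to, but not including, 'Enquiries' (the question's j-1 bound cuts one paragraph earlier). It uses get_text(strip=True), which returns the same text as .text with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real course page, so the sketch runs offline.
SAMPLE_HTML = """
<div>
  <p class="course-d-title">Overview</p>
  <p class="course-d-title">Instructor</p>
  <p>Dr. Example</p>
  <p class="course-d-title">Enquiries</p>
  <p>1234 5678</p>
</div>
"""

def section_texts(html, start="Instructor", stop="Enquiries"):
    """Return the text of each <p> from `start` up to (not including) `stop`."""
    soup = BeautifulSoup(html, "html.parser")
    # Convert every Tag to a plain string first; slicing strings is then trivial.
    texts = [p.get_text(strip=True) for p in soup.find_all("p")]
    begin = texts.index(start)       # raises ValueError if `start` is missing
    end = texts.index(stop, begin)   # search for `stop` only after `start`
    return texts[begin:end]

print(section_texts(SAMPLE_HTML))
```

Converting to strings before slicing means the 'Instructor' and 'Enquiries' checks become ordinary string comparisons, so replace, re and join would also work on the result if further cleanup is needed.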