Burak Burak - 1 month ago 15
Python Question

Web scraping with BeautifulSoup

I'm trying to parse only one particular part of a website. My code is below. Is there a more efficient way to do it?

from bs4 import BeautifulSoup
import requests
import urllib.request
import json

soup = BeautifulSoup(requests.get("http://www.example.com").content, "html.parser")

for d in soup.select('script[type="text/javascript"]'):
    print(d.text[2300:2600])


Here is the output I need:

> dataLayer = [{
> 'page':'ProductPage',
> 'OAM':'False',
> 'storeNum':'075',
> 'brand':'Seagate',
> 'productPrice':'69.99',
> 'SKU':'106674',
> 'productID':'467336',
> 'mpn':'ST2000DM006',
> 'ean':'763649110218',
> 'category':'Internal Hard Drives',
> 'isMobile':'False' }];

Answer

This relies on fixed positions, so the indexes may differ on other pages (I didn't check any others):

for d in soup.select('script[type="text/javascript"]')[27].text.split('\n')[51:62]:
    print(d.strip())

Result:

'page':'ProductPage',
'OAM':'False',
'storeNum':'029',
'brand':'Microsoft',
'productPrice':'129.99',
'SKU':'883785',
'productID':'456088',
'mpn':'QC7-00001',
'ean':'889842010060',
'category':'Tablet Accessories',
'isMobile':'False'
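
If the hard-coded indexes ([27] and [51:62]) turn out to be fragile, one option is to pick the script by what it contains rather than by position. This is only a sketch: it assumes the block always contains the literal text `dataLayer = [{`, and both the `find_data_layer_script` name and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: this marker appears only in the script we are looking for.
MARKER = "dataLayer = [{"


def find_data_layer_script(html):
    """Return the text of the first <script> containing MARKER, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        # .string is None for an empty <script>, so fall back to "".
        text = script.string or ""
        if MARKER in text:
            return text
    return None


# Made-up sample page standing in for the real response body.
sample_html = """
<html><head>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">
dataLayer = [{
'page':'ProductPage',
'brand':'Seagate' }];
</script>
</head><body></body></html>
"""

script_text = find_data_layer_script(sample_html)
```

With the real page you would pass `requests.get(...).content` instead of `sample_html`. The function returns `None` when no script matches, so a changed page layout shows up as a missing result rather than as the wrong lines.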

EDIT: another version, which searches for the start and end markers instead of using fixed line numbers:

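# Assumes the dataLayer block is in the last <script> inside <head>.
text = soup.select('head script[type="text/javascript"]')[-1].text

# Slice out everything between 'dataLayer = [{' and the closing '}];'.
start = text.find('dataLayer = [{') + len('dataLayer = [{')
end = text.rfind('}];')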

rows = text[start:end].strip().split('\n')

for d in rows:
    print(d.strip())
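
If you want the values as a dictionary rather than printed lines, something like the following might work. It is a rough sketch and not tested against the real site: it assumes the block contains only single-quoted strings (as in the output shown in the question), in which case it also happens to be a valid Python literal that `ast.literal_eval` can read. `parse_data_layer` and the sample text are hypothetical.

```python
import ast
import re

# Matches "dataLayer = [ ... ];" and captures the bracketed list.
DATA_LAYER_RE = re.compile(r"dataLayer\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def parse_data_layer(script_text):
    """Return the first dataLayer entry as a dict, or None if it is absent."""
    match = DATA_LAYER_RE.search(script_text)
    if match is None:
        return None
    # Assumption: only quoted keys and values, so the captured JavaScript
    # is also a valid Python list-of-dicts literal.
    entries = ast.literal_eval(match.group(1))
    return entries[0] if entries else None


# Made-up script text in the same shape as the output in the question.
sample = """
var x = 1;
dataLayer = [{
'page':'ProductPage',
'brand':'Seagate',
'productPrice':'69.99',
'isMobile':'False' }];
"""

data = parse_data_layer(sample)
print(data["brand"], data["productPrice"])  # prints: Seagate 69.99
```

Every value comes back as a string (including 'False' and the price), so convert them yourself where needed. If the page ever uses unquoted keys or JavaScript `true`/`false`, `ast.literal_eval` will raise an error and a different parser would be needed.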