corrado1972 - 3 months ago
Python Question

Python BeautifulSoup parses a table incorrectly

I tried to parse a table and write its data to CSV, but BeautifulSoup doesn't parse the table correctly.
This is the page:
http://projects.fivethirtyeight.com/2016-election-forecast/arizona/

This is the code I'm using:

date=[]
pollster=[]
grade=[]
sample=[]
weight=[]
clinton=[]
trump=[]
johnson=[]
leader=[]
adjusted=[]

import requests
from bs4 import BeautifulSoup
url='http://projects.fivethirtyeight.com/2016-election-forecast/florida/'
r = requests.get(url)
soup=BeautifulSoup(r.content,"lxml")
the_table=soup.find("table", attrs={"class":"t-desktop t-polls"})
rows = the_table.tbody.find_all('tr')
for row in rows:
    if 'data-created' in row.attrs:
        cols = row.find_all('td')
        text_cols = [ele.text.strip() for ele in cols]
        date.append(text_cols[2])
        pollster.append(text_cols[3])
        grade.append(text_cols[4])
        sample.append(text_cols[5])
        weight.append(text_cols[6])
        clinton.append(text_cols[7])
        trump.append(text_cols[8])
        johnson.append(text_cols[9])
        leader.append(text_cols[10])
        adjusted.append(text_cols[11])

import pandas as pd
df=pd.DataFrame(date,columns=['date'])
df['pollster']=pollster
df['grade']=grade
df['sample']=sample
df['weight']=weight
df['clinton']=clinton
df['trump']=trump
df['johnson']=johnson
df['leader']=leader
df['adjusted']=adjusted
from urllib.parse import urlparse
s=urlparse(url)
import os
f=os.getcwd()+"/"+s.path.split('/')[-2] + '.csv'
df.to_csv(f)
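
Ten parallel lists can drift out of alignment if one append is missed. As a minimal sketch of an alternative, each table row can be collected as one record and the file name derived from the URL. This uses only the standard library instead of pandas. The names `COLUMNS`, `FIRST_CELL`, `row_to_record`, `rows_to_csv` and `csv_name_from_url` are illustrative assumptions, the sample cell values are made up, and the offset of 2 simply mirrors the indices used in the code above.

```python
import csv
import io
from urllib.parse import urlparse

# Column names in the order the question's code reads them (cells 2..11).
COLUMNS = ["date", "pollster", "grade", "sample", "weight",
           "clinton", "trump", "johnson", "leader", "adjusted"]
FIRST_CELL = 2  # assumption: mirrors text_cols[2] in the code above


def row_to_record(text_cols):
    """Map one row's stripped cell texts onto the named columns."""
    cells = text_cols[FIRST_CELL:FIRST_CELL + len(COLUMNS)]
    return dict(zip(COLUMNS, cells))


def rows_to_csv(rows):
    """Serialize a list of cell-text lists to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for text_cols in rows:
        writer.writerow(row_to_record(text_cols))
    return buf.getvalue()


def csv_name_from_url(url):
    """Use the last non-empty path segment of the URL as the file name."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] + ".csv"


# Made-up cell values, only to show the shape of the data.
sample_rows = [
    ["", "", "Aug. 1-2", "Example Poll", "B", "500", "0.50",
     "40%", "41%", "5%", "Trump +1", "Tie"],
]
csv_text = rows_to_csv(sample_rows)
file_name = csv_name_from_url(
    "http://projects.fivethirtyeight.com/2016-election-forecast/florida/")
```

Keeping one record per row means a short or malformed row leaves blank cells in that row only, rather than shifting every later value in a single column.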


It saves a CSV with the wrong data:

,date ,pollster ,grade,sample ,weight,clinton,trump,johnson,leader ,adjusted
0,Aug. 21-27,USC Dornsife/LA Times, ,"2,545",LV ,44% ,44% , ,Clinton +1 ,Clinton +4
1,Aug. 24-26,Morning Consult , ,"2,007",RV ,39% ,37% ,8% ,Clinton +2 ,Clinton +2
2,Aug. 20-26,USC Dornsife/LA Times, ,"2,460",LV ,45% ,43% , ,Clinton +1 ,Clinton +5
3,Aug. 19-25,Ipsos ,A- ,334 ,LV ,50% ,43% , ,Clinton +7 ,Clinton +7
4,Aug. 19-25,Ipsos ,A- ,500 ,LV ,53% ,31% , ,Clinton +22,Clinton +22
5,Aug. 19-25,Ipsos ,A- ,443 ,LV ,32% ,45% , ,Trump +13 ,Trump +13
6,Aug. 19-25,Ipsos ,A- ,518 ,LV ,61% ,25% , ,Clinton +36,Clinton +36
7,Aug. 19-25,Ipsos ,A- ,392 ,LV ,47% ,41% , ,Clinton +7 ,Clinton +7
8,Aug. 19-25,Ipsos ,A- ,666 ,LV ,49% ,42% , ,Clinton +7 ,Clinton +7
and so on.....


If I change the BeautifulSoup parser, the table is still parsed incorrectly.
If I manually save the table copied from the Chrome inspector or Firefox Firebug, it works. Here is the correct CSV generated that way:

,date ,pollster,grade ,sample,weight,clinton,trump,johnson,leader ,adjusted
0 ,Ipsos ,A- ,362 ,LV ,0.67 ,43% ,46% , ,Trump +3 ,Trump +3
1 ,CNN/Opinion Research Corp. ,A- ,809 ,LV ,1.40 ,38% ,45% ,12% ,Trump +7 ,Trump +7
2 ,Ipsos ,A- ,438 ,LV ,0.25 ,39% ,47% , ,Trump +8 ,Trump +8
3 ,YouGov ,B ,"1,095",LV ,0.65 ,42% ,44% ,5% ,Trump +2 ,Trump +1
4 ,OH Predictive Insights / MBQF,C+ ,996 ,LV ,0.44 ,45% ,42% ,4% ,Clinton +3,Clinton +2
5 ,Integrated Web Strategy , ,679 ,LV ,0.35 ,41% ,49% ,3% ,Trump +8 ,Trump +5
6 ,Public Policy Polling ,B+ ,691 ,V ,0.49 ,40% ,44% , ,Trump +4 ,Trump +1
7 ,OH Predictive Insights / MBQF,C+ ,"1,060",LV ,0.16 ,47% ,42% , ,Clinton +4,Clinton +4
8 ,Greenberg Quinlan Rosner ,B- ,300 ,LV ,0.23 ,39% ,45% ,10% ,Trump +6 ,Trump +6
9 ,Public Policy Polling ,B+ ,896 ,V ,0.20 ,38% ,40% ,6% ,Trump +2 ,Tie
10,Behavior Research Center ,A ,564 ,RV ,0.16 ,42% ,35% , ,Clinton +7,Clinton +5
11,Merrill Poll ,B ,701 ,LV ,0.11 ,38% ,38% , ,Tie ,Tie
12,Strategies 360 ,B ,504 ,LV ,0.03 ,42% ,44% , ,Trump +2 ,Tie


Why does BeautifulSoup parse the table incorrectly when it is given the whole HTML fetched from the web?

[EDIT: SOLVED]
This code extracts the JSON object race.stateData from the script tag using a regular expression. The data can then be parsed from that object.

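import json
import re

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# Take the text of the first <script> tag inside <body>
script = soup.body.script.text
script = script.replace("\n", "")
re_match = re.match(r'.*race\.stateData = (.*);race\.path', script)
str_json = re_match.group(1)
j = json.loads(str_json)
# the code that parses the data is not relevant here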
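
To try the extraction step without fetching the page, here is a minimal offline sketch of the same idea. The `script_text` sample is hypothetical and only imitates the shape of the assignment; the real payload and its fields are not shown here. It is a variation on the snippet above, not the same code: it uses `re.search` with `re.DOTALL` and a lazy group instead of flattening newlines, and `extract_state_data` is an illustrative helper name.

```python
import json
import re

# Hypothetical script content shaped like the assignment described above;
# the keys and values are invented for illustration.
script_text = """
race.model = "polls";
race.stateData = {"state": "AZ", "polls": [{"pollster": "Example", "n": 500}]};
race.path = "/2016-election-forecast/";
"""

# DOTALL lets "." span newlines, so the text need not be flattened first.
# The lazy (.*?) stops at the first ";" that is followed by "race.path".
PATTERN = re.compile(r"race\.stateData\s*=\s*(.*?);\s*race\.path", re.DOTALL)


def extract_state_data(text):
    """Return the parsed race.stateData object, or None if not found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


state_data = extract_state_data(script_text)
missing = extract_state_data("var unrelated = 1;")
```

Returning `None` when the pattern is absent avoids the `AttributeError` that calling `.group(1)` on a failed match would raise, which helps if the page layout changes.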

Answer

As you can see in the comments, I solved it by extracting the JSON object race.stateData from the script tag with a regular expression. The data can then be parsed from that object.
