Eric Gorlin - 10 days ago
Python Question

BeautifulSoup Parses Table Incorrectly

I'm having trouble getting Beautiful Soup to properly process a large table of play-by-play basketball data. Code:

import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('http://www.basketball-reference.com/boxscores/pbp/201611220LAL.html')
result = urllib.request.urlopen(request)
resulttext = result.read()
soup = BeautifulSoup(resulttext, "html.parser")

pbpTable = soup.find('table', id="pbp")


If you run this example yourself, you will find that the table is not fully parsed; all we get out is this:

<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp">
<caption>Play-By-Play Table</caption>
<tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table>


The problem is in the parsing itself: printing the soup variable gives (among other things):

</div>
<div class="table_wrapper" id="all_pbp">
<div class="section_heading">
<span class="section_anchor" data-label="Play-By-Play" id="pbp_link"></span>
<h2>Play-By-Play</h2> <div class="section_heading_text">
<ul> <li>  Jump to: <a href="#q1">1st</a> | <a href="#q2">2nd</a> | <a href="#q3">3rd</a> | <a href="#q4">4th</a> <br> <span class="bbr-play-score key">scoring play</span> <span class="bbr-play-tie key">tie</span> <span class="bbr-play-leadchange key">lead change</span></br></li>
</ul>
</div>
</div> <div class="table_outer_container">
<div class="overthrow table_container" id="div_pbp">
<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp"><caption>Play-By-Play Table</caption><tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table></div></div></div></div></div></body></html>


Most importantly, a </table> closing tag appears out of nowhere. Viewing the page source of the relevant link, we can see that the table is not closed there; it goes on for a while. Is there any fix for this besides implementing my own HTML parsing code?

Answer

Use "lxml" or "html5lib" instead of "html.parser" when constructing the soup:

soup = BeautifulSoup(resulttext, "lxml")

With either of these parsers you get more of the table's data.
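To check how much of the table each parser actually recovers, you can count the rows it finds. This is only a rough sketch: `count_rows` and the tiny `sample` markup are made-up names for illustration, and on the real page you would pass `resulttext` instead of `sample`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows(markup, table_id, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> elements found in the table}.

    Parsers that are not installed are skipped.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
        table = soup.find("table", id=table_id)
        counts[name] = len(table.find_all("tr")) if table is not None else 0
    return counts


# Hypothetical stand-in for the downloaded page.
sample = '<table id="pbp"><tr><th>1st Q</th></tr><tr><td>Jump ball</td></tr></table>'
print(count_rows(sample, "pbp"))
```

On well-formed markup like `sample` every parser should agree; a large gap between the counts on the real page would point to the parser rather than to your own code.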

You may have to install lxml or html5lib first if you don't have them yet:

pip install lxml

pip install html5lib

If lxml has to be built from source on your system, it may need a C/C++ compiler, the libxml library (libxml.dll on Windows), etc.
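If you are not sure which parsers will be installed where the script runs, one option is to try them in order of preference and fall back to the built-in one. A minimal sketch, assuming the hypothetical helper name `make_soup` and a made-up `sample` string (neither comes from Beautiful Soup itself):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) tuple so the caller can tell which one was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Hypothetical stand-in for the downloaded page.
sample = '<table id="pbp"><tr><th>1st Q</th></tr><tr><td>Jump ball</td></tr></table>'
soup, used = make_soup(sample)
print(used, len(soup.find_all("tr")))
```

Since "html.parser" ships with Python, it is listed last so the function always has something to fall back on; keep in mind that the fallback would bring back the truncated table from the question.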
