j.zheng - 1 month ago
Python Question

Retrieve a table from a website when the data for each row is in 'data-append-csv'

I'm trying to scrape data with BeautifulSoup from the website below:

http://www.basketball-reference.com/players/a/


It contains a table of data for all basketball players. When I inspect the HTML source, it seems that in each table row ('tr') the player data is contained in 'data-append-csv'. Here is a snapshot of one 'tr' from the players table:

<tr data-row="0"><th scope="row" class="left " data-append-csv="abdelal01" data-stat="player"></th></tr>


How should I extract data from each table row?

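import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse it with the built-in HTML parser
    thePage = urllib.request.urlopen(url)
    soup = BeautifulSoup(thePage, 'html.parser')
    return soup

r = 'http://www.basketball-reference.com/players/a/'
soup = make_soup(r)
# [1:] skips the first <tr> on the page
for record in soup.find_all('tr')[1:]:
    print(record.text)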


This is the first record shown:

Alaa Abdelnaby19911995F-C6-10240June 24, 1968Duke University


All the data is in a single string with no separation.
How should I extract the whole data table? Thanks a lot for the help!

Answer

I'm not sure I've understood what you are trying to do, but here is an example that extracts all the player data from the page you provided. It's not the most beautiful (soup :) ) one, but it should give you an idea of how to handle things. It's also not the only way to do it, just the first one that came to mind.

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.basketball-reference.com/players/a/")
soup = BeautifulSoup(page.content,'html.parser')

for record in soup.find_all('tr'):
    try:  # Crude way of handling the NavigableString error that pops up with these multi-tag lines
        print(record.contents[0].text)
        print(record.contents[1].text)
        print(record.contents[2].text)
        print(record.contents[3].text)
        print(record.contents[4].text)
        print(record.contents[5].text)
        print(record.contents[6].text)
        print(record.contents[7].text)
    except (AttributeError, IndexError):
        # Skip rows with fewer than eight children or with children that are not tags
        pass
    print('\n')
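
If the goal is one list of values per player rather than a column of printed lines, a variation along these lines may help. It is only a sketch: the html string is a trimmed copy of the row printed further down, used so the snippet runs without a network request, and rows is a name chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample row (assumption: the live page uses the same markup shape).
html = """
<table>
<tr><th data-append-csv="abdelal01" data-stat="player"><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td data-stat="year_min">1991</td><td data-stat="year_max">1995</td><td data-stat="pos">F-C</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for record in soup.find_all('tr'):
    # Passing a list of names returns only <th>/<td> tags, so stray strings never show up
    cells = record.find_all(['th', 'td'])
    rows.append([cell.get_text(strip=True) for cell in cells])

for row in rows:
    print(', '.join(row))
```

Each entry in rows is then an ordinary list of strings that can be indexed or written out with the csv module.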

EDIT

Here's how the code works.

First, the 'for' loop looks for all occurrences of <tr></tr>. Every item it returns opens with a <tr> tag and closes with a </tr> tag, as in the example below:

for record in soup.find_all('tr'):
    print(record)

<tr><th class="left " data-append-csv="abdelal01" data-stat="player" scope="row"><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td class="right " data-stat="year_min">1991</td><td class="right " data-stat="year_max">1995</td><td class="center " data-stat="pos">F-C</td><td class="right " csk="82.0" data-stat="height">6-10</td><td class="right " data-stat="weight">240</td><td class="left " csk="19680624" data-stat="birth_date"><a href="/friv/birthdays.cgi?month=6&amp;day=24">June 24, 1968</a></td><td class="left " data-stat="college_name"><a href="/friv/colleges.cgi?college=duke">Duke University</a></td></tr>

So we end up with a complete <tr></tr> element. Now we use .contents to get its child elements as a list:

for record in soup.find_all('tr'):
    print(record.contents)

[<th class="left " data-append-csv="abdelal01" data-stat="player" scope="row"><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th>, <td class="right " data-stat="year_min">1991</td>, <td class="right " data-stat="year_max">1995</td>, <td class="center " data-stat="pos">F-C</td>, <td class="right " csk="82.0" data-stat="height">6-10</td>, <td class="right " data-stat="weight">240</td>, <td class="left " csk="19680624" data-stat="birth_date"><a href="/friv/birthdays.cgi?month=6&amp;day=24">June 24, 1968</a></td>, <td class="left " data-stat="college_name"><a href="/friv/colleges.cgi?college=duke">Duke University</a></td>]
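
The list above holds only tags because that row has no whitespace between its cells. When the markup has line breaks between cells, .contents also keeps those as NavigableString items, which is presumably the case the try/except in the first snippet guards against. A small sketch, using a made-up two-cell row rather than one copied from the site, that filters them out:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical row with line breaks between the cells (not copied from the site).
html = """<tr>
<th data-stat="player">Alaa Abdelnaby</th>
<td data-stat="year_min">1991</td>
</tr>"""
record = BeautifulSoup(html, 'html.parser').tr

# .contents keeps every direct child, including the whitespace-only strings
print(len(record.contents))

# Keep only real tags before indexing into the list
cells = [child for child in record.contents if isinstance(child, Tag)]
print([cell.text for cell in cells])
```

Filtering this way makes cells[0], cells[1] and so on line up with the table columns regardless of how the HTML is indented.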

It just got a lot easier, since we are working with a list and not one long string. Using [n] we can access the n-th item in the list (counting from 0). Let's print the very first item:

for record in soup.find_all('tr'):
    print(record.contents[0])

<th class="left " data-append-csv="abdelal01" data-stat="player" scope="row"><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th>

As you can see, we get two tags, <th> and <a>, plus the text between them, which is not a tag itself. That's what .text is for: it drops all the tags and grabs only the actual text displayed on the site.

for record in soup.find_all('tr'):
    print(record.contents[0].text)

Alaa Abdelnaby
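
Two related points may be worth a sketch, since the question mentions data-append-csv and text that runs together. Tag attributes can be read much like dictionary keys, and get_text() accepts a separator. The row below is trimmed from the one shown above; player_id and line are just illustrative names.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from this answer.
html = ('<tr><th data-append-csv="abdelal01" data-stat="player">'
        '<a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th>'
        '<td data-stat="year_min">1991</td>'
        '<td data-stat="year_max">1995</td></tr>')
record = BeautifulSoup(html, 'html.parser').tr

# .get() returns None instead of raising KeyError when the attribute is missing
player_id = record.th.get('data-append-csv')
print(player_id)

# Join the text of every cell with a separator instead of one unbroken string
line = record.get_text(separator=',')
print(line)
```

Note that a comma separator becomes ambiguous for cells whose text already contains a comma, such as the birth date, so collecting the cells into a list is usually the safer route.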

Hope it helps :)