GBR24 - 4 months ago
Python Question

Click button, then scrape data on seemingly static webpage?

I'm trying to scrape the player statistics in the Totals table at this link: http://www.basketball-reference.com/players/j/jordami01.html. The data is much harder to scrape in the format shown when you first land on that page, but you have the option of clicking 'CSV' right above the table. That format would be much easier to digest.

I'm having trouble getting the click to work. Here is what I have so far:

import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
browser.get(player_link)
elem = browser.find_element_by_xpath("//span[@class='tooltip' and @onlick='table2csv('totals')']")
elem.click()


When I run this, a Firefox window pops up, but the code never changes the table from its original format to CSV. The CSV table only shows up in the page source after I click CSV (obviously). How can I get selenium to click that CSV button and then BS to scrape the data?

Answer

You don't need BeautifulSoup here. Click the CSV button with selenium, extract the contents of the pre element that appears with the CSV data, and parse it with the built-in csv module:

import csv
from StringIO import StringIO

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_page_load_timeout(10)

# stop load after a timeout
try:
    browser.get(player_link)
except TimeoutException:
    browser.execute_script("window.stop();")

# click "CSV"
elem = wait.until(EC.presence_of_element_located((By.XPATH,  "//div[@class='table_heading']//span[. = 'CSV']")))
elem.click()

# get CSV data
csv_data = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "pre#csv_totals"))).text.encode("utf-8")
browser.close()

# read CSV
reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
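
The code above targets Python 2. On Python 3 the parsing step changes slightly: StringIO lives in the io module, and the .encode("utf-8") call can be dropped because selenium's .text is already a str. Below is a minimal sketch of that step on its own, without a browser. The function name parse_csv_text and the sample text are hypothetical; the sample stands in for the contents of the pre element, and its column names and values are invented for illustration.

```python
import csv
from io import StringIO


def parse_csv_text(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    # Drop blank lines so the first non-empty line is used as the header.
    lines = [line for line in csv_text.splitlines() if line.strip()]
    reader = csv.DictReader(StringIO("\n".join(lines)))
    return [dict(row) for row in reader]


# Hypothetical sample standing in for the text of the pre element.
# These column names and values are made up, not real statistics.
sample = "Season,Tm,G,PTS\n2000-01,AAA,10,100\n2001-02,BBB,20,250\n"

rows = parse_csv_text(sample)
for row in rows:
    print(row["Season"], row["PTS"])
```

Using csv.DictReader instead of csv.reader lets you look up values by column name rather than by position, which may be more convenient if the column order on the page changes.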