TrolliOlli - 7 days ago
Python Question

Scraping Web data with Python

Sorry if this is not the right place for this question, but I'm not sure where else to ask.

I'm trying to scrape data from rotogrinders.com and I'm running into some challenges.

In particular, I want to scrape previous NHL game data using URLs of this format (you can change the date to get other days' data):
https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016

However, when I open the page, the data is split across several pages, and I'm unsure how to make my script retrieve the data that appears after clicking the "all" button at the bottom of the page.

Is there a way to do this in Python? Perhaps some library that allows button clicks? Or is there some way to get the data without actually clicking the button by being clever about the URL/request?

Answer

Actually, things are not that complicated in this case. When you click "All", no network requests are issued. All the data is already there, inside a script tag in the HTML. You just need to extract it.
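To see why a regular expression is enough here, this is a minimal offline sketch using only the standard library. The sample markup is invented for illustration (the "tags" field and the surrounding variables are hypothetical, not taken from the real page), and `extract_players` is a made-up helper name. It only assumes the array is assigned as `var data = [...];` at the end of a line.

```python
import json
import re

# Invented sample markup; the real page will contain far more than this.
SAMPLE_HTML = """
<html><body>
<script>
var unrelated = 1;
var data = [{"player": "Jeff Skinner", "tags": ["C", "LW"]}, {"player": "Jordan Staal", "tags": []}];
var other = 2;
</script>
</body></html>
"""

# The non-greedy group keeps growing until it reaches a "]" that is
# followed by ";" at the end of a line, so nested arrays are skipped over.
PATTERN = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)


def extract_players(html):
    """Return the embedded array as a Python list, or [] if it is absent."""
    match = PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


players = extract_players(SAMPLE_HTML)
print([p["player"] for p in players])
```

Returning an empty list when nothing matches is a design choice for this sketch; raising an error may suit a real scraper better, since a missing array probably means the page layout changed.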

Working code using requests (to download the page content), BeautifulSoup (to parse HTML and locate the desired script element), re (to extract the desired "player" array from the script) and json (to load the array string into a Python list):

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)

# string= is the current name of the older text= argument (bs4 4.4.0+)
script = soup.find("script", string=pattern)

data = pattern.search(script.string).group(1)
data = json.loads(data)

# printing player names for demonstration purposes
for player in data:
    print(player["player"])

Prints:

Jeff Skinner
Jordan Staal
...
William Carrier
A.J. Greer
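Since the question mentions changing the date to get other days' data, here is a small sketch of building those URLs. It assumes the MM-DD-YYYY format seen in the example URL holds for every date; `build_url` and `urls_for_range` are hypothetical helper names, not part of any library.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://rotogrinders.com/game-stats/nhl-skater"


def build_url(day, site="draftkings"):
    """Build the stats URL for one day (date format assumed from the example URL)."""
    query = urlencode({"site": site, "date": day.strftime("%m-%d-%Y")})
    return f"{BASE_URL}?{query}"


def urls_for_range(start, days):
    """Return URLs for `days` consecutive days beginning at `start`."""
    return [build_url(start + timedelta(days=offset)) for offset in range(days)]


for url in urls_for_range(date(2016, 11, 22), 3):
    print(url)
```

Each URL can then be passed to the same download and extraction steps shown above. If you loop over many dates, pausing between requests is a good idea.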