CSO90 - 25 days ago
Python Question

Webscraping a Table that does not use <table> (Python)

I am very new to programming, so I apologize if this is actually simple. I have a very basic knowledge of Python and have been trying to learn how to extract the table on this page: https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings. The problem is that the table is not set up as a traditional HTML table; it is built from <div>s and seems to be populated by a script. I've been searching hard for a similar situation that was solved, but I'm not sure I'm searching correctly. Here is my code so far:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings")

soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('div', attrs={'class': 'bat'})

print(table.prettify())


I did not get very far since I ran right into this problem. If you know of a possible solution or an example I can learn from please let me know, thank you!

Answer Source

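This is a situation where Selenium comes in handy, combined with BeautifulSoup. Selenium drives a real browser, so the scripts that populate the grid actually run before you read the HTML, which requests alone cannot do. Besides these two, you usually also need to inspect the page's elements carefully in your browser.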

In this case, I used Firefox (which requires geckodriver to be installed where Selenium can find it, typically on your PATH), but Chrome or whatever your browser of choice is works too.
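One thing to watch with a script-populated page: the HTML can be read before the script has finished filling in the grid. Here is a rough sketch of waiting for the grid first, assuming that the presence of a div with class rgt-col (the class the extraction code below looks for) is a reasonable sign that the data has rendered. The helper name fetch_rendered_html, the default selector and the 15 second timeout are my own illustrative choices, not something the site or Selenium prescribes.

```python
def fetch_rendered_html(url, css_selector="div.rgt-col", timeout=15):
    """Load url in Firefox and return the page source once css_selector exists.

    Hypothetical helper: the name, default selector and timeout are
    illustrative assumptions, not part of the original answer.
    """
    # Imported lazily so this file can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until at least one matching element is present in the DOM;
        # raises TimeoutException if nothing shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Close the browser even if the wait times out.
        driver.quit()
```

If the straightforward version below ever gives you an empty list for player_data, a wait like this is probably the first thing to try.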

from selenium import webdriver
from bs4 import BeautifulSoup
from collections import OrderedDict
import more_itertools

# open Firefox to get the data

driver = webdriver.Firefox()
driver.get('https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# extract data from BeautifulSoup object

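# the code assumes every column of the grid is a <div class="rgt-col">;
# collect the text of each <div> nested inside those columns
player_data = soup.find_all('div', attrs={'class': 'rgt-col'})
text = [y.text for x in player_data for y in x.descendants if y.name == 'div']

# each column is assumed to contribute 250 entries: 1 header followed by
# 249 player values, so every 250th entry is a header (these become the keys)
indices_to_delete = list(range(0, len(text), 250))
keys = [text[k] for k in indices_to_delete]

# drop the headers, cut the rest into columns of 249 values,
# then transpose so that each tuple holds one player's row
new_text = [x for x in text if x not in keys]
text = list(more_itertools.sliced(new_text, 249))
new_text = list(zip(*text))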

# build the dict

players = OrderedDict()

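for x in new_text:
    # pair each header with this player's value; 'Player' has to be the
    # first key so current_player is set before the other stats are stored
    y = list(zip(keys, x))
    for key, val in y:
        if key == 'Player':
            players[val] = {}
            current_player = val
        else:
            players[current_player][key] = val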

... so, when you print(players), you get a nice OrderedDict:

OrderedDict([
    ('DeAndre Hopkins', {
        'Salary': '$6200', 
        'Pos': 'WR', 
        'Opp': 'NEP', 
        'Team': 'HOU', 
        'GP': '2', 
        'Targets': '29', 
        'RzTar': '3', 
        'PoW Tar': '48.33%', 
        'Week 1': '16', 
        'Week 2': '13', 
        'Week 3': '\xa0', 
        'Week 4': '\xa0', 
        'Yards': '128', 
        'YPT': '4.41', 
        'Rec': '14', 
        'Rec Rate': '48.28%'}), 
    ('Dez Bryant', {
        'Salary': '$6800', 
        'Pos': 'WR', 
        'Opp': 'ARI', 
        'Team': 'DAL', 
        'GP': '2', 
        'Targets': '25', 
        'RzTar': '5', 
        'PoW Tar': '28.74%', 
        'Week 1': '9', 
        'Week 2': '16', 
        'Week 3': '\xa0', 
        'Week 4': '\xa0', 
        'Yards': '102', 
        'YPT': '4.08', 
        'Rec': '9', 
        'Rec Rate': '36.00%'}
     ) ... ])

... which means you can do something like:

>>> players['DeAndre Hopkins']
{'Salary': '$6200', 'Pos': 'WR' ... }
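Note that every value comes back as a string, including '$6200', '48.33%' and the '\xa0' (a non-breaking space) for weeks that have no data yet. If you want numbers you can compute with, something along these lines might work. This is only a sketch: clean_value is a hypothetical helper, and turning blank cells into None is one possible choice among several.

```python
import re

# Plain integers or decimals only, so text such as 'WR' is never
# mistaken for a number.
_NUMBER = re.compile(r'-?\d+(?:\.\d+)?')


def clean_value(raw):
    """Convert one scraped cell to int or float where possible.

    Hypothetical helper; mapping blank cells to None is an assumption.
    """
    # In Python 3, str.strip() also removes non-breaking spaces ('\xa0').
    text = raw.strip()
    if not text:
        return None  # blank cell, e.g. a week with no data yet
    number = text.lstrip('$').rstrip('%').replace(',', '')
    if _NUMBER.fullmatch(number):
        return float(number) if '.' in number else int(number)
    return text  # non-numeric text such as 'WR' or 'HOU'


stats = {'Salary': '$6200', 'Pos': 'WR', 'PoW Tar': '48.33%', 'Week 3': '\xa0'}
cleaned = {key: clean_value(val) for key, val in stats.items()}
print(cleaned)
# {'Salary': 6200, 'Pos': 'WR', 'PoW Tar': 48.33, 'Week 3': None}
```

The '$' and '%' signs are simply dropped here, so it is up to you to remember which columns are dollar amounts and which are percentages.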

Ta-da!
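If the 250 and 249 in the extraction step look like magic numbers: the code treats the scraped list as one column after another, each being one header followed by 249 values. Here is the same reshaping on a toy list with one header and two values per column, using only the standard library. The names column_len, columns and record are mine, and the sample values are taken from the output above.

```python
# Toy stand-in for the scraped `text` list: three columns, each flattened as
# one header followed by two values (the original code assumes 1 + 249).
text = ['Player', 'DeAndre Hopkins', 'Dez Bryant',
        'Pos', 'WR', 'WR',
        'Targets', '29', '25']
column_len = 3  # header + values per column; 250 in the original code

# Cut the flat list into columns, then separate each header from its values.
columns = [text[i:i + column_len] for i in range(0, len(text), column_len)]
keys = [col[0] for col in columns]      # ['Player', 'Pos', 'Targets']
values = [col[1:] for col in columns]

# zip(*values) transposes the columns into rows, one tuple per player.
players = {}
for row in zip(*values):
    record = dict(zip(keys, row))
    name = record.pop('Player')
    players[name] = record

print(players)
# {'DeAndre Hopkins': {'Pos': 'WR', 'Targets': '29'},
#  'Dez Bryant': {'Pos': 'WR', 'Targets': '25'}}
```

Splitting by position like this, instead of filtering out anything that matches a header, also avoids accidentally dropping a cell whose text happens to equal a header name.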