C2H50H C2H50H - 4 months ago 15
JSON Question

Python - Selecting Specific Results to Place in Excel

EDIT: I'm making a lot of head way, I am now trying to parse out single columns from my JSON file, not the entire row. I am getting an error however whenever I try to manipulate my DataFrame to get the results I want.

The error is:
line 52, in
df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['name_display_first_last'])
KeyError: 'name_display_first_last'

It only happens when I try to add another parameter, for instance i took out ['row'] and added ['name_display_first_last'] to get the first and last name of each player. If I leave in ['row'] it compiles, but gives me all the data, I only want certain snippets.

Any help would be greatly appreciated! Thanks.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Scraping data from MLB.com

target = [MLB JSON][1]
mlbResponse = requests.get(target)
mlbJson = mlbResponse.json()
# Placing response in variable


# Collecting data and giving it names in pandas
data = {'Team': team, 'Line': line}
# places data table format, frames the data
table = pd.DataFrame(data)
# Creates excel file named Scrape
writer = pd.ExcelWriter('Scrape.xlsx')
# Moves table to excel taking in Parameters , 'Name of DOC and Sheet on that Doc'
table.to_excel(writer, 'Lines 1')


#stats = {'Name': name, 'Games': games, 'AtBats': ab, 'Runs': runs, 'Hits': hits, 'Doubles': doubles, 'Triples': triples, 'HR': hr, 'RBI': rbi, 'Walks': walks, 'SB': sb}
df = pd.DataFrame.from_dict(mlbJson['stats_sortable_player']['queryResults']['row'])
df.to_excel(writer, 'Batting 2')

# Saves File

writer.save()

Answer

It looks like the website loads the data asynchronously through another request to a different URL. The response you're getting has empty <datagrid><\datagrid> tag, and soup2.select_one("#datagrid").find_next("table") returns None.

You can use the developer tools in your browser under the network tab to find the URL to actually load data, it looks like :

http://mlb.mlb.com/pubajax/wf/flow/stats.splayer?season=2016&sort_order=%27desc%27&sort_column=%27avg%27&stat_type=hitting&page_type=SortablePlayer&game_type=%27R%27&player_pool=ALL&season_type=ANY&sport_code=%27mlb%27&results=1000&recSP=1&recPP=50

You can modify your code to make a request to this URL, which returns json

mlbResponse = requests.get(url)
mlbJson = mlbResponse.json() # python 3, otherwise use json.loads(mlbResponse.content) 
df = pd.DataFrame(doc['stats_sortable_player']['queryResults']['row'])

The DataFrame has 54 columns, so I can't display it here, but you should be able to pick and rename the columns you need.