NFB NFB - 1 month ago 8
HTML Question

Extracting HTML results using XPath fails in Scrapy because content is loaded dynamically

Related to but different from a previous question of mine, Extracting p within h1 with Python/Scrapy, I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.

Example HTML is:

<div class="event-specifics">
<div class="event-location">
<h3> Gourmet Matinee </h3>
<h4>
<span id="spanEventDetailPerformanceLocation">Knight Grove</span>
</h4>
</div>
</div>


I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,

response.xpath('.//div[@class="event-location"]//span//text()').extract()


returns:

['Knight Grove']


And

response.xpath('.//div[@class="event-location"]/node()')


returns the entire node, viz:

['\n ', '<h3>\n Gourmet Matinee</h3>', '\n ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n ']


BUT, when the same XPath is run within a spider, nothing is returned. Take for instance the following spider code, written to scrape the page from which the above sample HTML was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/. (Some of the code is removed since it doesn't relate to the question):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # needed by the Rule below
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy import Selector
from scrapy.http import XmlResponse


class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']

    start_urls = ['https://www.clevelandorchestra.com/']

    # Follow every link on the site and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')

        return thisconcert.load_item()


This returns no item['location']. I've also tried:

thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')


Unlike in the question above regarding p within h, span tags are permitted within h tags in HTML, unless I am mistaken?

For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.

Is it possible that span within h4 is in some way invalid HTML? If not, what could be causing this?

Interestingly, going about the same task using add_css(), like this:

thisconcert.add_css('location','.event-location')


yields a node with the span tags present but the internal text missing:

['<div class="event-location">\r\n'
' <h3>\r\n'
' BLOSSOM MUSIC FESTIVAL </h3>\r\n'
' <h4><span '
'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
' </div>']


To confirm this is not a duplicate: It is true on this particular example there is a p tag inside of a span tag which is inside of the h4 tag; however, the same behavior occurs when there is no p tag involved, such as at: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195.
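
The add_css() output above already hints at the cause: in the HTML as downloaded, the span is present but empty, so there is no text node for an XPath to match. The following is a minimal sketch of that effect, not the spider's actual code. It assumes Python's standard-library ElementTree as a stand-in for Scrapy's selectors (so it runs without Scrapy installed), and the two markup strings are hand-built from the snippets quoted in the question.

```python
import xml.etree.ElementTree as ET

# Markup as the server delivers it (modelled on the add_css() output):
# the span exists but carries no text.
STATIC_HTML = """
<div class="event-specifics">
  <div class="event-location">
    <h3>BLOSSOM MUSIC FESTIVAL</h3>
    <h4><span id="spanEventDetailPerformanceLocation"></span></h4>
  </div>
</div>
"""

# Markup after the browser's JavaScript has filled the span in
# (the text is the one from the question's example HTML).
RENDERED_HTML = STATIC_HTML.replace(
    '"spanEventDetailPerformanceLocation"></span>',
    '"spanEventDetailPerformanceLocation">Knight Grove</span>',
)


def span_texts(markup):
    """Return the non-empty text of every span under div.event-location."""
    root = ET.fromstring(markup.strip())
    spans = root.findall(".//div[@class='event-location']//span")
    return [s.text.strip() for s in spans if s.text and s.text.strip()]


print(span_texts(STATIC_HTML))    # [] : nothing to extract from the raw page
print(span_texts(RENDERED_HTML))  # ['Knight Grove'] : only after scripts have run
```

The same selector succeeds or fails depending only on whether the text has been filled in, which suggests the XPath itself is not the problem.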

Answer Source

This content is loaded via an Ajax call. To get the data, you need to make a similar POST request, and don't forget to add a content-type header: headers = {'content-type': "application/json"}. The response comes back as JSON.

import requests

url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])

# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
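
If the request is to be issued from the spider rather than with requests, the body and the parsing can be kept in two small helpers. This is a rough sketch using only the standard library: the URL and the field names (d, performanceName, dateString) come from the snippet above, while the helper names build_body and parse_performances and the SAMPLE_RESPONSE string are hypothetical, written here for illustration.

```python
import json

# Endpoint and header as used in the answer's requests example.
URL = ("https://www.clevelandorchestra.com/Services/"
       "PerformanceService.asmx/GetToolTipPerformancesForCalendar")
HEADERS = {"Content-Type": "application/json"}


def build_body(start_date, end_date):
    """Serialize the date range into the JSON body the service expects."""
    return json.dumps({"startDate": start_date, "endDate": end_date})


def parse_performances(response_text):
    """Return (name, date) pairs from the 'd' list of the JSON response."""
    data = json.loads(response_text)
    return [(p["performanceName"], p["dateString"]) for p in data.get("d", [])]


# Hand-written sample in the shape shown above, not a captured response.
SAMPLE_RESPONSE = json.dumps({
    "d": [
        {"performanceName": "Star-Spangled Spectacular",
         "dateString": "Friday, June 30, 2017"},
    ]
})

print(build_body("2017-06-30T21:00:00.000Z", "2017-12-31T21:00:00.000Z"))
print(parse_performances(SAMPLE_RESPONSE))
# [('Star-Spangled Spectacular', 'Friday, June 30, 2017')]
```

In a spider, the body could then be sent with scrapy.Request(URL, method="POST", body=build_body(...), headers=HEADERS, callback=...), and the callback could pass response.text to parse_performances. Whether the service accepts requests without further cookies or headers is not confirmed here.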