NFB NFB - 1 year ago 43
HTML Question

Extracting HTML results using XPath fails in Scrapy because content is loaded dynamically

Related to but different from a previous question of mine, Extracting p within h1 with Python/Scrapy, I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.

Example HTML is:

<div class="event-specifics">
<div class="event-location">
<h3> Gourmet Matinee </h3>
<span id="spanEventDetailPerformanceLocation">Knight Grove</span>

I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,



['Knight Grove']



returns the entire node, viz:

['\n ', '<h3>\n Gourmet Matinee</h3>', '\n ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n ']

But when the same XPath is run within a spider, nothing is returned. Take, for instance, the following spider code, written to scrape the page from which the above sample HTML was taken (some of the code is removed since it doesn't relate to the question):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy.contrib.loader import XPathItemLoader
from scrapy import Selector
from scrapy.http import XmlResponse

class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['']

    start_urls = ['']

    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            pass  # the extraction lines are missing from the source

        return thisconcert.load_item()

This returns no item['location']. I've also tried:


Unlike in the question above regarding p within h, span tags are permitted within h tags in HTML, unless I am mistaken?

For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.

Is it possible that span within h4 is in some way invalid HTML? If not, what could be causing this?

Interestingly, going about the same task using add_css(), like this:


yields a node with the span tags present but the internal text missing:

['<div class="event-location">\r\n'
' <h3>\r\n'
' <h4><span '
' </div>']

To confirm this is not a duplicate: It is true on this particular example there is a p tag inside of a span tag which is inside of the h4 tag; however, the same behavior occurs when there is no p tag involved, such as at:

Answer Source

This content is loaded via an Ajax call. To get the data, you need to make a similar POST request, and don't forget to add a header with the content type: headers = {'content-type': "application/json"}. You get a JSON file in the response.

import requests

url = ""
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])

# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
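
Since the question is about a Scrapy spider, the same POST could probably be issued from inside the spider rather than with requests. The following is only a sketch under stated assumptions: `build_post_kwargs` and `parse_performances` are hypothetical helper names, the endpoint URL is not given here, and the only response fields assumed are `d`, `performanceName` and `dateString`, as seen in the example above. The helpers use the standard library only, so they can be tried without a running spider.

```python
import json


def build_post_kwargs(start_date, end_date):
    """Return keyword arguments describing the JSON POST request.

    The payload keys mirror the example above; the date strings are
    passed through unchanged.
    """
    payload = {"startDate": start_date, "endDate": end_date}
    return {
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def parse_performances(body_text):
    """Extract (name, date) pairs from the JSON body of the response.

    Assumes the list of performances sits under the key "d"; a body
    without that key yields an empty list.
    """
    data = json.loads(body_text)
    return [(p["performanceName"], p["dateString"]) for p in data.get("d", [])]


# Hand-written body in the assumed shape, using one entry from the output above.
sample = json.dumps(
    {"d": [{"performanceName": "Star-Spangled Spectacular",
            "dateString": "Friday, June 30, 2017"}]}
)
print(parse_performances(sample))
```

In a spider, these keyword arguments could be passed to `scrapy.Request(url, callback=self.parse_api, **build_post_kwargs(...))`, and the callback (here hypothetically named `parse_api`) could call `parse_performances(response.text)`. Whether the endpoint needs additional headers or cookies is not known from the example.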