Ambushes - 4 months ago
Python Question

How do I scrape using Python and BeautifulSoup - Dealing with a Table using JavaScript

I'm trying to learn how to scrape info using Python, and unfortunately I'm having a lot of trouble here. The issue I'm dealing with is that the information I want doesn't seem to be contained in the page source; it only appears after you check one of the boxes.

The url in question is: http://www.ncix.com/openbox/

For example, I want all the information that appears on the page after you check "Video Cards (20)" under "Categories". When I look at the page source, there appears to be a function called submitformfilter() that looks like this:

function submitformfilter()
{
    var querystring = "dofilter=1";
    $("input:checkbox:checked").each(function()
    {
        querystring = querystring + '&'+$(this).attr("name")+'='+$(this).val()
    }
    );
    if($("#promokw").val() !="")
    {
        querystring = querystring+'&promokw='+ $("#promokw").val();
    }
    $.getJSON("http://www.ncix.com/promo/openboxfilter.cfm?jsoncallback=?&"+querystring);
}
function dosearch()
{
    if($("#promokw").val() =="")
    {
        alert("Please enter the keyword.");
        return false;
    }
    submitformfilter();
    return false;
}


I have no idea how to parse the data I want in this case. Any help would be appreciated.

Answer

You need to post the form data yourself, in particular the minorcatid that corresponds to the Video Cards category:

import requests
from bs4 import BeautifulSoup

data = {"dofilter": "1",
        "minorcatid": ""}

# not necessarily essential, but it is good to at least add a user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"}

with requests.Session() as s:
    # use bs4 to get the minorcatid programmatically
    soup = BeautifulSoup(s.get("http://www.ncix.com/openbox/").content, "lxml")
    _id = soup.select_one("img[alt^=Video]")["src"].rsplit("/", 1)[1][:-4]
    data["minorcatid"] = _id
    resp = s.post("http://www.ncix.com/promo/openboxfilter.cfm", data=data, headers=headers).text

    print(resp)

This gives you the data returned by the callback. The browser sends more form data, but we get away with passing just the id.
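To make the link to submitformfilter() explicit, here is a minimal sketch of how the same fields could be assembled in Python. The helper name build_filter_data is hypothetical (it is not part of the site or of requests), and it assumes the server accepts the same name/value pairs whether they arrive in the query string or in a POST body, which is what the code above relies on.

```python
def build_filter_data(checked_boxes, keyword=""):
    """Mirror the fields that the page's submitformfilter() puts in its query string.

    checked_boxes is a list of (name, value) pairs, one per checked checkbox.
    This helper is a hypothetical illustration, not part of any library.
    """
    # The JavaScript always starts the query string with dofilter=1.
    data = [("dofilter", "1")]
    # Each checked checkbox contributes one name=value pair.
    data.extend(checked_boxes)
    # The keyword is only appended when the search field is non-empty.
    if keyword:
        data.append(("promokw", keyword))
    return data


form = build_filter_data([("minorcatid", "108")])
print(form)
```

A list of tuples like this can be passed to requests as data=, just like the dict used above, and it keeps working if several checked boxes share the same name.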

You can see exactly what happens in Chrome's developer tools when you check the box:

[screenshot: the request in Chrome developer tools]

The response is the same as the one you see in the dev console:

[screenshot: the response in the dev console]

We pull the id from the img tag relating to Video Cards by parsing the src attribute from the original page:

<img src="http://img.ncix.com/categoryimages/108.jpg" width="110" height="55" title="Video Cards" alt="Video Cards">
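The slice [:-4] used above assumes a three-letter extension such as .jpg. As a slightly more defensive sketch, the id could be taken from the file name with the standard library instead; category_id_from_src is a hypothetical helper name, and the only input assumed is the src value shown in the tag above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def category_id_from_src(src):
    """Return the image file name without its extension, e.g. '108' for '.../108.jpg'."""
    # urlparse isolates the path so a query string or fragment cannot leak into the id.
    path = urlparse(src).path
    # .stem drops the final extension whatever its length (.jpg, .jpeg, .png, ...).
    return PurePosixPath(path).stem


print(category_id_from_src("http://img.ncix.com/categoryimages/108.jpg"))
```

With the src from the tag above this prints 108, the same value the rsplit and slice version produces.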

The data is escaped when we get it, so to get it into a nicer format we can decode it using unicode_escape. After that we can parse the table and get whatever you want, for example the anchor tags that contain the link to each item and its description:

In [6]: with requests.Session() as s:
   ...:         soup = BeautifulSoup(s.get("http://www.ncix.com/openbox/").content, "lxml")
   ...:         _id = soup.select_one("img[alt^=Video]")["src"].rsplit("/", 1)[1][:-4]
   ...:         print(_id)
   ...:         data["minorcatid"] = _id
   ...:         resp = s.post("http://www.ncix.com/promo/openboxfilter.cfm", data=data, headers=headers).content.decode("unicode_escape")
   ...:         soup2 = BeautifulSoup(resp.replace(r'\""', ''), "lxml")
   ...:         for tr in soup2.select("table tr"):
   ...:                 print(tr.select_one("div a[title^=SKU]"))
   ...:


108
<a href="http://www.ncix.com/detail/gigabyte-radeon-r9-fury-x-42-110438.htm#openbox" title="SKU: 110438
Mfr part #: GV%2DR9FURYX%2D4GD%2DB">GIGABYTE Radeon R9 Fury X 1050MHZ 4GB 1GHz HBM HDMI DISPLAYPORTX3 PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/gigabyte-geforce-gtx-970-oc-6d-102012.htm#openbox" title="SKU: 102012
Mfr part #: GV%2DN970WF3OC%2D4GD">GIGABYTE GeForce GTX 970 OC 1253 MHz 4GB 7.0GHZ GDDR5 2xDVI HDMI DisplayPort PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-x1950-pro-dual-96-24458.htm#openbox" title="SKU: 24458
Mfr part #: 11095%2D08%2D40R">Sapphire Radeon X1950 Pro Dual 580MHZ 1GB 1.4GHZ GDDR3 PCI-E 2XDVI-I TV Out Dual GPU Video Card</a>
<a href="http://www.ncix.com/detail/zotac-geforce-gt-730-zone-54-129574.htm#openbox" title="SKU: 129574
Mfr part #: ZT%2D71114%2D20L">Zotac GeForce GT 730 Zone Edition 1GB 902MHZ 1600MHz DDR3 DirectX 12 DVI + HDMI + VGA Video Card</a>
<a href="http://www.ncix.com/detail/club3d-radeon-r9-285-royal-43-101460.htm#openbox" title="SKU: 101460
Mfr part #: CGAX%2DR92856">CLUB3D Radeon R9 285 Royal Queen 945MHZ 2GB 5.5GHZ GDDR5 2xDVI HDMI DisplayPort PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/soltek-refurbished-nvidia-fx5600-agp8x-c6-11182.htm#openbox" title="SKU: 11182
Mfr part #: SL%2D5600%2DXD%2DR">SOLTEK REFURBISHED nVIDIA FX5600 AGP8X /128BIT/128MB DDR *30DAY WARRANTY* CARD ONLY</a>
<a href="http://www.ncix.com/detail/chaintech-apogee-geforce-6800-gt-ad-12556.htm#openbox" title="SKU: 12556
Mfr part #: AA6800G">Chaintech APOGEE GeForce 6800 GT 256MB DDR3 AGP8X VGA DVI-I TV Out Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-x800-gto2-256mb-1f-17166.htm#openbox" title="SKU: 17166
Mfr part #: 102%2DA47466%2D11%2DAT%20%2821067%2D01%2D20%29">Sapphire Radeon X800 GTO2 256MB GDDR3 PCI-E Dual DVI VIVO OEM Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-x1600-pro-advantage-de-21813.htm#openbox" title="SKU: 21813
Mfr part #: 88%2D8C87%2D11%2DSA">Sapphire Radeon X1600 Pro Advantage PCI-E 256MB DDR VGA DVI-I TV Out Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-hd-5850-725mhz-96-50790.htm#openbox" title="SKU: 50790
Mfr part #: 21162%2D00%2D40R">Sapphire Radeon HD 5850 725MHZ 1GB 4.0GHZ GDDR5 PCI-E Display Port 2XDVI HDMI DirectX 11 Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-hd-6570-650mhz-1b-82646.htm#openbox" title="SKU: 82646
Mfr part #: 11191%2D03%2D20G">Sapphire Radeon HD 6570 650MHZ 512MB 4GHZ GDDR5 DVI HDMI PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/sapphire-radeon-r9-fury-core-93-122143.htm#openbox" title="SKU: 122143
Mfr part #: 11247%2D03%2D40G">Sapphire Radeon R9 Fury Core 1050MHZ 4G HBM PCI-E HDMI/DVI-D Triple DP TRI-X OC+ (UEFI) Graphic Card</a>
<a href="http://www.ncix.com/detail/bfg-geforce-7600gs-oc-420mhz-cc-18204.htm#openbox" title="SKU: 18204
Mfr part #: BFGR76256GSOCE">BFG GeForce 7600GS OC 420MHZ PCI-E 256MB 800MHZ DDR2 VGA DVI-I HDTV Out Video Card</a>
<a href="http://www.ncix.com/detail/xfx-geforce-7800gtx-450mhz-256mb-f4-15636.htm#openbox" title="SKU: 15636
Mfr part #: PV%2DT70F%2DUNF7">XFX GeForce 7800GTX 450MHZ 256MB 256BIT 1.25GHZ DDR3 PCI-E Dual DVI-I TV-OUT Video Card</a>
<a href="http://www.ncix.com/detail/xfx-geforce-7600-gs-400mhz-6c-21791.htm#openbox" title="SKU: 21791
Mfr part #: PVT73PYDJ3">XFX GeForce 7600 GS 400MHZ PCI-E 512MB 128BIT 533MHZ DDR2 VGA DVI-I HDTV Out Video Card</a>
<a href="http://www.ncix.com/detail/xfx-radeon-r7-260x-dual-74-95523.htm#openbox" title="SKU: 95523
Mfr part #: R7%2D260X%2DCDF4">XFX Radeon R7 260X Dual Fan OC 1.1GHZ 2GB GDDR5 2xDVI HDMI DisplayPort PCI-E Video Card R7-260X-CDF4</a>
<a href="http://www.ncix.com/detail/evga-geforce-gtx-470-superclocked-8c-53391.htm#openbox" title="SKU: 53391
Mfr part #: 012%2DP3%2D1475%2DAR">EVGA GeForce GTX 470 SUPERCLOCKED+ 625MHZ Fermi 1280MB 3.4GHZ GDDR5 2XDVI Mini-HDMI PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/evga-e-geforce-8600gt-540mhz-256mb-31-23715.htm#openbox" title="SKU: 23715
Mfr part #: 256%2DP2%2DN751%2DTR">EVGA E-GEFORCE 8600GT 540MHZ 256MB 1.4GHZ GDDR3 PCI-E Dual DVI-I HDTV Out DIRECTX10 Video Card</a>
<a href="http://www.ncix.com/detail/evga-e-geforce-8600gt-540mhz-512mb-2e-26168.htm#openbox" title="SKU: 26168
Mfr part #: 512%2DP2%2DN756%2DTR">EVGA E-GEFORCE 8600GT 540MHZ 512MB 800HZ DDR2 PCI-E VGA DVI-I HDTV Out DIRECTX10 Video Card</a>
<a href="http://www.ncix.com/detail/evga-e-geforce-7600-gt-co-13-17949.htm#openbox" title="SKU: 17949
Mfr part #: 256%2DP2%2DN555">EVGA E-GEFORCE 7600 GT CO Superclocked 580MHZ PCI-E 256MB 1.5GHZ GDDR3 Dual DVI HDTV Out Video Card</a>
<a href="http://www.ncix.com/detail/evga-geforce-gtx-980-4gb-5e-102000.htm#openbox" title="SKU: 102000
Mfr part #: 04G%2DP4%2D2982%2DKR">EVGA GeForce GTX 980 4GB Super Clocked GAMING Silent Cooling 1241MHZ Boost 1342MHZ Graphics Card</a>
<a href="http://www.ncix.com/detail/gigabyte-geforce-gtx-960-g1-73-108014.htm#openbox" title="SKU: 108014
Mfr part #: GV%2DN960G1%20GAMING%2D4GD">GIGABYTE GeForce GTX 960 G1 1307MHZ 4GB 7.0GHZ GDDR5 DVI HDMI 3xDisplayPort PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/powercolor-radeon-hd-7870-pcs-70-78372.htm#openbox" title="SKU: 78372
Mfr part #: AX7870%202GBD5%2D2DHPPV3E">Powercolor Radeon HD 7870 PCS+ MYST.(TAHITI LE) 2GB 6Gbps GDDR5 DVI HDMI 2XMINIDP PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/powercolor-radeon-hd-3850-pcs-f8-29730.htm#openbox" title="SKU: 29730
Mfr part #: AG3850%20512MD3%2DP">Powercolor Radeon HD 3850 PCs 668MHZ 512MB 1.65GHZ GDDR3 AGP 2XDVI HDTV Out Video Card</a>
<a href="http://www.ncix.com/detail/msi-geforce-gtx-580-twin-79-58685.htm#openbox" title="SKU: 58685
Mfr part #: N580GTX%20Twin%20Frozr%20II%2FOC">MSI GeForce GTX 580 Twin Frozr II OC 800MHZ 1536MB GDDR5 2xDVI Mini-HDMI PCI-E DirectX 11 Video Card</a>
<a href="http://www.ncix.com/detail/gigabyte-radeon-hd-rx3870-775mhz-12-27208.htm#openbox" title="SKU: 27208
Mfr part #: GV%2DRX387512H%2DB">GIGABYTE Radeon HD RX3870 775MHZ 512MB 2.4GHZ GDDR4 Dual DVI-I HDCP HDTV Out PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/gigabyte-radeon-hd-r7-240-bc-90927.htm#openbox" title="SKU: 90927
Mfr part #: GV%2DR724OC%2D2GI%20REV2%2E0">GIGABYTE Radeon HD R7 240 OC 900MHZ 2GB 1.8GHZ GDDR3 DVI HDMI VGA PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/gigabyte-geforce-gtx-980-ti-ff-121258.htm#openbox" title="SKU: 121258
Mfr part #: GV%2DN98TXTREME%20W%2D6GD">GIGABYTE GeForce GTX 980 Ti Xtreme Waterforce 1317MHZ 6GB 7.2GHZ GDDR5 HDMI/3XDPORT/PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/ati-radeon-x1900xtx-650mhz-512mb-75-17528.htm#openbox" title="SKU: 17528
Mfr part #: 100%2D435805">ATI Radeon X1900XTX 650MHZ 512MB 256BIT 1.55GHZ GDDR3 PCI-E Dual DVI-I VIVO HDTV Video Card</a>
<a href="http://www.ncix.com/detail/asus-geforce-8800gtx-575mhz-768mb-d2-21403.htm#openbox" title="SKU: 21403
Mfr part #: EN8800GTX%2FHTDP%2F768M">ASUS GeForce 8800GTX 575MHZ 768MB 1.8GHZ GDDR3 Dual DVI-I HDTV Out DIRECTX10 Video Card</a>
<a href="http://www.ncix.com/detail/asus-geforce-gtx-750-oc-84-94415.htm#openbox" title="SKU: 94415
Mfr part #: GTX750%2DPHOC%2D1GD5">ASUS GeForce GTX 750 OC 1GB GDDR5 PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/asus-geforce-gtx-550-ti-25-59650.htm#openbox" title="SKU: 59650
Mfr part #: ENGTX550%20TI%20DC%20TOP%2FDI%2F1GD5">ASUS GeForce GTX 550 Ti DC Top 975MHZ 1GB 4.1GHZ GDDR5 DVI HDMI VGA PCI-E Video Card</a>
<a href="http://www.ncix.com/detail/asus-geforce-gt-520-700mhz-54-70010.htm#openbox" title="SKU: 70010
Mfr part #: ENGT520SL%2FDI%2F2GD3%28LP%29">ASUS GeForce GT 520 700MHZ 2GB 1.2GHZ DDR3 Low Profile DVI HDMI PCI-E DirectX 11 Video Card</a>
<a href="http://www.ncix.com/detail/asus-geforce-gtx-980-ti-1a-111058.htm#openbox" title="SKU: 111058
Mfr part #: STRIX%2DGTX980TI%2DDC3OC%2D6GD5%2DGAMING">ASUS GeForce GTX 980 Ti Strix 1317MHZ 6GB 7.2GHZ GDDR5 DVI HDMI 3XDISPLAYPORT PCI-E Video Card</a>
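If you want the SKU and manufacturer part number as separate values, the title attribute of each anchor can be split further. This is only a sketch: parse_title is a hypothetical helper, and it assumes the title keeps the two-line "Label: value" layout and the percent-encoding visible in the output above.

```python
from urllib.parse import unquote


def parse_title(title):
    """Split a title made of 'Label: value' lines into a dict of decoded values."""
    fields = {}
    for line in title.splitlines():
        # partition splits on the first colon only, so colons in the value survive.
        label, sep, value = line.partition(":")
        if sep:
            # The part number is percent-encoded in the page (%2D is a hyphen).
            fields[label.strip()] = unquote(value.strip())
    return fields


title = "SKU: 110438\nMfr part #: GV%2DR9FURYX%2D4GD%2DB"
print(parse_title(title))
# {'SKU': '110438', 'Mfr part #': 'GV-R9FURYX-4GD-B'}
```

In the loop above you would call it as parse_title(a["title"]) for each anchor a that is not None.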

All that data contains is the form data we need to post. You can see what requests does with it by looking at the body of the request:

In [8]: with requests.Session() as s:
   ...:         soup = BeautifulSoup(s.get("http://www.ncix.com/openbox/").content, "lxml")
   ...:         _id = soup.select_one("img[alt^=Video]")["src"].rsplit("/", 1)[1][:-4]
   ...:         data["minorcatid"] = _id
   ...:         resp = s.post("http://www.ncix.com/promo/openboxfilter.cfm", data=data, headers=headers)
   ...:         req = resp.request
   ...:         print(req.body)
   ...:     

dofilter=1&minorcatid=108

If anything needs to be encoded, requests will take care of that for you.
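To get a feel for what that encoding looks like without sending a request, the standard library's urlencode produces a comparable form-encoded string. This is an illustration only, not a view into requests itself, and the keyword value below is made up for the example.

```python
from urllib.parse import urlencode

# The same two fields that were posted above.
data = {"dofilter": "1", "minorcatid": "108"}
body = urlencode(data)
print(body)  # dofilter=1&minorcatid=108

# A value with a space and an ampersand must be escaped so it cannot be
# mistaken for a field separator; "promokw" is the keyword field from the page.
encoded = urlencode({"dofilter": "1", "promokw": "video card & fan"})
print(encoded)  # dofilter=1&promokw=video+card+%26+fan
```

The first line matches the req.body output shown above; the second shows why hand-concatenating values, as the page's JavaScript does, is best avoided when a value may contain special characters.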
