James Flanagin - 3 years ago
Python Question

How do I web scrape the sub-headers from this link?

I've made a web scraper that scrapes the tables from pages that look like this one: https://www.techpowerup.com/gpudb/2/

The problem is that my program, for some reason, only scrapes the values and not the subheaders. For instance, on the linked page it scrapes "R420", "130nm", "160 million", etc., but not "GPU Name", "Process Size", "Transistors", etc.

What do I add to the code to get it to scrape the subheaders? Here's my code:

import csv
import requests
import bs4

url = "https://www.techpowerup.com/gpudb/2"

# obtain HTML and parse through it
response = requests.get(url)
html = response.content
soup = bs4.BeautifulSoup(html, "lxml")
tables = soup.findAll("table")

# reading every value in every row in each table and making a matrix
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):  # only <td> cells are collected here
            text = cell.text.replace(' ', '')  # strip spaces from the cell text
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))

# (YOU CAN PROBABLY IGNORE THIS) placeHolder used to avoid duplicate data from appearing in list
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
    placeHolder = 0

for value in excelTable:
    print(value)
    print('\n')

# create excel file and write the values into a csv
# NOTE: count is not defined in this snippet; presumably it is set elsewhere in the full program
fl = open(str(count) + '.csv', 'w')
writer = csv.writer(fl)
for values in excelTable:
    writer.writerow(values)
fl.close()

Answer Source

If you check the page source, those cells are header cells, so they use TH tags rather than TD tags. You may want to update your loop to include TH cells alongside TD cells.
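
One way to apply this is sketched below. It is only an illustration: the sample HTML and the table_rows helper are made up here and are not taken from the real page, and it uses the built-in "html.parser" backend so that it runs without lxml. It relies on Beautiful Soup's find_all accepting a list of tag names, which returns TH and TD cells together in document order.

```python
import bs4

# Hypothetical sample markup imitating one spec table:
# the label sits in a <th> cell and the value in a <td> cell.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130 nm</td></tr>
  <tr><th>Transistors</th><td>160 million</td></tr>
</table>
"""


def table_rows(html):
    """Return one list of cell texts per <tr>, header cells included."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # A list of tag names matches both <th> and <td>, in document order.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the code from the question, the equivalent change would be replacing row.findAll('td') with row.findAll(['th', 'td']), assuming the real page keeps each label and its value in the same row.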
