Chiao Yun Chen - 4 months ago
Python Question

My Python Scrapy spider cannot scrape the "keyword" content

I cannot scrape the "keyword" content. I've tried many methods but still failed.

I've successfully retrieved the other contents, but I still fail to get the "keyword" content.

Can anyone help fix this bug?
The keyword content is located at the CSS selector "#keyword_table a",
or at the XPath "//*[@id='keyword_table']/tbody/tr/td[2]/a".

[picture of the keyword content]
Thanks for your help!

import scrapy
from bs4 import BeautifulSoup
from digitimes.items import DigitimesItem


class digitimesCrawler(scrapy.Spider):
    name = 'digitimes'
    start_urls = ["http://www.digitimes.com.tw/tw/dt/n/shwnws.asp?id=435000"]

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')
        soupXml = BeautifulSoup(response.body, "lxml")
        simpleList = []

        item = DigitimesItem()

        timeSel = soup.select('.insubject .small')
        tmpTime = timeSel[0].text
        time = tmpTime[:10]
        item['time'] = time  # finished processing the time
        print(time)

        titleSel = soup.select('title')
        title = titleSel[0].text
        item['title'] = title  # finished processing the title
        print(title)

        # ================== To Resolve ==================

        keywordOutput = ""
        for k in soupXml.select('#keyword_table a'):
            for key in k:
                keywordOutput = keywordOutput + key + " "
        item['keyword'] = keywordOutput
        print(keywordOutput)

        # ================== To Resolve ==================

        categoryOutput = ""
        for m in soup.select('#sitemaptable tr td a'):
            for cate in m:
                if cate != "DIGITIMES":
                    categoryOutput = categoryOutput + cate + " "
        item['cate'] = categoryOutput
        print(categoryOutput)

        simpleList.append(item)
        return simpleList

Answer

Is there any particular reason you are using BeautifulSoup over Scrapy's own selectors? The response your method receives already acts as a Scrapy selector, which supports both XPath and CSS selections.

There seem to be 3 keywords in the table. You can select them with either XPath or CSS selectors:

response.css("#keyword_table a::text").extract()
# or with xpath
response.xpath("//*[@id='keyword_table']//a/text()").extract()
# both return
>>> [u'Sony', u'\u5f71\u50cf\u611f\u6e2c\u5668', u'\u80a1\u7968\u4ea4\u6613']