Chiao Yun Chen - 1 year ago 77
Python Question

My Python Scrapy spider cannot scrape the "keyword" content

I cannot scrape the "keyword" content.
I've tried many methods, but they all failed.

I've successfully retrieved the other fields, but I still cannot get the "keyword" content.

Can anyone help me fix this bug?
The keyword content is located at the CSS selector "#keyword_table a",
or the XPath "//*[@id="keyword_table"]/tbody/tr/td[2]/a".

[Image: picture of the keyword content]
Thanks for your help!

import scrapy
from bs4 import BeautifulSoup
from digitimes.items import DigitimesItem

class digitimesCrawler(scrapy.Spider):
    name = 'digitimes'
    start_urls = [""]

    def parse(self, response):
        soup = BeautifulSoup(response.body,'html.parser')
        soupXml = BeautifulSoup(response.body, "lxml")
        simpleList = []

        item = DigitimesItem()

        timeSel = soup.select('.insubject .small')
        tmpTime = timeSel[0].text
        time = tmpTime[:10]
        item['time'] = time  # done processing the time

        titleSel = soup.select('title')
        title = titleSel[0].text
        item['title'] = title  # done processing the title

        #================== To Resolve ==================

        for k in soup.select('#keyword_table a'):
            for key in k:
                keywordOutput = keywordOutput + key + " "
                item['keyword'] = keywordOutput

        #================== To Resolve ==================

        for m in soup.select('#sitemaptable tr td a'):
            for cate in m:
                categoryOutput = categoryOutput + cate + " "
                item['cate'] = categoryOutput

        return simpleList

Answer Source

Is there any particular reason you are using BeautifulSoup instead of Scrapy selectors? The response object your parse method receives already acts as a Scrapy selector, so it supports both XPath and CSS selection directly.

There seem to be three keywords in the table. You can select them with either XPath or CSS selectors:

response.css("#keyword_table a::text").extract()
# or with xpath
response.xpath("//*[@id='keyword_table']//a/text()").extract()
# both return
# [u'Sony', u'\u5f71\u50cf\u611f\u6e2c\u5668', u'\u80a1\u7968\u4ea4\u6613']