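Recently I have been struggling with Scrapy's dont_filter, because the spiders I write often stop partway through when their requests get filtered out.
I think the root cause is that I still do not really understand how Scrapy works internally.
Consider the following code: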
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy import Request
from example.items import xxxxItem
import re


class xxxxSpider(Spider):
    name = "example"
    allowed_domains = ["xxxx.com.cn"]
    # Article URLs contain a date (YYYY-MM-DD) followed by a slug.
    pat = 'http://finance.xxxx.com.cn/.*[0-9]{4}-[0-9]{2}-[0-9]{2}/[a-z]*-[a-z0-9]*.*'

    def start_requests(self):
        yield Request(url="http://finance.xxxx.com.cn/", callback=self.parse)

    def parse(self, response):
        if response.status == 200:
            # Extract every link on the page, then keep only article URLs.
            URLgroup = LinkExtractor(allow=()).extract_links(response)
            for URL in URLgroup:
                key = re.findall(self.pat, URL.url)
                if key:
                    # only crawl url with a fixed prefix
                    yield Request(url=URL.url, callback=self.parse_content)

    def parse_content(self, response):
        if response.status == 200:
            content = Selector(response)
            text = content.xpath("/html/body//div[@id='artibody']//p/descendant::text()").extract()
            if text:
                item = xxxxItem()
                # Join all text fragments of the article body into one string.
                item["text"] = ''.join(text)
                yield item
            # Request the same page again so that parse() can follow its links.
            # This URL has already been seen, so dont_filter=True is needed
            # to get it past the duplicate filter.
            yield Request(url=response.url, callback=self.parse, dont_filter=True)
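Setting dont_filter to True on the request in the last line keeps the spider from stopping midway. The URL of that page has already been requested once, so without the flag the duplicate filter would drop the second request; with the flag the request is not filtered, and the crawl continues.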
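To make the filtering behaviour easier to see, here is a minimal sketch that models it in plain Python, without Scrapy installed. FakeRequest and ToyScheduler are hypothetical names I made up for illustration, and the fingerprint is simplified to a hash of the URL alone (Scrapy's own fingerprint also takes the method and body into account). The only part meant to mirror Scrapy is the rule itself: a request is dropped when it has been seen before and dont_filter is False.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request with only the fields the filter needs."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Simplified model of a scheduler with a duplicate filter (not Scrapy's real class)."""

    def __init__(self):
        self.seen = set()     # fingerprints of requests that went through the filter
        self.queue = deque()  # requests accepted for download

    def fingerprint(self, request):
        # Assumption: hashing the URL alone is enough for this illustration.
        return hashlib.sha1(request.url.encode("utf-8")).hexdigest()

    def enqueue(self, request):
        # Requests with dont_filter=True skip the duplicate check entirely.
        if not request.dont_filter:
            fp = self.fingerprint(request)
            if fp in self.seen:
                return False  # duplicate: dropped, never downloaded
            self.seen.add(fp)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
url = "http://finance.xxxx.com.cn/"
first = scheduler.enqueue(FakeRequest(url))                     # new URL: accepted
repeat = scheduler.enqueue(FakeRequest(url))                    # same URL again: dropped
forced = scheduler.enqueue(FakeRequest(url, dont_filter=True))  # bypasses the filter
print(first, repeat, forced, len(scheduler.queue))  # prints "True False True 2"
```

In the spider above, the second request in parse_content is in the same position as the repeat case here: the page was already fetched once for parse_content, so fetching it again for parse only works when the filter is bypassed.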