Customizing HttpCacheMiddleware in Scrapy

Scrapy supports caching request data out of the box: it ships two storage backends (DbmCacheStorage, FilesystemCacheStorage) and two cache policies (DummyPolicy, RFC2616Policy). The defaults are FilesystemCacheStorage for storage and DummyPolicy, which caches every response that is fetched.
To enable the cache, configure:

HTTPCACHE_ENABLED = True  # enable the cache
HTTPCACHE_EXPIRATION_SECS = 0  # expiration time in seconds (0 = never expire)
HTTPCACHE_DIR = '/data/ajk/httpcache'  # storage path
HTTPCACHE_IGNORE_HTTP_CODES = []  # HTTP codes to skip caching
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'  # local storage backend
HTTPCACHE_GZIP = False  # gzip the cached files
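These settings can also be scoped to a single spider instead of the whole project. A minimal sketch, reusing the values above (in a real spider this dict would be the `custom_settings` class attribute):

```python
# Sketch: the same cache settings as a dict, suitable for a spider's
# custom_settings class attribute (scopes the cache to that spider only).
custom_settings = {
    'HTTPCACHE_ENABLED': True,
    'HTTPCACHE_EXPIRATION_SECS': 0,   # 0 = cached entries never expire
    'HTTPCACHE_DIR': '/data/ajk/httpcache',
    'HTTPCACHE_IGNORE_HTTP_CODES': [],
    'HTTPCACHE_STORAGE': 'scrapy.extensions.httpcache.FilesystemCacheStorage',
    'HTTPCACHE_GZIP': False,
}
```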

The cache is implemented in scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware. During crawling, when the target site redirects to a captcha page, the request gives up after 3 retries and the final (captcha) response is cached. As a result, even after switching to a new IP, re-fetching the same URL just returns the cached captcha redirect. The bad entries are only a small fraction of the cache, so deleting the whole cache and re-crawling all the successfully fetched data is not an option; instead, HttpCacheMiddleware needs to be rewritten.
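Part of what makes a targeted fix feasible is that FilesystemCacheStorage keeps each response in its own directory, keyed by the request fingerprint, so one bad entry can be deleted without touching the rest. A stdlib-only sketch of that layout (the real `request_fingerprint` also canonicalizes the URL and can include selected headers, which is omitted here):

```python
import hashlib
import os

def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    # Simplified stand-in for scrapy.utils.request.request_fingerprint:
    # a SHA1 over the request method, URL and body.
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

def cache_path(cache_dir: str, spider_name: str, fp: str) -> str:
    # FilesystemCacheStorage layout: <dir>/<spider>/<fp[0:2]>/<fp>
    return os.path.join(cache_dir, spider_name, fp[0:2], fp)
```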

分析源碼后,決定當(dāng)從緩存中獲取數(shù)據(jù)后,檢測是否跳轉(zhuǎn)請求,再檢測跳轉(zhuǎn)的url是否是無效的驗證碼url,無效的緩存需要將緩存文件刪除并返回讀取的數(shù)據(jù)為空,代碼如下:

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.request import request_fingerprint
from ajk.proxy import *
import logging, os, shutil

class AjkCacheMiddleware(HttpCacheMiddleware):

    def process_request(self, request, spider):
        if request.meta.get('dont_cache', False):
            return

        # Skip uncacheable requests
        if not self.policy.should_cache_request(request):
            request.meta['_dont_cache'] = True  # flag as uncacheable
            return

        # Look for cached response and check if expired
        cachedresponse = self.storage.retrieve_response(spider, request)

        if cachedresponse is None:
            self.stats.inc_value('httpcache/miss', spider=spider)
            if self.ignore_missing:
                self.stats.inc_value('httpcache/ignore', spider=spider)
                raise IgnoreRequest("Ignored request not in cache: %s" % request)
            return  # first time request

        # If the cached response is a 302 redirect pointing at the
        # captcha-verify page, the entry is invalid: delete its files on
        # disk and return None so the request goes to the network again.
        if cachedresponse.status == 302 and cachedresponse.url.find('captcha-verify/') > -1:
            cachepath = self._get_request_path(spider, request)
            shutil.rmtree(cachepath)
            return

        # Return cached response only if not expired
        cachedresponse.flags.append('cached')
        if self.policy.is_cached_response_fresh(cachedresponse, request):
            self.stats.inc_value('httpcache/hit', spider=spider)
            return cachedresponse

        # Keep a reference to cached response to avoid a second cache lookup on
        # process_response hook
        request.meta['cached_response'] = cachedresponse

    def _get_request_path(self, spider, request):
        # Rebuild the on-disk path used by FilesystemCacheStorage:
        # <HTTPCACHE_DIR>/<spider.name>/<fingerprint[0:2]>/<fingerprint>
        key = request_fingerprint(request)
        return os.path.join(spider.settings['HTTPCACHE_DIR'], spider.name, key[0:2], key)
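For Scrapy to use this class, it has to replace the stock middleware in the settings; 900 is the slot HttpCacheMiddleware occupies in DOWNLOADER_MIDDLEWARES_BASE (the module path ajk.middlewares is an assumption about the project layout):

```python
# Disable the built-in cache middleware and register the custom one in
# its slot (900). 'ajk.middlewares' is an assumed module path.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': None,
    'ajk.middlewares.AjkCacheMiddleware': 900,
}
```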

To improve efficiency further, responses served from the cache should not be re-parsed into Items and re-inserted into the database. Before generating an item, check whether the response came from the cache, e.g.:

if 'cached' in response.flags: return
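In a spider callback, that early return looks like the sketch below. SimpleNamespace stands in for scrapy.http.Response so the logic can be shown self-contained; the item dict is hypothetical, and only the flags check comes from the middleware above:

```python
from types import SimpleNamespace

def parse(response):
    # Responses served by HttpCacheMiddleware carry the 'cached' flag;
    # skip item extraction for them so nothing is re-inserted into the DB.
    if 'cached' in response.flags:
        return None
    return {'url': response.url}  # hypothetical item

# Stand-ins for a fresh and a cache-served response:
fresh = SimpleNamespace(url='http://example.com/1', flags=[])
cached = SimpleNamespace(url='http://example.com/2', flags=['cached'])
```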