Odds and ends of using a proxy with Scrapy

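The earlier post "Scrapy framework and middlewares" covered how data flows through the middlewares. Crawling through a proxy touches several of those middleware details, so this is a good chance to go through them together.
I find the best way to write a middleware is to first locate the related built-in one, then override its request/response/exception handling to suit your needs.
Scrapy's built-in downloadermiddlewares should already cover most needs. The documentation lists all of them together with their order, and the middleware documentation spells out which settings each one needs in order to be enabled.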

{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,  # robots.txt protocol
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,  # HTTP authentication
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,  # compression, i.e. Accept-Encoding: gzip, deflate
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,  # redirects (301, 302)
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,  # proxy
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,  # low-level cache support
}
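
The order values matter because Scrapy calls process_request on the downloader middlewares in increasing order and process_response in decreasing order. A minimal sketch of that rule in plain Python, using a small subset of the table above (the helper names request_chain and response_chain are made up for illustration and are not part of Scrapy):

```python
# A subset of the table above: middleware name -> order value.
ORDERS = {
    "RetryMiddleware": 550,
    "HttpProxyMiddleware": 750,
    "RobotsTxtMiddleware": 100,
}


def request_chain(orders):
    """Names in the order their process_request hooks run (ascending order value)."""
    return [name for name, _ in sorted(orders.items(), key=lambda kv: kv[1])]


def response_chain(orders):
    """Names in the order their process_response hooks run (descending order value)."""
    return list(reversed(request_chain(orders)))


print(request_chain(ORDERS))
print(response_chain(ORDERS))
```

So a middleware at 550 sees the outgoing request before the one at 750, and sees the response after it.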

And the spidermiddlewares:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,  # drops non-2xx responses before they reach the spider
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,  # filters out requests to URLs outside allowed_domains
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,  # sets the Referer request header from the response that produced the request
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,  # limits the length of crawled URLs
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,  # limits the crawl depth
}

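This time I want to crawl the Baidu home page through a proxy. The basic flow I had in mind:
1. Load the IP list in settings and set DOWNLOAD_TIMEOUT=3. The default is 180, and 3 minutes is too long.
2. Modify HttpProxyMiddleware so that every request uses the first proxy in the list from settings.
3. Modify RetryMiddleware so that when a timeout or similar error occurs (override process_exception), or the IP is banned and something like a 503 comes back (override process_response), that IP is removed and the shortened IP list is written back to settings. When the IP list is empty, close the spider.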
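
The steps above can be sketched without Scrapy as a small pool object. ProxyPool, its method names and the addresses below are all hypothetical. They only illustrate the idea of always using the first proxy, dropping it on failure, and stopping when nothing is left.

```python
class ProxyPool:
    """Hypothetical helper: hand out the first proxy, drop it when it fails."""

    def __init__(self, proxies):
        self._proxies = list(proxies)

    def current(self):
        """Return the first proxy, or None when the pool is exhausted."""
        return self._proxies[0] if self._proxies else None

    def drop_current(self):
        """Remove the first proxy and report whether any proxies remain."""
        if self._proxies:
            self._proxies.pop(0)
        return bool(self._proxies)


# Placeholder addresses, not real proxies.
pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
first = pool.current()            # proxy used for the first request
has_more = pool.drop_current()    # first proxy timed out or returned 503
second = pool.current()           # next request falls through to the second proxy
exhausted = not pool.drop_current()  # pool is now empty: time to close the spider
```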

middleware:

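import logging
import random
import time

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.project import get_project_settings
from scrapy.utils.response import response_status_message

# scrapy.log is deprecated and absent from recent Scrapy releases;
# the standard logging module is used instead.
logger = logging.getLogger(__name__)


class MyProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if not proxies:
            # The pool is exhausted: drop the request instead of failing on proxies[0].
            raise IgnoreRequest('no proxies left')

        # Always use the first proxy in the list.
        logger.debug('now ip is %s', proxies[0])
        request.meta['proxy'] = proxies[0]


class MyRetryMiddleware(RetryMiddleware):
    def delete_proxy(self, spider):
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if proxies:
            # pop() changes the list in place, so later lookups see the shorter list.
            proxies.pop(0)
            settings.set('IPOOL', proxies)
        if not proxies:
            # No proxies left: stop the spider.
            spider.crawler.engine.close_spider(spider, 'response msg error , job done!')

    def process_exception(self, request, exception, spider):
        # Timeouts and similar errors: drop the proxy, wait, then retry.
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            self.delete_proxy(spider)
            # Note: time.sleep() blocks the whole crawler, not just this request.
            time.sleep(random.randint(3, 5))
            return self._retry(request, exception, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 200:
            # A successful response also drops the proxy, so the next request
            # moves on to the next one. Remove this call to keep a working proxy.
            self.delete_proxy(spider)
            return response
        if response.status in self.retry_http_codes:
            # Banned IP (e.g. 503): drop the proxy, wait, then retry.
            reason = response_status_message(response.status)
            self.delete_proxy(spider)
            time.sleep(random.randint(3, 5))
            return self._retry(request, reason, spider) or response
        return response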

settings:

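import pandas as pd

# Proxy list: keep only the rows marked as usable.
df = pd.read_csv('F:\\pycharm project\\pachong\\vpn.csv')
IPOOL = df['address'][df['status'] == 'yes'].tolist()

DOWNLOADER_MIDDLEWARES = {
    # 'mytset.middlewares.MytsetDownloaderMiddleware': 543,
    # Disable the built-in versions so they do not run alongside the subclasses.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'mytset.middlewares.MyRetryMiddleware': 550,
    'mytset.middlewares.MyProxyMiddleware': 750,
}

# The default is 180 seconds, too long to wait on a dead proxy.
DOWNLOAD_TIMEOUT = 3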
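
If pulling in pandas only to read the proxy list feels heavy, the same filter can be sketched with the standard csv module. This assumes the file has address and status columns, as the settings above do. The function name load_proxies and the sample rows are hypothetical, and an in-memory file stands in for vpn.csv.

```python
import csv
import io


def load_proxies(csv_file):
    """Return the address of every row whose status column is 'yes'."""
    reader = csv.DictReader(csv_file)
    return [row["address"] for row in reader if row["status"] == "yes"]


# Stand-in for the real file; the addresses are placeholders.
sample = io.StringIO(
    "address,status\n"
    "http://10.0.0.1:8080,yes\n"
    "http://10.0.0.2:8080,no\n"
)
IPOOL = load_proxies(sample)
```

With a real file, the call would be wrapped in `with open(path, newline='') as f:` and `f` passed in place of `sample`.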

spider:

import scrapy
from pyquery import PyQuery as pq

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']

    def start_requests(self):
        # Send the same request 30 times; dont_filter bypasses the duplicate filter.
        for _ in range(30):
            yield scrapy.Request(url='http://www.baidu.com/', callback=self.parse, dont_filter=True)

    def parse(self, response):
        res = pq(response.body)
        # Print the proxy that served this response and the page title.
        proxy = response.meta['proxy']
        print(proxy)
        print(res('title').text())