middlewares: downloader middlewares, resumable crawling, and settings.py parameters

User-Agent
Cookies
IP
Selenium

1. User-Agent

Add a few UA strings to the settings.py file:

USERAGENT = [
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]

Set up the User-Agent middleware in middlewares.py.

There are two ways to read the value from settings.py (methods 1 and 2 below); method 3 skips settings.py and generates UA strings with the fake_useragent library.
Method 1:

import random

class UserAgentDownloadMiddlerware(object):
    def __init__(self, User_Agents):
        self.User_Agents = User_Agents

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the crawler, which gives access to the settings
        User_Agents = crawler.settings['USERAGENT']
        return cls(User_Agents)

    def process_request(self, request, spider):
        """
        Every request passes through this method before it is
        handed to the downloader.
        :param request:
        :param spider:
        :return:
        """
        random_ua = random.choice(self.User_Agents)
        if random_ua:
            request.headers['User-Agent'] = random_ua

Method 2:

import random

class UserAgentDownloadMiddlerware(object):
    def process_request(self, request, spider):
        # Read the UA pool straight from the spider's settings
        User_Agent = spider.settings['USERAGENT']
        random_ua = random.choice(User_Agent)
        if random_ua:
            request.headers['User-Agent'] = random_ua

Method 3:

from fake_useragent import UserAgent

class UserAgentDownloadMiddlerware(object):
    def process_request(self, request, spider):
        useAgent = UserAgent()
        random_ua = useAgent.random
        if random_ua:
            print('passed through the downloader middleware', random_ua)
            request.headers['User-Agent'] = random_ua
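Method 3 depends on the third-party fake_useragent package (pip install fake-useragent); since it fetches real-world browser UA strings on its own, this variant needs no USERAGENT list in settings.py.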

2. IP proxy middleware

Simulate a proxy pool in the settings.py file:

PROXIES = [
    {'ip': '127.0.0.1:6379', 'pwd': 'zwz:1234'},  # with username/password
    {'ip': '127.0.0.1:6372', 'pwd': None},        # without username/password
    {'ip': '127.0.0.1:6373', 'pwd': None},
    {'ip': '127.0.0.1:6370', 'pwd': None}
]

Set up the proxy middleware in middlewares.py:

import base64
import random

class ProxyDownloadMiddlerware(object):
    def process_request(self, request, spider):
        proxies = spider.settings['PROXIES']
        proxy_rm = random.choice(proxies)

        if proxy_rm['pwd']:
            # Proxy with username/password:
            # base64-encode the 'user:password' credentials
            base64_pwd = base64.b64encode(proxy_rm['pwd'].encode('utf-8')).decode('utf-8')
            # The header format the proxy server expects for Basic auth
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_pwd
            # Set the proxy ip
            request.meta['proxy'] = proxy_rm['ip']
        else:
            # Set the proxy ip
            request.meta['proxy'] = proxy_rm['ip']
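Note that Scrapy's built-in HttpProxyMiddleware expects request.meta['proxy'] to be a full URL including the scheme, so prefixing 'http://' to the bare host:port above is safer. Recent Scrapy versions (1.4+, to my knowledge) can also parse the credentials straight out of that URL and set Proxy-Authorization themselves, which makes the manual base64 step unnecessary; a minimal sketch under that assumption:

import random

def process_request(self, request, spider):
    proxy_rm = random.choice(spider.settings['PROXIES'])
    if proxy_rm['pwd']:
        # 'user:pass@host:port'; HttpProxyMiddleware extracts the credentials
        request.meta['proxy'] = 'http://{}@{}'.format(proxy_rm['pwd'], proxy_rm['ip'])
    else:
        request.meta['proxy'] = 'http://' + proxy_rm['ip']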

3. Cookie middleware

Simulate a cookie pool in the settings.py file:

COOKIES = [
    {'cookie1': 'xxxx'},
    {'cookie1': 'xxxx'},
    {'cookie1': 'xxxx'},
    {'cookie1': 'xxxx'},
    {'cookie1': 'xxxx'},
]

Set up the cookie middleware in middlewares.py:

import random
class RandomCookiesMiddleware(object):

    def process_request(self, request, spider):
        cookies = spider.settings['COOKIES']
        # Pick a random cookie from the pool
        cookie = random.choice(cookies)
        if cookie:
            request.cookies = cookie
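Note that cookies attached via request.cookies are applied by Scrapy's built-in CookiesMiddleware, so this middleware only has an effect while COOKIES_ENABLED is True; with COOKIES_ENABLED = False (as in the sample settings further down) they are silently dropped.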

4. Using Selenium to crawl dynamically rendered pages

Add the driver inside the spider of your project. Some pages are dynamic and others are static, so only the spiders that crawl dynamic pages should carry a driver.

import scrapy
from selenium import webdriver

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    # Create the browser driver
    driver = webdriver.Firefox(
        executable_path='/home/zwz/Desktop/瀏覽器驅(qū)動/geckodriver/'
    )
    driver.set_page_load_timeout(10)

    def parse(self, response):
        print(response.status,response.request.headers)
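On a server without a display you will likely want the browser to run headless; a minimal sketch, assuming a Selenium 3.x-style API where webdriver.Firefox accepts an options argument (check this against your installed Selenium version):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('-headless')  # run Firefox without opening a window
driver = webdriver.Firefox(
    executable_path='/home/zwz/Desktop/瀏覽器驅(qū)動/geckodriver/',
    options=options
)
driver.set_page_load_timeout(10)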

Set up the Selenium middleware in middlewares.py:

# Scrapy itself cannot render dynamically loaded pages
from scrapy import signals
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse

class SeleniumDownloadMiddlerWare(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        # Connect a handler to the spider_closed signal so the
        # browser is shut down when the spider finishes
        crawler.signals.connect(s.close, signal=signals.spider_closed)
        return s

    def close(self, spider):
        import time
        time.sleep(5)
        spider.driver.close()

    def process_request(self, request, spider):
        if spider.name == 'test':
            # The driver lives on the spider, so use spider.driver here
            url = request.url

            if url:
                try:
                    spider.driver.get(url)
                    pageSource = spider.driver.page_source

                    if pageSource:
                        """
                        HtmlResponse(url, status=200, headers=None,
                                     body=b'', flags=None, request=None)
                        """
                        return HtmlResponse(
                            url=url,
                            status=200,
                            body=pageSource.encode('utf-8'),
                            request=request
                        )

                except TimeoutException as err:
                    print('request timed out', url)
                    return HtmlResponse(
                        url=url,
                        status=408,
                        body=b'',
                        request=request
                    )
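Note that when process_request() returns a Response, Scrapy skips the remaining downloader middlewares' process_request() hooks and the actual download, and passes that response on instead, so parse() receives the page exactly as Selenium rendered it.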


Finally, don't forget to register and activate the downloader middlewares.

In the settings.py file:

DOWNLOADER_MIDDLEWARES = {
   # 'downloadmiddlerware.middlewares.DownloadmiddlerwareDownloaderMiddleware': 543,
    'downloadmiddlerware.middlewares.UserAgentDownloadMiddlerware':543,
    'downloadmiddlerware.middlewares.ProxyDownloadMiddlerware':544,
    'downloadmiddlerware.middlewares.RandomCookiesMiddleware':545,
    'downloadmiddlerware.middlewares.SeleniumDownloadMiddlerWare':546,
}
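The number after each middleware is its order: middlewares with smaller numbers sit closer to the engine, so their process_request() runs earlier (and their process_response() later). To disable one of Scrapy's built-in downloader middlewares, map its path to None here.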

Resumable (checkpoint) crawling

scrapy crawl <spider_name> -s JOBDIR=crawls/<spider_name>

requests.queue : the saved queue of pending requests
requests.seen  : the saved request fingerprints (duplicate filter)
spider.status  : the spider's run state
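Pausing and resuming works as described in Scrapy's jobs documentation: press Ctrl-C once (or send SIGTERM) and wait for the graceful shutdown, then re-run the exact same command with the same JOBDIR to pick up from the saved queue. For the test spider above that would be:

scrapy crawl test -s JOBDIR=crawls/test
# ... Ctrl-C once, wait for Scrapy to finish saving state ...
scrapy crawl test -s JOBDIR=crawls/test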

Scrapy settings.py (relevant parameters)

# Project name
BOT_NAME = 'downloadmiddlerware'

# Module path where the spiders are stored
SPIDER_MODULES = ['downloadmiddlerware.spiders']
# Newly generated spider files are placed under this module
NEWSPIDER_MODULE = 'downloadmiddlerware.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set a UA to make requests look like they come from a browser
#USER_AGENT = 'downloadmiddlerware (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether to obey the robots.txt protocol; defaults to True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests (downloader-wide); defaults to 16
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay between requests; defaults to 0
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Maximum concurrent requests per domain; defaults to 8
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Maximum concurrent requests per IP; defaults to 0
# If non-zero:
# 1. CONCURRENT_REQUESTS_PER_DOMAIN no longer applies;
#    concurrency is counted per IP rather than per domain
# 2. DOWNLOAD_DELAY is likewise applied per IP rather than per domain
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Whether cookies are enabled; defaults to True
COOKIES_ENABLED = False

# COOKIES_DEBUG: log cookie traffic; defaults to False
COOKIES_DEBUG = True

# Logging settings
LOG_FILE = 'xxx.log'
LOG_LEVEL = 'INFO'  # or 'DEBUG', 'WARNING', 'ERROR', 'CRITICAL'

# Disable Telnet Console (enabled by default)
# The telnet console extension for inspecting a running crawler
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers (do not put cookie values here; let the cookie middleware manage them per request)
DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Register and activate spider middlewares
#SPIDER_MIDDLEWARES = {
#    'downloadmiddlerware.middlewares.DownloadmiddlerwareSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Register and activate downloader middlewares (the number is the priority)
DOWNLOADER_MIDDLEWARES = {
   # 'downloadmiddlerware.middlewares.DownloadmiddlerwareDownloaderMiddleware': 543,
   #  'downloadmiddlerware.middlewares.UserAgentDownloadMiddlerware':543,
    'downloadmiddlerware.middlewares.SeleniumDownloadMiddlerWare':543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# Configure extensions
EXTENSIONS = {
   'scrapy.extensions.telnet.TelnetConsole': None,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Register and activate item pipelines (the number is the priority)
ITEM_PIPELINES = {
    'downloadmiddlerware.pipelines.DownloadmiddlerwarePipeline': 300,
}

# The AutoThrottle extension (makes the delay between consecutive requests variable)
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# Disabled by default: AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_ENABLED = True
# The initial download delay; defaults to 5 seconds
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received
# Debug mode; defaults to False (off)
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTP response caching; disabled by default
HTTPCACHE_ENABLED = True
# Cache expiry in seconds; 0 means cached responses never expire
HTTPCACHE_EXPIRATION_SECS = 0
# Directory where the cache is stored
HTTPCACHE_DIR = 'httpcache'
# Don't cache responses with these HTTP status codes
HTTPCACHE_IGNORE_HTTP_CODES = []
# Storage backend for the cache
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
