1. Introduction to CrawlSpider

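Scrapy provides two kinds of spiders: the Spider class and the CrawlSpider class.

CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule objects) that give it a convenient mechanism for following links. That makes it the better fit when the job is to extract links from crawled pages and keep crawling them.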

Creating a project

scrapy startproject <project name>

Creating a spider from the crawl template:

scrapy genspider -t crawl <spider name> <domain>

CrawlSpider inherits from Spider. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:

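LinkExtractor: the link extractor

class scrapy.linkextractors.LinkExtractor

The purpose of a LinkExtractor is simple: to extract links.
Each LinkExtractor has a single public method, extract_links(), which takes a response object and returns a list of scrapy.link.Link objects.
A LinkExtractor is instantiated once, and its extract_links() method is then called many times, once per response, to extract links.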

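Main parameters:
1. allow: URLs that match the given regular expression (or list of regular expressions) are extracted. If empty, every link matches.
2. deny: URLs that match this regular expression (or list of regular expressions) are never extracted. deny takes precedence over allow.
3. allow_domains: the domains whose links will be extracted.
4. deny_domains: the domains whose links will never be extracted.
5. restrict_xpaths: XPath expressions that work together with allow to filter links; only links inside the matched regions are considered.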
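
How allow, deny, and the domain filters combine can be sketched with the standard library alone. This is a simplified model, not Scrapy's implementation: the function name filter_links and the example URLs are made up for illustration, patterns are assumed to be applied with search semantics, and domains are compared by exact host name only.

```python
import re
from urllib.parse import urlparse


def filter_links(urls, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Return the URLs that pass a LinkExtractor-style filter (simplified)."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        # An empty allow list means "match everything".
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # A deny match always excludes the URL, even if allow matched.
        if any(r.search(url) for r in deny_res):
            continue
        if allow_domains and host not in allow_domains:
            continue
        if host in deny_domains:
            continue
        kept.append(url)
    return kept


# Hypothetical URLs, used only to exercise the filter.
urls = [
    "http://example.com/8hr/page/2/",
    "http://example.com/8hr/page/3/?logout=1",
    "http://example.com/about",
    "http://ads.example.net/8hr/page/9/",
]
kept = filter_links(
    urls,
    allow=[r"/8hr/page/\d+"],
    deny=[r"logout"],
    deny_domains=["ads.example.net"],
)
print(kept)
```

Only the first URL survives: the second is rejected by deny, the third never matches allow, and the fourth belongs to a denied domain.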

rules

rules holds one or more Rule objects.
Each Rule defines a specific behavior for crawling the site.
If several rules match the same link, the first one, in the order they are defined in rules, is used.
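
The "first matching rule wins" behavior can be pictured with a small standard-library model. It is only a sketch: assign_links, the rule names, and the paths are hypothetical, and each rule is reduced to a single regular expression.

```python
import re


def assign_links(links, rules):
    """Map each link to the name of the first rule whose pattern matches it."""
    seen = set()
    assignment = {}
    for name, pattern in rules:        # rules are tried in definition order
        for link in links:
            if link in seen:
                continue               # an earlier rule already claimed this link
            if re.search(pattern, link):
                seen.add(link)
                assignment[link] = name
    return assignment


# Both patterns match "/item/42", but "detail" is defined first.
rules = [
    ("detail", r"/item/\d+"),
    ("catch_all", r"/item/"),
]
links = ["/item/42", "/item/list", "/about"]
assignment = assign_links(links, rules)
print(assignment)
```

Swapping the two entries in rules would hand "/item/42" to catch_all instead, which is why more specific rules usually go first.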

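Rule parameters

link_extractor: a LinkExtractor object that defines which links to extract.

callback: the function named here is called as the callback for each link obtained from link_extractor. The callback receives a response as its first argument.
Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if parse is overridden, the CrawlSpider will fail to run.

follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: specifies which function of the spider is called with the list of links obtained from link_extractor. It is mainly used for filtering links.

process_request: specifies which function of the spider is called for every request extracted by this rule. (It is used to filter requests.)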
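
The follow default and the process_links hook described above can be mimicked with a stand-in class. SketchRule and drop_logout_links are hypothetical names, links are plain strings rather than Link objects, and the hook is passed as a function directly; this is a model of the described behavior, not Scrapy's Rule class.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchRule:
    """A stand-in for Rule that keeps only the parts discussed above."""
    callback: Optional[str] = None
    follow: Optional[bool] = None
    process_links: Optional[Callable[[List[str]], List[str]]] = None

    def __post_init__(self):
        # follow defaults to True when there is no callback, otherwise to False.
        if self.follow is None:
            self.follow = self.callback is None

    def links_to_request(self, links):
        # Give the hook a chance to drop or rewrite links before they are requested.
        if self.process_links is None:
            return list(links)
        return self.process_links(links)


def drop_logout_links(links):
    """Example process_links hook: discard any link that would log the user out."""
    return [url for url in links if "logout" not in url]


rule = SketchRule(callback="parse_item", process_links=drop_logout_links)
print(rule.follow)
print(rule.links_to_request(["/page/2/", "/logout", "/page/3/"]))
```

Because the rule has a callback and no explicit follow, follow ends up False; passing follow=True, as the spiders in this article do, overrides that default.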

The generated spider file, project.py

# -*- coding: utf-8 -*-
import scrapy
# Import the CrawlSpider-related modules
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# This spider is based on the CrawlSpider class
class CrawldemoSpider(CrawlSpider):
    name = 'crawlDemo'    # spider name
    #allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    # Link extractor: extracts the matching URLs from the pages returned for the start URLs
    link = LinkExtractor(allow=r'/8hr/page/\d+')
    # The rules tuple holds the rule parsers (each one wraps a particular parsing rule)
    rules = (
        # Rule parser: parses every page reached through the extracted links with the given rule (the callback)
        Rule(link, callback='parse_item', follow=True),
    )
    # Parsing method
    def parse_item(self, response):
        #print(response.url)
        divs = response.xpath('//div[@id="content-left"]/div')
        for div in divs:
            author = div.xpath('./div[@class="author clearfix"]/a[2]/h2/text()').extract_first()
            print(author)

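<!-- The biggest difference between the CrawlSpider class and the Spider class is that CrawlSpider has an extra rules attribute, whose job is to define "extraction actions". rules can contain one or more Rule objects, and each Rule object contains a LinkExtractor object. -->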

Parameters in the generated spider file

LinkExtractor: as the name suggests, a link extractor.

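LinkExtractor(
allow=r'Items/', # URLs matching the regular expression are extracted; if empty, everything matches.
deny=xxx, # URLs matching the regular expression are not extracted.
restrict_xpaths=xxx, # only links inside regions matched by the XPath expression are extracted
restrict_css=xxx, # only links inside regions matched by the CSS selector are extracted
deny_domains=xxx, # domains whose links are not extracted.
)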

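Purpose: extract the links in the response that match the rules.

Rule: the rule parser. It takes the links extracted by the link extractor and parses the content of the linked pages according to the specified rule.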

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

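Parameters:

Parameter 1: the link extractor.
Parameter 2: the rule the rule parser uses to parse the data (the callback).
Parameter 3: whether the link extractor is applied again to the pages reached through the links it extracted. When callback is None, parameter 3 defaults to True.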

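rules = ( ): specifies the rule parsers. Each Rule object represents one extraction rule.

Overall CrawlSpider crawl flow:

a) The spider first fetches the page content for each start URL.

b) The link extractor extracts links from the page content of step a according to its extraction rule.

c) The rule parser parses the content of the pages behind the extracted links according to its parsing rule.

d) The parsed data is packed into an item and submitted to the pipeline for persistent storage.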
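
Steps a) to d) can be walked through with a toy crawler that needs no network and no Scrapy. Everything in it is hypothetical: the in-memory PAGES site, the regular-expression parsing, and the function names. It only models the order of operations with a single rule that has follow=True.

```python
import re

# A hypothetical three-page site held in memory instead of being fetched over HTTP.
PAGES = {
    "/": '<a href="/page/1/">1</a> <a href="/about">about</a>',
    "/page/1/": '<p class="author">alice</p> <a href="/page/2/">2</a>',
    "/page/2/": '<p class="author">bob</p> <a href="/page/1/">1</a>',
}


def extract_links(html, allow):
    """Step b: pull hrefs out of the page and keep those matching `allow`."""
    return [h for h in re.findall(r'href="([^"]+)"', html) if re.search(allow, h)]


def parse_item(html):
    """Step c: parse one page into an item (a plain dict here)."""
    match = re.search(r'class="author">([^<]+)<', html)
    return {"author": match.group(1)} if match else None


def crawl(start_url, allow):
    stored = []                     # step d: stands in for the item pipeline
    seen = {start_url}
    queue = [(start_url, False)]    # (url, should the callback run on it?)
    while queue:
        url, run_callback = queue.pop(0)
        html = PAGES[url]           # step a: "download" the page
        if run_callback:
            item = parse_item(html)
            if item is not None:
                stored.append(item)
        # follow=True: keep extracting links from every page that was reached
        for link in extract_links(html, allow):
            if link not in seen:
                seen.add(link)
                queue.append((link, True))
    return stored


items = crawl("/", allow=r"/page/\d+/")
print(items)
```

The start page is only mined for links, while each page reached through a matching link is both parsed and mined again, so the crawl reaches page 2 even though the start page never links to it directly.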

A CrawlSpider-based example

1. The spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiBycrawl.items import QiubaibycrawlItem

class QiubaitestSpider(CrawlSpider):
    name = 'qiubaiTest'
    # Start URL
    start_urls = ['http://www.qiushibaike.com/']

    # Define the link extractor and its extraction rule
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # Define the rule parser; the parsing rule is given by the callback
        Rule(page_link, callback='parse_item', follow=True),
    )

    # Parsing function used by the rule parser
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            # Create the item
            item = QiubaibycrawlItem()
            # Extract the post's author with an XPath expression
            # (default='' avoids calling strip() on None when nothing matches)
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip('\n')
            # Extract the post's content with an XPath expression
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip('\n')

            yield item  # submit the item to the pipeline

2. The items file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaibycrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author
    content = scrapy.Field()  # content

3. The pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class QiubaibycrawlPipeline(object):
    
    def __init__(self):
        self.fp = None
        
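    def open_spider(self,spider):
        print('Spider started')
        # utf-8 so that non-ASCII text is written correctly on every platform
        self.fp = open('./data.txt','w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item submitted by the spider to the file for persistent storage
        self.fp.write(item['author']+':'+item['content']+'\n')
        return item

    def close_spider(self,spider):
        print('Spider finished')
        self.fp.close()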
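
As the template comment in the pipeline file says, the pipeline only runs once it is listed in the ITEM_PIPELINES setting. A minimal settings.py fragment might look like the following; the dotted path assumes the project package is qiubaiBycrawl (as in the spider's import) and that the class lives in the default pipelines.py, and the priority 300 is just a conventional value.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed dotted path: <project package>.pipelines.<pipeline class>
    'qiubaiBycrawl.pipelines.QiubaibycrawlPipeline': 300,
}
```

The integer decides the order when several pipelines are enabled: lower values run first.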

https://www.cnblogs.com/sunxiuwen/p/10121068.html
