
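The Scrapy project structure and the role of each component were covered earlier. What deserves more attention is how Scrapy works internally and how a request is processed, because that is what makes the concept of middleware and the purpose of each function understandable. The goal of this project is to crawl novel information from Yunqi Shuyuan (yunqi.qq.com), store it in MongoDB, and deduplicate requests with Redis. Because Redis is used, the crawler can also run in a distributed fashion. Create a new crawler project with Scrapy: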
scrapy startproject yunqiCrawl
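Page analysis
Each novel entry in the Yunqi Shuyuan listing exposes the fields we need to extract: title, author, category, status, update time, total word count, cover image URL, and novel ID.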

Clicking through to a novel's own page shows its popularity figures, which we crawl as well.

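Run scrapy shell <url> to open the Scrapy shell for a page, then use the resulting response object to try out XPath expressions step by step. Once an expression extracts the right information from the page, fix it as the one used in the spider.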

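The project needs two kinds of Item: one for the novel information on the list pages, and one for a novel's popularity information. Define both first; the code below declares the fields that hold the data.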
vim yunqiCrawl/items.py

import scrapy


class YunqiBookListItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    novelId = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    link = scrapy.Field()
    status = scrapy.Field()
    updateTime = scrapy.Field()
    wordsCount = scrapy.Field()
    imageUrl = scrapy.Field()
    novelType = scrapy.Field()


class YunqiBookDetialItem(scrapy.Item):
    novelId = scrapy.Field()
    allClick = scrapy.Field()
    allLike = scrapy.Field()
    weekLike = scrapy.Field()

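Next, write the spider module that crawls the pages and sets start_urls and related attributes. A few words on how Request objects work: after the spider returns a Request, it passes through the scheduler and the downloader, which produce a Response that is handed back to the spider. A spider can define several parsing functions; passing one of them as the callback argument when constructing a Request decides which function parses the Response for that Request. rules defines a set of rules for extracting links from pages: a LinkExtractor finds the matching URLs in each page, and the Rule returns them as Requests with the specified callback, so we skip writing the link-extraction step ourselves. The spider's name attribute sets the crawler's name, allowed_domains lists the domains it may visit, and start_urls gives the starting points. The parsing functions use the yield keyword, which makes them generators that Scrapy can iterate over.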

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from yunqiCrawl.items import YunqiBookListItem,YunqiBookDetialItem

class YunqiQqComSpider(CrawlSpider):
    name = 'yunqi.qq.com'
    allowed_domains = ['yunqi.qq.com']
    start_urls = ['http://yunqi.qq.com/bk/so2/n30p1']

    rules = (
        Rule(LinkExtractor(
             allow=r'/bk/so2/n30p\d+'),
             callback='parse_book_list',
             follow=True),
    )

    def parse_book_list(self, response):
        books = response.xpath('.//div[@class="book"]')
        for book in books:
            novelId = book\
                .xpath('./div[@class="book_info"]/h3/a/@id').extract_first()
            novelImageUrl = book\
                .xpath('./a/img/@src').extract_first()
            novelLink = book\
                .xpath('./div[@class="book_info"]/h3/a/@href').extract_first()
            novelTitle = book\
                .xpath('./div[@class="book_info"]/h3/a/text()').extract_first()
            novelInfos = book\
                .xpath('./div[@class="book_info"]/dl/dd[@class="w_auth"]')
            if len(novelInfos) > 4:
                novelAuthor = novelInfos[0].xpath('./a/text()').extract_first()
                novelTypeB = novelInfos[1].xpath('./a/text()').extract_first()
                novelStatus = novelInfos[2].xpath('./text()').extract_first()
                novelUpdateTime = novelInfos[3].xpath('./text()')\
                    .extract_first()
                novelWordsCount = novelInfos[4].xpath('./text()')\
                    .extract_first()
            else:
                novelAuthor = ''
                novelTypeB = ''
                novelStatus = ''
                novelUpdateTime = ''
                novelWordsCount = ''
            bookItem = YunqiBookListItem(
                novelId=novelId,
                title=novelTitle,
                link=novelLink,
                author=novelAuthor,
                status=novelStatus,
                updateTime=novelUpdateTime,
                wordsCount=novelWordsCount,
                novelType=novelTypeB,
                imageUrl=novelImageUrl
            )
            yield bookItem
            newRequest = scrapy.Request(
                url=novelLink,
                callback=self.parse_book_detail
            )
            # use the spider's logger instead of the Python 2 print statement
            self.logger.info('send request %s', novelLink)
            newRequest.meta['novelId'] = novelId
            yield newRequest

    def parse_book_detail(self, response):
        novelId = response.meta['novelId']
        tdlist = response.xpath('.//div[@class="num"]/table/tr/td')
        novelAllClick = tdlist[0]\
            .xpath('./text()').extract_first().split(u'：')[1]
        novelAllLike = tdlist[1]\
            .xpath('./text()').extract_first().split(u'：')[1]
        novelWeekLike = tdlist[2]\
            .xpath('./text()').extract_first().split(u'：')[1]
        bookDetialItem = YunqiBookDetialItem(
            novelId=novelId,
            allClick=novelAllClick,
            allLike=novelAllLike,
            weekLike=novelWeekLike
        )
        yield bookDetialItem

pipeline
Data storage can be done in an item pipeline. Since the data goes into MongoDB, the pipeline also manages the MongoDB connection, so a basic familiarity with MongoDB operations is needed. An item pipeline can define a special class method, from_crawler, which receives a crawler object and returns an instance of the pipeline class. The crawler object gives access to global information about the running crawl, such as the values in settings.py, and this method is where the configuration is read. open_spider runs when the spider is opened and close_spider runs when it is closed; in this example they open and close the MongoDB connection respectively. process_item is the main method for handling items, and the data is stored there.
```
import pymongo
from yunqiCrawl.items import YunqiBookListItem, YunqiBookDetialItem


class YunqicrawlPipeline(object):
    def __init__(self, mongo_db, mongo_uri):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'yunqi'),
            mongo_uri=crawler.settings.get('MONGO_URI')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
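        # insert_one replaces the deprecated Collection.insert
        self.db.bookInfo.insert_one(dict(item))
        # return the item so any later pipeline can still process it
        return item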
```
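Add the MongoDB information to settings.py.
Dealing with anti-crawling measures
Using a random User-Agent header makes the crawler harder to detect. Request headers have to be modified before the downloader fetches the page, and that is exactly the job of a downloader middleware. The middleware reads the User-Agent list from settings, picks one entry at random with the random module, and sets it on the request. The middleware is then enabled in settings.py.
The USER_AGENTS list is long; it is simply a list of User-Agent strings, written into settings.py.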
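A minimal sketch of those settings.py entries, assuming a MongoDB instance on localhost with the default port. MONGO_URI and MONGO_DATABASE are the keys read by from_crawler in the pipeline above; the URI value, the pipeline module path yunqiCrawl.pipelines, and the priority 300 are assumptions to adapt to your own setup.

```python
# settings.py (fragment)
MONGO_URI = 'mongodb://localhost:27017'  # assumed: local MongoDB, default port
MONGO_DATABASE = 'yunqi'                 # same default as in from_crawler

# Enable the pipeline; module path and priority are assumptions.
ITEM_PIPELINES = {
    'yunqiCrawl.pipelines.YunqicrawlPipeline': 300,
}
```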
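For illustration, the list could be shaped like this. The three strings are sample browser identifiers chosen here, not the entries from the original project, whose list is much longer.

```python
# settings.py (fragment): illustrative entries only
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 '
    '(KHTML, like Gecko) Version/10.1 Safari/603.1.30',
]
```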

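Write a downloader middleware that modifies the User-Agent of each request. Pay attention to from_crawler: through its crawler argument it can reach the crawler's global information, and it is commonly used to read configuration. The class below is added to the project's middlewares file.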
```
import random


class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))
```
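Then enable the middleware in settings.py. Scrapy enables some downloader middlewares by default; to disable one of them, it has to be stated in the configuration. Here the built-in User-Agent middleware is disabled and our own is used instead.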
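A sketch of the corresponding setting. Mapping a middleware to None is how Scrapy disables it; the module path yunqiCrawl.middlewares and the priority 410 for the custom class are assumptions.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # None disables Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # assumed module path and priority for the custom middleware
    'yunqiCrawl.middlewares.RandomUserAgent': 410,
}
```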

We can also set custom default headers for requests.

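To disable cookies, just say so in the configuration file.
Deduplication
Deduplicating through a Redis cache is fairly simple: configure the Redis server information, install scrapy_redis, and then add the deduplication setting to the configuration file.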
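Both options are plain Scrapy settings. The header values below are placeholders chosen for illustration, not the ones used in the original project.

```python
# settings.py (fragment)
# Headers sent with every request unless a request overrides them.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',  # assumed value
}

# Turn off the cookies middleware so no cookies are stored or sent.
COOKIES_ENABLED = False
```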
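The exact line is not shown in the text. With scrapy_redis, the duplicate filter is usually switched on like this, so treat it as a likely reconstruction rather than the author's exact configuration.

```python
# settings.py (fragment)
# Store request fingerprints in Redis instead of in process memory.
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```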

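Note that this deduplication method simply relies on the uniqueness of elements in a Redis set, so its efficiency is questionable once the number of requests grows large. For deduplication there is the dedicated Bloom filter algorithm, which can also be used on top of Redis; existing open-source projects on GitHub implement it and can be used directly.
Distributed crawling
[Figure: data flow of the original crawler, without a Redis queue]

[Figure: data flow of the crawler with Redis queue scheduling]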
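To make the idea concrete, here is a small in-memory Bloom filter sketch. It is not the Redis-backed implementation the text refers to; the class name, bit-array size, and hash construction (salted SHA-1) are choices made for this illustration. A Bloom filter never reports a seen URL as unseen, but it can occasionally report an unseen URL as seen.

```python
import hashlib


class BloomFilter:
    """In-memory Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-1 hash.
        data = value.encode('utf-8')
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(str(seed).encode('ascii') + data).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, value):
        # Set every bit position derived from the value.
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # The value may have been added only if all its bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


seen = BloomFilter()
seen.add('http://yunqi.qq.com/bk/so2/n30p1')
```

Memory use is fixed by size_bits regardless of how many URLs are added, which is the main advantage over storing every fingerprint in a set.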

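The request queue is stored on the Redis server and managed through it. The same project can therefore be deployed on several hosts that all use one Redis server, which gives a distributed crawler. Using Redis takes only a few steps. First, the Python redis module is required and can be installed with pip; scrapy_redis, the module that connects Scrapy to Redis, also has to be installed beforehand. After that, everything is configured in settings.py.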
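The configuration itself is not shown in the text. A typical scrapy_redis setup looks like the sketch below; the queue class name depends on the scrapy_redis version (older releases call it SpiderPriorityQueue), and the Redis host and port are assumptions for a local server.

```python
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'  # schedule requests through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # Redis-backed queue (name is version-dependent)
SCHEDULER_PERSIST = True  # keep queue and dedup state across restarts
REDIS_HOST = '127.0.0.1'  # assumed: Redis on the local machine
REDIS_PORT = 6379         # Redis default port
```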

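The first line selects the Redis scheduler and the second selects the Redis queue. The third line says that state is kept, so the crawl can be stopped and resumed later. The remaining lines give the Redis host information. The program needs both the MongoDB service and the Redis service to be running; finally, start the crawler from the command line with scrapy crawl yunqi.qq.com.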

Tip: the source code for this project comes from the book 《Python爬虫开发与项目实战》.