
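To recap the spiders from the earlier posts: in general, the Spider class defines how a given site (or group of sites) is crawled, including the crawling actions and how structured data (items) is extracted from page content. In other words, a Spider is where you define the crawling behavior and the parsing logic for a particular page (or set of pages).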

For a spider, the crawl loop goes roughly like this:
1. Requests are initialized from the initial URLs, each with a callback set. When a request finishes downloading, a response is generated and passed to that callback as an argument.
The spider's initial requests are obtained by calling start_requests(). start_requests() reads the URLs in start_urls and generates a Request for each one, with parse as the callback.
2. In the callback, you parse the returned (web page) content and return Item objects, Request objects, or an iterable containing both. Returned Requests are then handled by Scrapy, which downloads the content and calls the callback set on them (it may be the same function).
3. Inside the callback, you can use Selectors (or BeautifulSoup, lxml, or any parser you prefer) to analyze the page content and build items from the parsed data.
4. Finally, the items returned by the spider are stored in a database (handled by an Item Pipeline) or written to a file using Feed exports.

Although this loop applies to every kind of spider, Scrapy still ships several default spiders for different needs. They are introduced briefly below.
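To make the loop above concrete, here is a minimal pure-Python simulation of it. It does not use Scrapy at all: `FAKE_SITE`, the `Request` class and `crawl()` are hypothetical stand-ins written only for this illustration, and the "download" is just a dictionary lookup.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, outgoing links).
FAKE_SITE = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", []),
    "http://example.com/b": ("Page B", ["http://example.com/a"]),
}


class Request:
    """Stand-in for a request: a URL plus the callback that will parse it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(response):
    """Callback: yield one item per page plus a Request per outgoing link."""
    url, (title, links) = response
    yield {"url": url, "title": title}       # an "item"
    for link in links:
        yield Request(link, callback=parse)  # a follow-up request


def crawl(start_urls):
    queue = deque(Request(u, parse) for u in start_urls)  # step 1
    seen, items = set(), []
    while queue:
        request = queue.popleft()
        if request.url in seen:              # skip URLs already visited
            continue
        seen.add(request.url)
        response = (request.url, FAKE_SITE[request.url])  # pretend download
        for result in request.callback(response):         # steps 2 and 3
            if isinstance(result, Request):
                queue.append(result)         # schedule the new request
            else:
                items.append(result)         # step 4: hand off to storage
    return items


items = crawl(["http://example.com/"])
print([item["title"] for item in items])
```

The real framework does the same bookkeeping asynchronously and adds middleware, pipelines and feed exports around it.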

Built-in spider reference:
Scrapy provides several convenient generic spiders for you to subclass. They offer handy features for common crawling cases, such as following all links on a site according to certain rules, crawling from Sitemaps, or parsing XML/CSV feeds.
The main ones are: scrapy.Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider.
Below I mainly cover scrapy.Spider and CrawlSpider.
I. scrapy.Spider
Spider is the simplest spider. Every other spider must inherit from it (both the other spiders bundled with Scrapy and the ones you write yourself). Spider does not provide any special functionality. It simply requests the given start_urls/start_requests and calls the spider's parse method on each resulting response.

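name: a string that defines the spider's name. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. You can, however, create multiple instances of the same spider without any restriction. name is the most important spider attribute, and it is required. If the spider crawls a single domain, a common practice is to name the spider after the domain, with or without the TLD. For example, a spider that crawls mywebsite.com would usually be named mywebsite.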

allowed_domains: optional. A list of the domains the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in this list are not followed.
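As a rough illustration of what such a domain check amounts to, here is a small standard-library sketch. It is not Scrapy's actual OffsiteMiddleware code; `is_allowed` is a hypothetical helper, and it assumes that subdomains of an allowed domain should also pass.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["book.douban.com"]
print(is_allowed("https://book.douban.com/subject/1234/", allowed))   # True
print(is_allowed("https://movie.douban.com/subject/1234/", allowed))  # False
```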

start_urls: a list of URLs. When no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will be the ones listed here. Subsequent URLs are extracted from the data those pages return.

custom_settings: a dictionary of settings that override the project-wide configuration while this spider runs. It must be defined as a class attribute, because the settings are updated before the spider is instantiated.

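crawler: this attribute is set by the from_crawler() class method after the class is initialized, and it links to the Crawler object this spider instance is bound to.
Crawlers encapsulate many components of the project behind a single entry point (extensions, middlewares, signal managers and so on). See the Crawler API to learn more about them.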

start_requests(): this method must return an iterable containing the first Requests the spider will crawl. It is called when the spider starts crawling and no particular URLs are specified. When URLs are specified, make_requests_from_url() is called instead to create the Request objects. Scrapy calls this method only once, so it is safe to implement it as a generator. The default implementation generates Requests from the URLs in start_urls.
If you want to change the Requests used to start crawling a site, you can override this method. For example, if you need to log in with a POST request at startup, you can write:

def start_requests(self):
    return [scrapy.FormRequest("http://www.example.com/login",
               formdata={'user': 'john', 'pass': 'secret'},
               callback=self.logged_in)
           ]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass

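make_requests_from_url(url): this method takes a URL and returns a Request object to crawl. It is called by start_requests() when building the initial requests, and it is also used to turn URLs into requests. Unless overridden, it returns Requests with parse() as the callback and the dont_filter parameter enabled. (Later Scrapy releases deprecate this method in favor of overriding start_requests().)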

parse(response): the default method Scrapy uses to process downloaded responses when their requests do not specify a callback. parse is responsible for processing the response and returning the scraped data and/or more URLs to follow. The same requirement applies to every other Request callback: this method, like any other callback, must return an iterable of Request and/or Item objects.

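log(message[, level, component]): logs a message using the scrapy.log.msg() method. The spider's name attribute is included in the log automatically. See Logging for details. (In newer Scrapy releases scrapy.log has been replaced by the standard logging module and the spider's self.logger.)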

closed(reason): called when the spider closes. It is a shortcut for calling signals.connect() to listen for the spider_closed signal.

I won't go through a scrapy.Spider example in detail here, since the earlier posts were all built by subclassing scrapy.Spider.

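II. CrawlSpider
This is the spider most commonly used for crawling regular websites. It defines a set of rules that provide a convenient mechanism for following links. It may not fit your particular site or project perfectly, but it is suitable for many cases, so you can start from it and override some methods as needed. Of course, you can also implement your own spider.

Apart from the attributes inherited from Spider (which you must provide), it offers one new attribute:
rules: a list of one (or more) Rule objects. Each Rule defines a certain behavior for crawling the site. Rule objects are described below. If several rules match the same link, the first one is used, according to the order in which they are defined in this attribute.

parse_start_url(response): an overridable method that is called when the responses for the start_urls arrive. It parses the initial responses and must return an Item object, a Request object, or an iterable containing either.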

Crawling rules:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object. It defines how links are extracted from each crawled page.
callback is a callable or a string (in which case the spider method with that name is used). It is called for each link extracted by link_extractor. The callback receives a response as its first argument and returns a list of Item and/or Request objects (or subclasses of them).
cb_kwargs is a dict of keyword arguments to pass to the callback.
follow is a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links is a callable or a string (in which case the spider method with that name is used). It is called with the list of links extracted by link_extractor and is mainly used for filtering.
process_request is a callable or a string (in which case the spider method with that name is used). It is called for every request extracted by this rule and must return a request or None (it is used to filter requests).
restrict_xpaths: an XPath expression that works together with allow to filter links. There is also a similar restrict_cs
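The two filtering hooks are easiest to understand by their shapes: process_links receives the whole list of links and returns a (possibly shorter) list, while process_request receives one request and returns it or None. The sketch below shows just those shapes in plain Python. `Link` here is a namedtuple stand-in for Scrapy's link object, the request is a plain dict, and both hook functions are hypothetical examples.

```python
from collections import namedtuple

# Stand-in for Scrapy's Link object; only the two attributes used here.
Link = namedtuple("Link", ["url", "text"])


def drop_review_links(links):
    """process_links-style hook: receives the full list, returns a filtered list."""
    return [link for link in links if "/reviews" not in link.url]


def only_https(request):
    """process_request-style hook: return the request to keep it, or None to drop it."""
    return request if request["url"].startswith("https://") else None


links = [
    Link("https://book.douban.com/subject/1/", "Book 1"),
    Link("https://book.douban.com/subject/1/reviews", "Reviews"),
]
kept = drop_review_links(links)
print([link.text for link in kept])

print(only_https({"url": "http://example.com/"}))
```

In a real CrawlSpider these functions would be passed to Rule as process_links and process_request, either directly or as the names of spider methods.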
To sum up:
1. How CrawlSpider works:
CrawlSpider inherits from Spider, so it has all of Spider's methods.
First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and the responses are received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider defines parse to handle the response by calling self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True). _parse_response does different things depending on whether a callback is given and on the values of follow and self._follow_links. _requests_to_follow in turn takes the links that link_extractor (the LinkExtractor we passed in) finds in the page (link_extractor.extract_links(response)), post-processes them with process_links (user-defined, if needed), and creates a Request for each matching link. Each Request is then passed through process_request (also user-defined, if needed).
Here is the corresponding source:

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # First, if a callback was passed in, use it to parse the page and
    # yield the requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # Then, if following is enabled, use _requests_to_follow to find the
    # links in the response that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

2. How CrawlSpider picks up its rules:
CrawlSpider calls _compile_rules in its __init__ method. That method makes a shallow copy of each Rule in rules and resolves the callback, the link post-processor (process_links) and the request post-processor (process_request) for each of them.
The corresponding source:

def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
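The string-or-callable resolution done by get_method can be tried in isolation. The sketch below reproduces the same branching on a toy class. `Demo` and `resolve` are hypothetical names, and plain `str` is used in place of `six.string_types`, which assumes Python 3.

```python
class Demo:
    def parse_item(self, response):
        return "parsed %s" % response

    def resolve(self, method):
        # Same branching as get_method in _compile_rules.
        if callable(method):
            return method
        elif isinstance(method, str):
            return getattr(self, method, None)


demo = Demo()
by_name = demo.resolve("parse_item")      # string: looked up as a bound method
by_callable = demo.resolve(len)           # callable: returned unchanged
missing = demo.resolve("no_such_method")  # unknown name: None
unset = demo.resolve(None)                # None: falls through both branches, gives None

print(by_name("page"))
```

One consequence worth noting: a misspelled callback name silently resolves to None, so the rule ends up with no callback at all.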

Source of Rule:

class Rule(object):

        def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
            self.link_extractor = link_extractor
            self.callback = callback
            self.cb_kwargs = cb_kwargs or {}
            self.process_links = process_links
            self.process_request = process_request
            if follow is None:
                self.follow = False if callback else True
            else:
                self.follow = follow

The end result: the LinkExtractor you pass in is bound to link_extractor.
3. _parse_response handles responses that come with a callback. For a Rule that has a callback argument, the response is passed to the specified function; when there is no callback, this step is skipped:
cb_res = callback(response, **cb_kwargs) or ()
_requests_to_follow, on the other hand, sets self._response_downloaded as the callback of the requests it creates for the matching URLs found in the page:
r = Request(url=link.url, callback=self._response_downloaded)

Here is the full source of scrapy.spiders.CrawlSpider:

"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

Next, an example of using CrawlSpider together with rules:
Crawling Douban Books
1. First decide which data to scrape:
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanbookItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # book title
    images = scrapy.Field()  # cover image
    author = scrapy.Field()  # author
    press = scrapy.Field()  # publisher
    date = scrapy.Field()  # publication date
    page = scrapy.Field()  # number of pages
    price = scrapy.Field()  # price
    ISBN = scrapy.Field()  # ISBN
    score = scrapy.Field()  # Douban rating
    author_profile = scrapy.Field()  # about the author
    content_description = scrapy.Field()  # book description
    link = scrapy.Field()  # detail page URL

2. The main part, the spider itself:
doubanbooks.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DoubanbookItem
import re
import os
import urllib.request
from scrapy.http import HtmlResponse, Request
from scrapy.utils.project import get_project_settings  # scrapy.conf was removed in newer Scrapy releases
import random
import string

settings = get_project_settings()  # gives access to COOKIE defined in settings.py


class BookspiderSpider(CrawlSpider):
    name = 'bookSpider'
    allowed_domains = ['book.douban.com']
    cookie = settings['COOKIE']  # cookie to send along with the requests
    # generate a random "bid" cookie value
    cookies = "bid=%s" % "".join(random.sample(string.ascii_letters + string.digits, 11))
    start_urls = ['https://book.douban.com/tag/数据分析?start=0&type=T']
    rules = (
        # list page URLs ("?" is escaped so that it matches a literal question mark)
        Rule(LinkExtractor(allow=(r"tag/数据分析\?start=\d+&type=T")), follow=True),
        # detail page URLs
        Rule(LinkExtractor(allow=(r"subject/\d+/$")), callback="parse_item", follow=True)
    )

    # attach the cookiejar to the request built for each URL
    def request_question(self, request):
        return Request(request.url, meta={'cookiejar': 1}, callback=self.parse_item)

    # extract the book details from a detail page
    def parse_item(self, response):

        if response.status == 200:
            item = DoubanbookItem()
            # book title
            item["name"] = response.xpath("//div[@id='wrapper']/h1/span/text()").extract()[0].strip()
            # cover image
            src = response.xpath("//div[@id='mainpic']/a/img/@src").extract()[0].strip()
            file_name = "%s.jpg" % (item["name"])  # name the file after the book
            file_path = os.path.join("E:\\spider\\pictures\\douban_book\\book_img", file_name)  # build the local path of the image
            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent',
                                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
            urllib.request.install_opener(opener)
            urllib.request.urlretrieve(src, file_path)  # download the image at src and save it to file_path
            item["images"] = file_path
            # author
            if len(response.xpath("//div[@id='info']/span[1]/a/text()").extract()) > 0:
                authors = response.xpath("//div[@id='info']/span[1]/a/text()").extract()
                item["author"] = ",".join(author.strip() for author in authors).strip()
            else:
                authors = response.xpath("//div[@id='info']/a[1]/text()").extract()
                item["author"] = ",".join(author.strip() for author in authors).strip()
            # publisher ("无" means "none")
            try:
                item["press"] = response.xpath("//div[@id='info']").re(r'出版社:</span> (.+)<br>\n')[0].strip()
            except IndexError:
                item["press"] = "无"
            # publication year
            try:
                item["date"] = response.xpath("//div[@id='info']").re(r'出版年:</span> (.+)<br>\n')[0].strip()
            except IndexError:
                item["date"] = "无"
            # number of pages
            try:
                page_str = response.xpath("//div[@id='info']").re(r'页数:</span> (.+)<br>\n')[0].strip()
                item["page"] = int(re.findall(r'\d+', page_str)[0])
            except IndexError:
                item["page"] = "无"
            # list price
            try:
                item["price"] = response.xpath("//div[@id='info']").re(r'定价:</span> (.+)<br>\n')[0].strip()
            except IndexError:
                item["price"] = "无"
            # ISBN
            try:
                item["ISBN"] = response.xpath("//div[@id='info']").re(r'ISBN:</span> (.+)<br>\n')[0].strip()
            except IndexError:
                item["ISBN"] = "无"
            # Douban rating

            if len(response.xpath("//div[@class='rating_self clearfix']/strong/text()").extract()[0].strip()) > 0:
                item["score"] = float(response.xpath("//div[@class='rating_self clearfix']/strong/text()").extract()[0].strip())
            else:
                item["score"] = "评价人数不足"  # "not enough ratings"

            # book description

            if len(response.xpath('//span[@class="all hidden"]/div/div[@class="intro"]/p')) > 0:
                contents = response.xpath('//span[@class="all hidden"]/div/div[@class="intro"]/p/text()').extract()
                item["content_description"] = "\n".join(content.strip() for content in contents)
            elif len(response.xpath('//div[@id="link-report"]/div/div[@class="intro"]/p')) > 0:
                contents = response.xpath('//div[@id="link-report"]/div/div[@class="intro"]/p/text()').extract()
                item["content_description"] = "\n".join(content.strip() for content in contents)
            else:
                item["content_description"] = "无"
            # about the author

            profiles_tag = response.xpath('//div[@class="intro"]')[-1]
            profiles = profiles_tag.xpath('p/text()').extract()
            if len(profiles) > 0:
                item["author_profile"] = "\n".join(profile.strip() for profile in profiles)
            else:
                item["author_profile"] = "无"

            # detail page URL
            item["link"] = response.url

            return item

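The main thing to watch here is the use of cookies. Without a cookie it is very easy to get banned. Alternatively you can use IP proxies; anything works as long as you don't get banned.
The rest is commented in the code, so I won't explain it here.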
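The regex-based field extraction used in parse_item can be tried outside Scrapy. Below is a small sketch using only the `re` module. The HTML fragment is a made-up sample shaped like the markup those patterns expect (not copied from Douban), and `extract_field` is a hypothetical helper that folds the try/except fallback into one function.

```python
import re

# Hypothetical fragment shaped like the markup the spider's regexes expect.
info_html = (
    '<div id="info">\n'
    '<span class="pl">出版社:</span> Example Press<br>\n'
    '<span class="pl">页数:</span> 320页<br>\n'
    '</div>'
)


def extract_field(html, label, default="无"):
    """Return the text between '<label>:</span> ' and '<br>', or default if absent."""
    match = re.search(label + r':</span> (.+)<br>\n', html)
    return match.group(1).strip() if match else default


press = extract_field(info_html, "出版社")                               # publisher
pages = int(re.findall(r"\d+", extract_field(info_html, "页数"))[0])     # page count
price = extract_field(info_html, "定价")                                 # missing field

print(press, pages, price)
```

Because the pattern ends in `<br>\n`, it depends on the exact line layout of the page source, so it may need adjusting if the markup changes.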
3. Storing the data:
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
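import logging

import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # scrapy.conf was removed in newer Scrapy releases
logger = logging.getLogger(__name__)  # scrapy.log was replaced by the standard logging module


class DoubanbookPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any field is empty. Iterating an Item yields field
        # names, so the value has to be looked up explicitly.
        for field in item:
            if not item[field]:
                raise DropItem("Missing {0}!".format(field))
        # insert_one replaces the deprecated Collection.insert
        self.collection.insert_one(dict(item))
        logger.debug("Item added to MongoDB database!")
        return item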

Here I chose to store the data in MongoDB (via pymongo). Any other database works too; it's up to you.
4. Settings:
settings.py

# -*- coding: utf-8 -*-
import random
from useragent import Agent

# Scrapy settings for DoubanBook project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanBook'

SPIDER_MODULES = ['DoubanBook.spiders']
NEWSPIDER_MODULE = 'DoubanBook.spiders'

ITEM_PIPELINES = {
    'DoubanBook.pipelines.DoubanbookPipeline': 300,
}

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'book_数据分析'

# Crawl responsibly by identifying yourself (and your website) on the user-agent


USER_AGENT = '%s' % random.choice(Agent.user_agent)
# USER_AGENT = 'DoubanBook (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.2
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

Results of the run:

Book cover images

Book detail page data

I crawled a bit over 1,000 books. It felt somewhat faster than scrapy.Spider, but that also depends on your network, hardware and so on.
A few URLs failed during the crawl because of request timeouts, so it may help to set the timeout a little longer.
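One way to lengthen the timeout is Scrapy's DOWNLOAD_TIMEOUT setting, which is the per-request download timeout in seconds (180 by default). The value below is only an arbitrary illustration, not a figure from this crawl.

```python
# settings.py
# Per-request download timeout in seconds (Scrapy's default is 180).
# 300 is an arbitrary example value; tune it to your own network.
DOWNLOAD_TIMEOUT = 300
```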

A few bonus resources:

Simulating a Zhihu login with Scrapy
Crawling the Lagou job site with CrawlSpider
