overview.html

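Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.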

Scrapy is not limited to extracting data from websites: it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

## Running a simple spider

Here is the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Yield one dict per quote block found on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the pagination link, if there is one, reusing this callback.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

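Put this code in a text file, name it `quotes_spider.py`, and run the spider using the `runspider` command:

`scrapy runspider quotes_spider.py -o quotes.json`

When this finishes you will have a `quotes.json` file listing all the quotes in JSON format, with text and author, looking like this (reformatted here for readability):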

```json
[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
    "author": "Groucho Marx",
    "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
    "author": "Steve Martin",
    "text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]
```
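The `\u201c` and `\u201d` sequences in the output are JSON escapes for curly quotation marks, not corrupted text. A minimal sketch of reading such a record back with the standard library follows; the inline `sample` string is an assumption that stands in for the contents of `quotes.json`.

```python
import json

# One record copied from the sample output above; in practice you would
# read the text from quotes.json instead of an inline string.
sample = r'[{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}]'

quotes = json.loads(sample)
first = quotes[0]

# json.loads decodes the \u201c / \u201d escapes into real curly quotes.
print(first["author"])
print(first["text"])
```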

## What just happened?
When you ran the command `scrapy runspider quotes_spider.py`, Scrapy looked for a spider definition in the file and ran it through its crawler engine.

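The crawl starts by making requests to the URLs given in the `start_urls` attribute (in this case only the URL of the humor tag on the quotes site) and calls the default callback method `parse`, passing the response as its argument. In the `parse` callback we loop through the quote elements using a CSS selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request that uses the same `parse` method as its callback.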
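The next-page link found in the page is usually a relative URL, which is why the spider passes it through `response.urljoin` before requesting it. The sketch below shows roughly the same resolution step using the standard library's `urljoin`; the `next_href` value is an assumed example of what the `href` attribute might contain, not something taken from the site.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (from start_urls).
page_url = "http://quotes.toscrape.com/tag/humor/"

# Assumed value of the href attribute of the "next" link.
next_href = "/tag/humor/page/2/"

# Resolve the relative link against the page URL to get an absolute URL.
next_page = urljoin(page_url, next_href)
print(next_page)  # http://quotes.toscrape.com/tag/humor/page/2/
```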

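Here you can see one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. Scrapy does not need to wait for one request to finish and be processed before it sends another request or does other work in the meantime. It also means that other requests keep going even if some request fails or an error occurs while handling it.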
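The idea can be illustrated with a small, self-contained sketch. It uses `asyncio` purely as an analogy and does not show how Scrapy is implemented internally; the `fetch` and `crawl` functions, the page names, and the delays are all hypothetical, and the "download" is simulated with a sleep.

```python
import asyncio


async def fetch(url, delay, fail=False):
    # Simulated download: sleep instead of doing real network I/O.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"failed to fetch {url}")
    return f"response from {url}"


async def crawl():
    tasks = [
        fetch("page-1", 0.03),
        fetch("page-2", 0.01, fail=True),
        fetch("page-3", 0.02),
    ]
    # All three "requests" are in flight at once. return_exceptions=True
    # returns a failure as a value instead of aborting the other tasks.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(crawl())
for result in results:
    print(repr(result))
```

The second request fails, yet the first and third still complete and return their responses.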

While this lets you crawl very fast (sending multiple requests concurrently), Scrapy also lets you make your crawler more polite through [a few settings](https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings-ref). You can set a delay between requests, limit the number of concurrent requests per domain or per IP, or use [an auto-throttling extension](https://doc.scrapy.org/en/latest/topics/autothrottle.html#topics-autothrottle) that tries to work these out automatically.
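As a rough illustration, a project's `settings.py` might contain something like the fragment below. The setting names are Scrapy settings, but the values are arbitrary assumptions chosen for the example and should be tuned for the site being crawled.

```python
# settings.py (fragment); all values here are illustrative assumptions.

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Allow at most four requests in flight per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let the AutoThrottle extension adjust the delay based on response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```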

#### Note
This example exports the results to a JSON file, but you can also use XML or CSV format, or write an item pipeline to store the items in a database.
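A minimal sketch of such a pipeline is shown below, assuming SQLite as the database. The class name `SQLiteQuotesPipeline`, the `quotes.db` path, and the table layout are hypothetical; `open_spider`, `process_item`, and `close_spider` are the hook methods Scrapy calls on item pipelines.

```python
import sqlite3


class SQLiteQuotesPipeline:
    """Hypothetical pipeline that stores each scraped quote in SQLite."""

    def __init__(self, db_path="quotes.db"):
        # db_path is an assumed default; ":memory:" works for experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the database, ensure the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item yielded by the spider.
        self.conn.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (item["text"], item["author"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called when the spider finishes: release the connection.
        self.conn.close()
```

To use a pipeline like this in a project, it would be registered in the `ITEM_PIPELINES` setting, for example under a module path such as `myproject.pipelines.SQLiteQuotesPipeline` (the path is an assumption about how the project is laid out).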

Scrapy at a glance: https://doc.scrapy.org/en/latest/intro/overview.html
