Development environment:
Python 3.6.0 (the latest release at the time of writing)
Scrapy 1.3.2 (the latest release at the time of writing)

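Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to as just an "Item Pipeline") is a Python class that implements a simple method. It receives an item, performs an action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed.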

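Typical uses of item pipelines are:

Cleaning HTML data
Validating scraped data (checking that the items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped item in a database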
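
As a rough illustration of the first use in the list, here is a minimal sketch of a pipeline that cleans HTML out of a text field. The class name HtmlCleanupPipeline and the description field are assumptions made up for this example, and the regex-based tag stripping is deliberately naive; it uses only the standard library.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


class HtmlCleanupPipeline(object):
    """Hypothetical pipeline: strips tags from the 'description' field."""

    def process_item(self, item, spider):
        raw = item.get("description")
        if raw:
            text = TAG_RE.sub("", raw)   # drop the tags
            text = html.unescape(text)   # turn entities such as &amp; into characters
            # Collapse runs of whitespace left behind by the removed markup.
            item["description"] = " ".join(text.split())
        return item
```

For markup that is more than trivially nested, a real HTML parser would likely be a better choice than a regular expression.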

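Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:
process_item(self, item, spider)

This method is called for every item pipeline component. process_item() must do one of the following: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.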

Parameters:

item (Item object or dict) - the scraped item
spider (Spider object) - the spider that scraped the item
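
To make the contract above concrete, here is a minimal sketch of a pipeline that either returns the item or raises DropItem. The class name RequiredFieldsPipeline and the field names title and url are assumptions for illustration. The try/except around the import is only there so the sketch also runs where Scrapy is not installed; in a real project the plain import is enough.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline(object):
    """Hypothetical pipeline: drops items that lack a required field."""

    required_fields = ("title", "url")  # assumed field names

    def process_item(self, item, spider):
        missing = [f for f in self.required_fields if not item.get(f)]
        if missing:
            # Raising DropItem stops any later pipeline from seeing the item.
            raise DropItem("Missing fields %s in %r" % (missing, item))
        # Returning the item passes it on to the next component.
        return item


# Call the method by hand, the way Scrapy would for each scraped item.
pipeline = RequiredFieldsPipeline()
kept = pipeline.process_item({"title": "A", "url": "http://example.com"}, spider=None)

try:
    pipeline.process_item({"title": "B"}, spider=None)
    dropped = False
except DropItem:
    dropped = True
```

Passing spider=None works here only because this sketch never touches the spider argument.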

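Additionally, they may also implement the following methods:

open_spider(self, spider)
This method is called when the spider is opened.

Parameters:

spider (Spider object) - the spider that was opened

close_spider(self, spider)
This method is called when the spider is closed.

Parameters:

spider (Spider object) - the spider that was closed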
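
The two hooks above bracket the lifetime of a spider run, which makes them a natural place to set up and tear down per-run state. The following is a small sketch under assumptions: ItemCountPipeline is a made-up class, and a SimpleNamespace stands in for a real Spider so that the call order can be shown without running a crawl.

```python
from types import SimpleNamespace


class ItemCountPipeline(object):
    """Hypothetical pipeline: counts the items one spider run produces."""

    def open_spider(self, spider):
        # Called once when the spider starts: set up per-run state.
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: summarize the run.
        self.summary = "%s scraped %d items" % (spider.name, self.count)


# Drive the hooks by hand in the order Scrapy would call them.
spider = SimpleNamespace(name="demo")  # stand-in for a real Spider
pipeline = ItemCountPipeline()
pipeline.open_spider(spider)
for item in [{"id": 1}, {"id": 2}, {"id": 3}]:
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)
```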

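from_crawler(cls, crawler)
If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for a pipeline to access them and hook its functionality into Scrapy.

Parameters:

crawler (Crawler object) - the crawler that uses this pipeline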
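
As a sketch of the from_crawler() pattern, the pipeline below reads one value from the settings and passes it to its constructor. The class TruncatePipeline, the title field, and the setting name TITLE_MAX_LENGTH are all hypothetical. A plain dict stands in for crawler.settings, which is enough here because only get() is used.

```python
from types import SimpleNamespace


class TruncatePipeline(object):
    """Hypothetical pipeline: shortens the 'title' field to a maximum length."""

    def __init__(self, max_length):
        self.max_length = max_length

    @classmethod
    def from_crawler(cls, crawler):
        # TITLE_MAX_LENGTH is a made-up setting name; 80 is its fallback.
        return cls(max_length=crawler.settings.get("TITLE_MAX_LENGTH", 80))

    def process_item(self, item, spider):
        title = item.get("title")
        if title:
            item["title"] = title[: self.max_length]
        return item


# A plain dict stands in for crawler.settings, since both offer get().
fake_crawler = SimpleNamespace(settings={"TITLE_MAX_LENGTH": 5})
pipeline = TruncatePipeline.from_crawler(fake_crawler)
result = pipeline.process_item({"title": "Hello, world"}, spider=None)
```

In a real project, crawler.settings.getint() is probably the safer call for numeric settings, since values passed on the command line arrive as strings.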

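Item pipeline examples

Price validation and dropping items with no price

Let's take a look at the following hypothetical pipeline, which adjusts the price attribute of those items that do not include VAT (price_excludes_vat attribute), and drops those items that do not contain a price: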

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

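    def process_item(self, item, spider):
        # .get() avoids a KeyError when a field was never set on the item.
        if item.get('price'):
            if item.get('price_excludes_vat'):
                # The price is net of VAT: add the tax on top.
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)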

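Writing items to a JSON file
The following pipeline stores all scraped items (from all spiders) in a single items.jl file, containing one item per line, serialized in JSON format: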

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Text mode: json.dumps() returns str, which a 'wb' file rejects on Python 3.
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Note

The purpose of JsonWriterPipeline is only to introduce how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use Feed exports.
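
For reference, here is a sketch of what the Feed exports route could look like in the Scrapy 1.3 release this tutorial targets: two lines in the project's settings.py. FEED_FORMAT and FEED_URI are the setting names used by that release line; newer Scrapy versions configure feeds differently, so treat this fragment as tied to the version stated at the top.

```python
# settings.py (fragment)
FEED_FORMAT = "jsonlines"   # one JSON object per line
FEED_URI = "items.jl"       # output location; a local file here
```

For a single run, the same output can be requested from the command line with the -o option of scrapy crawl, for example -o items.jl.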

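Writing items to MongoDB

In this example we use pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings (MONGO_URI and MONGO_DATABASE); the MongoDB collection name is set by the collection_name class attribute ('scrapy_items' in the code below).

The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly: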

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert() is deprecated in pymongo 3; insert_one() is its replacement.
        self.db[self.collection_name].insert_one(dict(item))
        return item

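Taking a screenshot of an item

This example demonstrates how to return a Deferred from the process_item() method. It uses Splash to render a screenshot of the item URL. The pipeline makes a request to a locally running Splash instance. After the request has been downloaded and the Deferred callback fires, it saves the screenshot to a file and adds the filename to the item.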

import scrapy
import hashlib
from urllib.parse import quote


class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

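    def return_item(self, response, item):
        # addBoth() also calls this on errors, where `response` is a
        # twisted Failure that has no `status` attribute.
        if getattr(response, "status", None) != 200:
            # Error happened, return item.
            return item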

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
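
The two string manipulations in the pipeline above can be tried in isolation. This is a small sketch with a made-up item URL; it only shows how the Splash request URL and the screenshot filename are derived, and does not contact Splash.

```python
import hashlib
from urllib.parse import quote

SPLASH_URL = "http://localhost:8050/render.png?url={}"

item_url = "http://example.com/page?id=1"  # made-up item URL

# quote() escapes ':', '?' and '=' but leaves '/' alone by default.
screenshot_url = SPLASH_URL.format(quote(item_url))

# An MD5 hex digest is always 32 characters, so filenames have a fixed length.
filename = "{}.png".format(hashlib.md5(item_url.encode("utf8")).hexdigest())
```

Hashing the URL gives a filename that is safe for the filesystem and stable across runs, at the cost of not being human-readable. If the slashes should be escaped as well, quote(item_url, safe="") does that.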

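Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id: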

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

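Activating an item pipeline component

To activate an item pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example: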

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

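The integer values you assign to classes in this setting determine the order in which they run: items go through the classes from lower values to higher values. It is customary to define these numbers in the 0-1000 range.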
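
The ordering rule can be illustrated in plain Python. This is not Scrapy's internal code, only a sketch of the rule: sort the class paths by their integer value, lowest first. The dict is deliberately written out of order to show that the position in the dict does not matter.

```python
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}

# Sort the class paths by their integer value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Here PricePipeline (300) runs before JsonWriterPipeline (800), so items without a price are dropped before anything is written to the file.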

Scrapy Crawler Beginner Tutorial, Part 9: Item Pipeline
