Installing the Scrapy framework

pip3 install Scrapy

Scrapy architecture diagram (green lines show the data flow):

[Image: Scrapy architecture diagram]

Scrapy Engine: coordinates communication between the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data among them.
Scheduler: receives Requests from the engine, orders and enqueues them, and hands them back when the engine asks for the next one.
Downloader: downloads every Request sent by the engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
Spider: processes the Responses, extracts the data needed for the Item fields, and submits URLs to follow up to the engine, which sends them back into the Scheduler.
Item Pipeline: post-processes the Items produced by the Spider (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: components that let you customize and extend the download step.
Spider Middlewares: components that let you customize and intercept the traffic between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
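
The division of labor above can be sketched as a toy, synchronous loop in plain Python. This is only an illustration of the data flow, not how Scrapy is implemented: real Scrapy is asynchronous, the middlewares are left out, and every name here (downloader, spider, pipeline, engine, the fake pages) is made up for the sketch.

```python
from collections import deque

def downloader(url):
    # Stand-in for a real HTTP fetch: returns a fake "response" dict.
    pages = {
        "page1": {"links": ["page2"], "data": "first"},
        "page2": {"links": [], "data": "second"},
    }
    return {"url": url, **pages[url]}

def spider(response):
    # Yields one item (a dict) and then any follow-up requests (plain strings).
    yield {"item": response["data"], "source": response["url"]}
    for link in response["links"]:
        yield link

def pipeline(item, store):
    # Post-processing step: here it only stores the item.
    store.append(item)

def engine(start_urls):
    scheduler = deque(start_urls)  # queue of pending requests
    seen = set(start_urls)         # simple duplicate filter
    store = []
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)
        for result in spider(response):
            if isinstance(result, dict):
                pipeline(result, store)   # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # new requests go back to the scheduler
    return store

items = engine(["page1"])
print(items)
```

The engine is the only part that talks to every other component, which mirrors the role described in the list above.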

Creating a project (scrapy startproject)

Before you start crawling, you must create a new Scrapy project. Change into the directory where you want the project to live and run:
scrapy startproject myspider

Creating a spider file

scrapy genspider jobbole jobbole.com

About the yield keyword

References: https://blog.csdn.net/u013205877/article/details/70332612 https://www.ibm.com/developerworks/cn/opensource/os-cn-python-yield/

Put simply, yield turns a function into a generator. A function that contains yield is no longer an ordinary function: calling it returns a generator object instead of running the body. Each iteration runs the body up to the next yield and returns that value. On the following iteration, execution resumes at the statement after the yield, with the local variables exactly as they were when it paused, and continues until it reaches yield again.

In plain terms: when execution reaches a yield statement, the function pauses and returns the value of the expression after yield. The next time a value is requested, it resumes from where it paused, and this repeats until the function finishes.
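
A minimal sketch of that pause-and-resume behavior, using a made-up countdown function (not from the original project):

```python
def countdown(n):
    print("generator started")
    while n > 0:
        yield n      # pause here and hand n to the caller
        n -= 1       # execution resumes here on the next request
    print("generator finished")

gen = countdown(2)     # nothing in the body has run yet
first = next(gen)      # runs up to the first yield
second = next(gen)     # resumes after the yield, loops, yields again
remaining = list(gen)  # runs to the end of the function

print(first, second, remaining)
```

This is why a Scrapy callback can yield many items and requests from a single method: the engine pulls the values from the generator one at a time.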

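settings.py reference

# Where Scrapy looks for spiders, and where genspider creates new ones
SPIDER_MODULES = ['ziruproject.spiders']
NEWSPIDER_MODULE = 'ziruproject.spiders'
# User agent; usually set so that requests look like they come from a browser
USER_AGENT = ''
# Whether to obey robots.txt: True obeys it, False does not
# (the settings.py generated by startproject sets True; Scrapy's built-in default is False)
ROBOTSTXT_OBEY = True
# Maximum number of concurrent requests handled by the Scrapy downloader. Default: 16
CONCURRENT_REQUESTS = 16
# Download delay in seconds, used to throttle the request rate. Default: 0
DOWNLOAD_DELAY = 0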

Scrapy example

Using 'http://chinaz.com/' as the example
(downloading the project's images and persisting the scraped data)

chinaz.py

# -*- coding: utf-8 -*-
import scrapy
from chinaz.items import ChinazprojectItem, ChinazprojectWebInfoItem

class ChinazSpider(scrapy.Spider):
    # Spider name
    name = 'china'
    # Domains the spider is allowed to crawl
    allowed_domains = ['chinaz.com']
    # Start URLs
    start_urls = ['http://top.chinaz.com/hangyemap.html']

    # Per-spider settings; these override the matching entries in settings.py
    custom_settings = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }

    def parse(self, response):
        """
        In the parse callback:
        step1: extract the target data
        step2: collect new URLs to follow
        :param response: the response to the request
        :return:
        """
        print(response.status)
        # response.xpath(): query with XPath; returns a SelectorList
        # response.css(): query with CSS selectors; returns a SelectorList
        # extract(): serializes the selected nodes to Unicode strings
        # step1: extract the target data
        # Get the list of categories
        tags = response.xpath('//div[@class="Taright"]/a')
        # tags = response.css('.Taright a')
        for tag in tags:
            # Instantiate an item to hold the data
            tag_item = ChinazprojectItem()
            # Category name
            # tagName = tag.xpath('./text()')[0].extract()
            tagName = tag.xpath('./text()').extract_first('')
            tag_item['tagName'] = tagName
            # The same value via CSS (text)
            # tagName = tag.css('::text').extract_first('')

            # URL of the category's first page
            # first_url = tag.xpath('./@href')[0].extract()
            first_url = tag.xpath('./@href').extract_first('')
            tag_item['firsturl'] = first_url
            # The same value via CSS (attribute)
            # first_url = tag.css('::attr(href)').extract_first('')
            # print(tag_item)

            # Hand the item to the pipelines
            yield tag_item
            # http://top.chinaz.com/hangye/index_yule_yinyue.html
            '''
             url : URL to request
             callback=None, : callback invoked on a successful response
             method='GET', : HTTP method, GET by default
             headers=None, : request headers, a dict
             cookies=None, : cookies (e.g. to simulate a logged-in user), a dict
             meta=None, : data passed along to the callback, a dict
             encoding='utf-8', : encoding
             dont_filter=False, : whether to bypass the duplicate filter; the default False means duplicates are filtered
             errback=None, : callback invoked when the request fails
            '''
            # Skip links without an href; Request raises ValueError on an empty URL
            if first_url:
                yield scrapy.Request(first_url,callback=self.parse_tags_page)

    def parse_tags_page(self,response):
        '''
        Parse the site listings on a category page
        :param response: the response
        :return:
        '''
        print('Category page request', response.status, response.url)
        # List of sites
        webInfos = response.xpath('//ul[@class="listCentent"]/li')
        for webinfo in webInfos:
            web = ChinazprojectWebInfoItem()
            # The leading "." keeps each query relative to the current <li>;
            # a bare "//" searches the whole page and returns the first site every time
            # Cover image
            web['coverImage'] = webinfo.xpath('.//div[@class="leftImg"]/a/img/@src').extract_first('')
            # web['coverImages'] = ['']
            # Title
            web['title'] = webinfo.xpath('.//div[@class="CentTxt"]/h3/a/text()').extract_first('')
            # Domain
            web['domenis'] = webinfo.xpath('.//div[@class="CentTxt"]/h3/span/text()').extract_first('')
            # Weekly rank (XPath positions start at 1, so p[0] matches nothing;
            # p[1] assumes the first <p> was intended)
            web['weekRank'] = webinfo.xpath('.//div[@class="CentTxt"]/div[@class="RtCPart clearfix"]/p[1]//a/text()').extract_first('')
            # Number of backlinks
            web['ulink'] = webinfo.xpath('.//div[@class="CentTxt"]/div[@class="RtCPart clearfix"]//p[3]//a/text()').extract_first('')
            # Site description
            web['info'] = webinfo.xpath('.//div[@class="CentTxt"]/p/text()').extract_first('')
            # Score
            web['score'] = webinfo.xpath('.//div[@class="RtCRateCent"]/span/text()').extract_first('')
            # Rank
            web['rank'] = webinfo.xpath('.//div[@class="RtCRateCent"]/strong/text()').extract_first('')
            # print(web)
            yield web

        # Request the other pages of this category
        # (outside the loop above, so it runs once per page rather than once per site)
        next_urls = response.xpath('//div[@class="ListPageErap"]/a/@href').extract()[1:]
        for next_url in next_urls:
            # urljoin turns a relative URL into an absolute one
            next_url = response.urljoin(next_url)

            yield scrapy.Request(next_url,callback = self.parse_tags_page)
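
The pagination step relies on response.urljoin to complete partial links. The standard library's urllib.parse.urljoin applies the same resolution rules against a base URL, so the behavior can be tried without running a spider. The href values below are hypothetical examples, not links taken from the real site:

```python
from urllib.parse import urljoin

# URL of the page the links were found on (taken from the comment in parse)
base = "http://top.chinaz.com/hangye/index_yule_yinyue.html"

# Hypothetical href values of the kinds a page might contain
hrefs = [
    "index_yule_yinyue_2.html",    # relative to the current directory
    "/hangye/index_yule.html",     # relative to the site root
    "//top.chinaz.com/hangye/",    # protocol-relative
    "http://example.com/a",        # already absolute
]

absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

An already absolute URL passes through unchanged, so joining every extracted link is a safe default.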

items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class ChinazprojectItem(scrapy.Item):
    '''
    Stores a site category
    '''
    # Category name
    tagName = scrapy.Field()
    # URL of the category's first page
    firsturl = scrapy.Field()

    # Named get_insert_sql_data to match the call in pipelines.py
    def get_insert_sql_data(self, dataDict):
        '''
        step1: build the SQL statement
        step2: return the data to store
        :param dataDict:
        :return:
        '''
        # Parameterized INSERT: column names come from the item's keys,
        # with one %s placeholder per value
        sql = """
                INSERT INTO tags(%s)
                VALUES (%s)
                """ % (
            ','.join(dataDict.keys()),
            ','.join(['%s'] * len(dataDict))
        )
        # Values to store in the database
        data = list(dataDict.values())
        return sql, data

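class ChinazprojectWebInfoItem(scrapy.Item):
    # Cover image
    coverImage = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Domain
    domenis = scrapy.Field()
    # Weekly rank
    weekRank = scrapy.Field()
    # Number of backlinks
    ulink = scrapy.Field()
    # Site description
    info = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Rank
    rank = scrapy.Field()
    # Local path of the downloaded image (key matches item['localImagepath'] in pipelines.py)
    localImagepath = scrapy.Field()

    def get_insert_sql_data(self, dataDict):
        # Same pattern as ChinazprojectItem; the MySQL pipeline calls this on every item.
        # The table name 'webinfo' follows the commented-out code in pipelines.py.
        sql = "INSERT INTO webinfo(%s) VALUES (%s)" % (
            ','.join(dataDict.keys()),
            ','.join(['%s'] * len(dataDict))
        )
        return sql, list(dataDict.values())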
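
To see what the SQL-building method in ChinazprojectItem produces, here is the same string-building step applied to a plain dict. This is a standalone sketch: build_insert and the sample values are made up for illustration, and no database is involved.

```python
def build_insert(table, data):
    # Column names come from the dict keys; each value gets a %s placeholder
    # that the database driver fills in when the statement is executed.
    columns = ", ".join(data.keys())
    placeholders = ", ".join(["%s"] * len(data))
    sql = "INSERT INTO {} ({}) VALUES ({})".format(table, columns, placeholders)
    return sql, list(data.values())

sql, params = build_insert(
    "tags",
    {"tagName": "Music", "firsturl": "http://example.com/music"},
)
print(sql)
print(params)
```

The values are passed separately and escaped by the driver, but the table and column names are interpolated directly into the string, so they should only ever come from trusted code such as the item's own field names.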

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import mysql.connector as c
import pymongo
# scrapy.contrib was removed from Scrapy; ImagesPipeline now lives in scrapy.pipelines.images
from scrapy.pipelines.images import ImagesPipeline
import scrapy
from chinaz.items import ChinazprojectWebInfoItem, ChinazprojectItem
from scrapy.utils.project import get_project_settings
import os

images_store = get_project_settings().get('IMAGES_STORE')

class ChinazProjectImagePipeline(ImagesPipeline):
    # Override two methods
    def get_media_requests(self, item, info):
        '''
        Build download requests from the item's image URL
        :param item:
        :param info:
        :return:
        '''
        if isinstance(item,ChinazprojectWebInfoItem):
            # Image URL (this code assumes the src is protocol-relative, so it prepends the scheme)
            image_url = 'http:' + item['coverImage']
            print('Got image URL', image_url)
            yield scrapy.Request(image_url)
            # If item['coverImage'] held a list of image URLs instead:
            # image_urls = ['http:' + x for x in item['coverImage']]
            # return [scrapy.Request(x) for x in image_urls]

    def item_completed(self, results, item, info):
        '''
        Callback invoked after the image downloads finish
        :param results: [(True (whether the download succeeded), {'path': 'path the image was stored at', 'url': 'image URL', 'checksum': 'MD5 hash of the image contents'})]
        :param item:
        :param info:
        :return:
        '''
        if isinstance(item,ChinazprojectWebInfoItem):
            paths = [result['path'] for status,result in results if status]
            print('Images downloaded', paths)
            if len(paths) > 0:
                print('Image fetched')
                # Rename the image with os.rename
                os.rename(images_store + '/' + paths[0],images_store + '/' + item['title'] + '.jpg')
                image_path = images_store + '/' + item['title'] + '.jpg'
                print('Image path after renaming', image_path)
                item['localImagepath'] = image_path
            else:
                # No image was fetched, so drop this item
                from scrapy.exceptions import DropItem
                raise DropItem('No image fetched; dropping item')
        # Return the item in every case so that items of other types reach the next pipeline
        return item

# class ChinazprojectPipeline(object):
#
#     def __init__(self):
#         """
#         Initializer
#         """
#         # self.file = open('chainz.json','a')
#         # Create the database connection
#         self.client = c.Connect(
#             host = '127.0.0.1', user = 'root', password = '123456',
#             database = 'chinaz', port = 3306, charset='utf8'
#         )
#         # Create a cursor
#         self.cursor = self.client.cursor()
#
#     def open_spider(self,spider):
#         """
#         Called once when the spider starts
#         :param spider:
#         :return:
#         """
#         print('Spider opened')
#
#     def process_item(self, item, spider):
#         """
#         This method is required; every item yielded by
#         the spider passes through it
#         :param item: the item passed from the spider
#         :param spider: the spider instance
#         :return:
#         """
#         # Store to a local JSON file
#         data_dict = dict(item)
#
#         # import json
#         # json_data = json.dumps(data_dict,ensure_ascii=False)
#         # self.file.write(json_data+'\n')
#         # Use isinstance to decide which table the item belongs in
#         # if isinstance(item, ChinazprojectWebInfoItem):
#         #     print('Site information')
#         #     tablename = 'webinfo'
#         # elif isinstance(item, ChinazprojectItem):
#         #     print('Site category information')
#         #     tablename = 'tags'
#         # Write to the database
#         # sql = """
#         # INSERT INTO tags(%s)
#         # VALUES (%s)
#         # """ %(
#         #     ','.join(data_dict.keys()),
#         #     ','.join(['%s']*len(data_dict))
#         # )
#         if data_dict:
#
#             sql, data = item.get_insert_sql_data(data_dict)
#             try:
#                 # self.cursor.execute(sql,list(data_dict.values()))
#                 self.cursor.execute(sql, data)
#                 self.client.commit()
#             except Exception as err:
#                 self.client.rollback()
#                 print(err)
#
#         # With more than one pipeline, always return the item;
#         # otherwise the next pipeline never receives it
#         print('Passed through the pipeline')
#         return item
#
#     def close_spider(self,spider):
#         """
#         Called once when the spider finishes
#         :param spider:
#         :return:
#         """
#         # self.file.close()
#         self.cursor.close()
#         self.client.close()
#         print('Spider closed')

# Inserting into MongoDB
# class ChinazprojectPipeline(object):
#
#     def __init__(self,host,port,db):
#         # Create the MongoDB connection
#         self.mongo_client = pymongo.MongoClient(
#             host=host,port=port
#         )
#         # Select the database to work with
#         self.db = self.mongo_client[db]
#
#     @classmethod
#     def from_crawler(cls,crawler):
#         '''
#         Expected entries in settings.py:
#         MONGODB_HOST = '127.0.0.1'
#         MONGODB_PORT= 27017
#         MONGODB_DB = "chinaz"
#         :param crawler:
#         :return:
#         '''
#         host = crawler.settings['MONGODB_HOST']
#         port = crawler.settings['MONGODB_PORT']
#         db = crawler.settings['MONGODB_DB']
#         return cls(host,port,db)
#
#     def process_item(self, item, spider):
#         '''
#         This method is required; every item yielded by
#         the spider passes through it
#         :param item: the item passed from the spider
#         :param spider: the spider instance
#         :return:
#         '''
#         # Which collection to insert into, and what data to insert.
#         # get_mongodb_collectionName() is not defined in items.py above;
#         # each item class would need to provide it.
#         col_name = item.get_mongodb_collectionName()
#         col = self.db[col_name]
#         dict_data = dict(item)
#         try:
#             # insert_one replaces Collection.insert, which PyMongo 4 removed
#             col.insert_one(dict_data)
#             print('Insert succeeded')
#         except Exception as err:
#             print('Insert failed',err)
#         return item
#
#     def open_spider(self,spider):
#         print(spider.name,'spider started')
#
#     def close_spider(self, spider):
#         self.mongo_client.close()
#         print('Spider closed')

# Asynchronous inserts into MySQL (useful when the volume of data is very large)
from twisted.enterprise import adbapi

class ChinazprojectPipeline(object):

    def __init__(self,dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls,crawler):
        """
        Expected entries in settings.py:
        MYSQL_HOST = '127.0.0.1'
        MYSQL_USER = 'root'
        MYSQL_PWD = 'ljh1314'
        MYSQL_DB = 'chainz'
        MYSQL_PORT = 3306
        MYSQL_CHARSET = 'utf8'
        :param crawler:
        :return:
        """
        db_params = {
            'host':crawler.settings['MYSQL_HOST'],
            'user':crawler.settings['MYSQL_USER'],
            'passwd':crawler.settings['MYSQL_PWD'],
            'db':crawler.settings['MYSQL_DB'],
            'port':crawler.settings['MYSQL_PORT'],
            'charset':crawler.settings['MYSQL_CHARSET'],
        }

        # Connection pool backed by the pymysql driver
        dbpool = adbapi.ConnectionPool('pymysql',**db_params)

        return cls(dbpool)

    def process_item(self,item,spider):
        # runInteraction runs the insert in a pool thread inside a transaction:
        # it commits if the function returns normally and rolls back if it raises
        query = self.dbpool.runInteraction(
            self.insert_data_to_mysql,
            item
        )
        # Report failed inserts
        query.addErrback(
            self.insert_err,
            item
        )

        return item

    def insert_data_to_mysql(self,cursor,item):
        data_dict = dict(item)
        sql,data = item.get_insert_sql_data(data_dict)
        cursor.execute(sql,data)

    def insert_err(self,failure,item):
        print(failure,'Insert failed',item)
scrapy shell

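The Scrapy shell is an interactive console where you can try out and debug scraping code without running a spider. It is also handy for testing XPath or CSS expressions to see how they behave, which makes it easier to work out how to extract data from the pages you crawl.

Starting the Scrapy shell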
scrapy shell "http://www.baidu.com/"
scrapy shell -s USER_AGENT=' '

Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as a Response object and a Selector object (for HTML and XML content).

Once the shell has loaded, the response data is available in a local variable named response. Entering response.body prints the response body, and response.headers shows the response headers.

Entering response.selector gives you a Selector object initialized from the response; you can then query the response with response.selector.xpath() or response.selector.css().

Scrapy also provides shortcuts: response.xpath() and response.css() work the same way (as in the example above).

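Selectors

Scrapy Selectors have built-in support for XPath and CSS selector expressions. A Selector has four basic methods, of which xpath() is the most commonly used:

xpath(): takes an XPath expression and returns a SelectorList of all nodes that match it
extract(): serializes the selected nodes to strings and returns them as a list
css(): takes a CSS expression and returns a SelectorList of all nodes that match it; the selector syntax is the same as in BeautifulSoup4
re(): extracts data using the given regular expression and returns a list of strings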
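
The select-then-extract pattern behind these methods can be tried without Scrapy installed. The sketch below is a stand-in, not Scrapy's Selector API: it uses the standard library's xml.etree.ElementTree, which supports only a small subset of XPath, plus the re module. The markup is invented for the example.

```python
import re
import xml.etree.ElementTree as ET

HTML = """
<ul class="listCentent">
  <li><h3><a href="/site/1">Site One</a></h3><span>rank 12</span></li>
  <li><h3><a href="/site/2">Site Two</a></h3><span>rank 34</span></li>
</ul>
"""

root = ET.fromstring(HTML)

# Like xpath(): select every matching node, returned as a list.
items = root.findall("./li")

titles = []
for li in items:
    # The leading "." keeps the query relative to the current <li>.
    link = li.find(".//a")
    # Like extract(): turn the selected node into a plain string.
    titles.append(link.text)

# Like re(): pull substrings out of the selected text with a regular expression.
ranks = [re.search(r"\d+", li.find("./span").text).group() for li in items]

print(titles)
print(ranks)
```

In Scrapy itself the equivalent calls would go through response.xpath(...), extract() and re() as listed above.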

These are just my personal study notes; if you spot any mistakes, corrections are welcome.
