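Seeing the world through a science-and-engineering mindset

Web crawler column series

A beginner doing my best to build a minimal learning system

Topic: Scrapy in practice, storing the scraped data in both MySQL and MongoDB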

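0: Goal

Scrapy basic tutorial
The best hands-on practice you asked for
Liu Weipeng's blog
Click me
Goal: fetch every post on Liu Weipeng's blog
- Post title: Title
- Publication time: Time
- Full text: Content
- Post link: Url
Approach:
- Analyze how the home page and the paginated pages are built
- Collect all post links
- For each collected link, parse out the title, publication time, full text, and link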

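1: Breaking down the goal

Scrapy supports XPath, which is used below to locate each field.

Getting all links

# The XPath for links on the home page differs slightly from the one for the remaining pages
each_page_data = selector.xpath('//div[@id="index-featured1"]/ul/li/h3[@class="entry-title"]/a/@href').extract()
each_page_data_other = selector.xpath('//div[@id="content"]/div/ul/li/h3[@class="entry-title"]/a/@href').extract()
# All URLs are collected into one list: item_url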
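To make the two-expression idea concrete, here is a minimal sketch that runs without Scrapy. It swaps in the standard library's `xml.etree.ElementTree`, whose XPath subset cannot return `@href` directly, so the attribute is read from each matched `<a>` element instead. `SAMPLE_INDEX` and `collect_links` are hypothetical names, and the markup is a simplified stand-in, not the blog's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified index page (well-formed so ElementTree can parse it).
SAMPLE_INDEX = """
<html><body>
  <div id="index-featured1"><ul>
    <li><h3 class="entry-title"><a href="http://example.com/post-1/">Post 1</a></h3></li>
  </ul></div>
  <div id="content"><div><ul>
    <li><h3 class="entry-title"><a href="http://example.com/post-2/">Post 2</a></h3></li>
    <li><h3 class="entry-title"><a href="http://example.com/post-3/">Post 3</a></h3></li>
  </ul></div></div>
</body></html>
"""


def collect_links(html):
    """Return the links from the featured block followed by the regular list."""
    root = ET.fromstring(html.strip())
    featured = root.findall('.//div[@id="index-featured1"]/ul/li/h3[@class="entry-title"]/a')
    others = root.findall('.//div[@id="content"]/div/ul/li/h3[@class="entry-title"]/a')
    item_url = []
    for anchor in featured + others:
        href = anchor.get("href")
        if href:
            item_url.append(href)
    return item_url


item_url = collect_links(SAMPLE_INDEX)
```

On a page that has no featured block, the first expression simply matches nothing, which is why both can be applied to every page.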

Post title

title = selector.xpath('//div[@id="content"]/div/h1[@class="entry-title"]/a/text()').extract()

Publication time

time = selector.xpath('//div[@id="content"]/div/div[@class="entry-info"]/abbr/text()').extract()

Full text

content = selector.xpath('//div[@id="content"]/div/div[@class="entry-content clearfix"]/p/text()').extract()

Post link

url = selector.xpath('//div[@id="content"]/div/h1[@class="entry-title"]/a/@href').extract()
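The four expressions can be tried together on a small sample before wiring them into a spider. The sketch below again uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (its XPath subset has no `text()` or `@href` step, so text and attributes are read from the matched elements). `SAMPLE_POST` and `parse_post` are hypothetical, and the markup only mimics the structure the expressions assume.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified post page that follows the structure the XPaths expect.
SAMPLE_POST = """
<html><body><div id="content"><div>
  <h1 class="entry-title"><a href="http://example.com/post-1/">Post 1</a></h1>
  <div class="entry-info"><abbr>2016-05-01</abbr></div>
  <div class="entry-content clearfix"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div></div></body></html>
"""


def parse_post(html):
    """Extract title, time, full text, and link from one post page."""
    root = ET.fromstring(html.strip())
    base = './/div[@id="content"]/div'
    link = root.find(base + '/h1[@class="entry-title"]/a')
    when = root.find(base + '/div[@class="entry-info"]/abbr')
    paragraphs = root.findall(base + '/div[@class="entry-content clearfix"]/p')
    return {
        "Title": link.text,
        "Time": when.text,
        # The body is a list of <p> nodes, so join them into one string.
        "Content": "\n".join(p.text or "" for p in paragraphs),
        "Url": link.get("href"),
    }


post = parse_post(SAMPLE_POST)
```

Note that the content expression matches one node per paragraph, so the full text is a list that has to be joined, while the other three fields match a single node.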

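Basic tutorial for the Scrapy framework:
Translated tutorial

General steps
- Create a new project
- Define the Item: items.py declares the fields to scrape
- Write the spider: the spiders folder holds the crawler files
- settings.py holds the configuration, such as request headers and constants like the MySQL user and port
- pipelines.py holds the storage logic, such as the MySQL and MongoDB operations
How the Scrapy framework works
The classic documentation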

001.png

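* Engine (Scrapy engine)
* Scheduler
* Downloader
* Spider
* Item pipeline

Run flow: Scrapy roughly runs as follows.
First, the engine takes a link (URL) from the scheduler for the next fetch.
The engine wraps the URL in a Request and passes it to the downloader, which downloads the resource and wraps it in a Response.
Then the spider parses the Response.
If it parses out an Item, the Item is handed to the item pipeline for further processing.
If it parses out a link (URL), the URL is handed to the scheduler to wait for fetching.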
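The flow above can be imitated with a toy loop. This is only a synchronous sketch of the idea, not how Scrapy is implemented (the real engine is asynchronous); `FAKE_SITE`, `download`, `spider_parse`, and `run_engine` are all made-up names, and the "web" is an in-memory dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (links found on the page, item or None).
FAKE_SITE = {
    "index": (["post-1", "post-2"], None),
    "post-1": ([], {"Title": "Post 1"}),
    "post-2": ([], {"Title": "Post 2"}),
}


def download(url):
    """Downloader: fetch the resource for a URL and wrap it as a response."""
    links, item = FAKE_SITE[url]
    return {"url": url, "links": links, "item": item}


def spider_parse(response):
    """Spider: yield items (dicts) and follow-up URLs (strings)."""
    if response["item"] is not None:
        yield response["item"]
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    """Engine: move URLs and items between scheduler, downloader, spider, pipeline."""
    scheduler = deque(start_urls)  # URLs waiting to be fetched
    seen = set(start_urls)         # avoid scheduling the same URL twice
    stored = []                    # stands in for the item pipeline
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in spider_parse(response):
            if isinstance(result, dict):
                stored.append(result)      # an item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)   # a URL goes back to the scheduler
    return stored


items = run_engine(["index"])
```

Starting from `"index"`, the loop schedules the two post URLs, and each post page then yields one item for the pipeline.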

2: Hands-on

Write the items file to define the scraping targets

import scrapy


class LiuweipengItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Title = scrapy.Field()  # title
    Time = scrapy.Field()  # publication time
    Url = scrapy.Field()  # post link
    Content = scrapy.Field()  # post content

Write the spider

from scrapy import Request, Spider

from ..items import LiuweipengItem  # items.py in the project package


class BlogSpider(Spider):
    """Collect every post link on the site, then parse each post."""
    name = "liuweipeng"
    start_urls = ["http://mindhacks.cn/", "http://mindhacks.cn/page/2/", "http://mindhacks.cn/page/3/", "http://mindhacks.cn/page/4/"]

    def parse(self, response):
        # The home page and the remaining pages need slightly different XPaths
        url_item = []
        each_page_data = response.xpath('//div[@id="index-featured1"]/ul/li/h3[@class="entry-title"]/a/@href').extract()
        each_page_data_other = response.xpath('//div[@id="content"]/div/ul/li/h3[@class="entry-title"]/a/@href').extract()
        url_item.extend(each_page_data)
        url_item.extend(each_page_data_other)
        for one in url_item:
            yield Request(one, callback=self.parse_detail)

    def parse_detail(self, response):
        # Parse the content of each collected link
        item = LiuweipengItem()
        title = response.xpath('//div[@id="content"]/div/h1[@class="entry-title"]/a/text()').extract_first()
        time = response.xpath('//div[@id="content"]/div/div[@class="entry-info"]/abbr/text()').extract_first()
        # The body spans many <p> nodes; join them instead of keeping only the first one
        content = response.xpath('//div[@id="content"]/div/div[@class="entry-content clearfix"]/p/text()').extract()
        url = response.xpath('//div[@id="content"]/div/h1[@class="entry-title"]/a/@href').extract_first()
        item["Title"] = title
        item["Time"] = time
        item["Content"] = "\n".join(content)
        item["Url"] = url
        yield item
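The four start URLs follow one pattern (the home page, then `/page/N/`), so they can be generated rather than typed out. A small sketch, assuming the pattern holds for every page; `build_start_urls` and its `last_page` parameter are hypothetical, and the value 4 simply mirrors the four pages listed in the spider.

```python
BASE_URL = "http://mindhacks.cn/"


def build_start_urls(last_page):
    """Return the home page plus /page/2/ ... /page/<last_page>/."""
    pages = ["{}page/{}/".format(BASE_URL, n) for n in range(2, last_page + 1)]
    return [BASE_URL] + pages


start_urls = build_start_urls(4)
```

If the blog gains more pages, only the number passed in needs to change.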

Write the settings file (1): storing to MongoDB

MONGODB_HOST = '127.0.0.1'  # localhost
MONGODB_PORT = 27017  # port
MONGODB_DBNAME = 'Liuweipeng'  # database name
MONGODB_DOCNAME = 'blog'  # collection name
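Scrapy only runs the pipelines listed in the `ITEM_PIPELINES` setting, so the pipeline class also has to be enabled in settings.py. A minimal fragment, assuming the project package is called `liuweipeng` (the source does not state the package name, so adjust the dotted path to your own project):

```python
# settings.py (fragment)
# "liuweipeng" is an assumed project package name; replace it with your own.
ITEM_PIPELINES = {
    "liuweipeng.pipelines.LiuweipengPipeline": 300,  # lower numbers run earlier (0-1000)
}
```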

Write the pipeline file to store the data in MongoDB

import pymongo


class LiuweipengPipeline(object):
    def __init__(self, host, port, db_name, doc_name):
        # Set up the database connection and keep a handle to the collection
        client = pymongo.MongoClient(host=host, port=port)
        self.post = client[db_name][doc_name]

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf is deprecated; read the settings from the crawler instead
        s = crawler.settings
        return cls(s['MONGODB_HOST'], s.getint('MONGODB_PORT'),
                   s['MONGODB_DBNAME'], s['MONGODB_DOCNAME'])

    def process_item(self, item, spider):
        # Insert the scraped data into MongoDB (insert() was removed in PyMongo 4)
        self.post.insert_one(dict(item))
        return item  # pass the item on to any later pipeline

Result:

002.png

Storage option 2: MySQL

# The pipeline file changes as follows; the driver imported here is pymysql
import pymysql


class LiuweipengPipeline(object):
    def __init__(self):
        self.connection = pymysql.connect(host='localhost',
                                          user='root',
                                          password='123456',
                                          port=3306,
                                          database='test',
                                          charset='utf8')

    def process_item(self, item, spider):
        with self.connection.cursor() as cursor:
            # Parameterized query: the driver escapes the values
            sql = "INSERT INTO `blog`(`Title`, `Time`, `Content`, `Url`) VALUES (%s, %s, %s, %s)"
            cursor.execute(sql, (item['Title'], item["Time"], item["Content"], item["Url"]))
        self.connection.commit()
        return item  # pass the item on to any later pipeline

The table has to be created locally first:

# Create a table named blog in the test database with the following columns:
# (Content uses TEXT because a full article does not fit in VARCHAR(255))
CREATE TABLE `blog` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `Title` VARCHAR(255) COLLATE utf8_bin NOT NULL,
    `Content` TEXT COLLATE utf8_bin NOT NULL,
    `Time` VARCHAR(255) COLLATE utf8_bin NOT NULL,
    `Url` VARCHAR(255) COLLATE utf8_bin NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
AUTO_INCREMENT=1 ;
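The insert-then-commit pattern can be tried without a MySQL server. The sketch below deliberately swaps in the standard library's `sqlite3` with an in-memory database, so two things differ from the MySQL version: the placeholder is `?` rather than `%s`, and the column types are SQLite's. `make_connection` and `store_item` are hypothetical helpers, and the sample item is made up.

```python
import sqlite3


def make_connection():
    """Create an in-memory database with a blog table shaped like the MySQL one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blog ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "Title TEXT NOT NULL, Content TEXT NOT NULL, "
        "Time TEXT NOT NULL, Url TEXT NOT NULL)"
    )
    return conn


def store_item(conn, item):
    """Insert one item with a parameterized query, then commit."""
    # sqlite3 uses "?" placeholders; PyMySQL uses "%s" for the same purpose.
    sql = "INSERT INTO blog (Title, Time, Content, Url) VALUES (?, ?, ?, ?)"
    conn.execute(sql, (item["Title"], item["Time"], item["Content"], item["Url"]))
    conn.commit()


conn = make_connection()
store_item(conn, {
    "Title": "Post 1",
    "Time": "2016-05-01",
    "Content": "First paragraph.\nSecond paragraph.",
    "Url": "http://example.com/post-1/",
})
rows = conn.execute("SELECT Title, Time, Content, Url FROM blog").fetchall()
```

Passing the values as a separate tuple, instead of formatting them into the SQL string, lets the driver handle quoting in both databases.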

Result 2:

003.png

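Full code: you won't know the bugs until you click

3: Summary

This post used the Scrapy framework to scrape a blog and stored the results in two different ways.
The goal breakdown above is fairly detailed.

One more thing: no practical write-up will solve the actual problem you are facing, so why read them at all? For the experience: reading lets you pick up other people's experience, though it still has to be read with a critical mind.

The way of thinking I value:
Understand what this is.
Know how it should be done.
Learn to do it yourself. (In fact this was my first time storing Scrapy output in MySQL, and I still ran into quite a few problems.)

About me:
Only one occupation: student
Only one task: learning
This road is full of endless difficulties; I hope to become a person with a rich inner world.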

Column 014: Here is the hands-on practice you asked for.
