Preface
After dark I was in the library playing with a crawler — the open-source one, Scrapy! A few days ago I set up a Redis cluster, so the scraped data goes straight into the Redis database.
Building a Redis Database Cluster | Hands-on
Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used for data mining, information processing, or archiving historical data. Scrapy handles network communication with the Twisted asynchronous networking library; its architecture is clean and it exposes various middleware hooks, so it can be flexibly adapted to all kinds of needs.
Goal
The target is the hot-books ranking on the school library site — intranet first [saves bandwidth O(∩_∩)O haha~].
Just store the hot books' data in the Redis database — a very simple experiment.
PS
# Install the Python Redis client (the PyPI package is named "redis", not "python-redis")
sudo pip install redis
Everything is ready — this assumes you already have Scrapy's hello world under your belt.
- Disable ROBOTSTXT_OBEY
Edit settings.py, otherwise some URLs will refuse access.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
- Define the Item
Edit items.py — this file is where the data objects are defined. Here we define the object for a book.
class BookItem(scrapy.Item):
    bid = scrapy.Field()              # ranking number
    name = scrapy.Field()             # title
    author = scrapy.Field()           # author
    kid = scrapy.Field()              # call number
    isbn = scrapy.Field()             # ISBN
    public_location = scrapy.Field()  # place of publication
    public = scrapy.Field()           # publisher
    clicked = scrapy.Field()          # view count
    type = scrapy.Field()             # book type
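Once the fields are declared, a BookItem behaves like a dict and is filled in key by key. As a rough sketch of what one finished record looks like (values borrowed from the sample crawl output at the end of this post; every value is a one-element list because XPath's extract() returns a list of matches):

```python
# One scraped record, shown as a plain dict purely for illustration.
record = {
    'bid': ['1'],
    'name': ['\xa0'],                     # the title cell came back (nearly) empty
    'author': ['楊萃先 ... [等]著\xa0'],
    'kid': ['C913.2/258\xa0'],
    'isbn': ['978-7-80080-752-7\xa0'],
    'public_location': ['北京\xa0'],
    'public': ['群言出版社\xa0'],
    'clicked': ['14987\xa0'],
    'type': ['中文圖書\xa0'],
}
print(record['public'][0])
```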
- Write the Spider
Create a BookSpider.py file in the spiders folder. This file holds the crawling logic — it fetches the book information and stores it in Redis. The core of the program!
Its contents:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import scrapy
import redis
from demo01.items import BookItem


class BookSpider(scrapy.Spider):
    # the spider's name | must be unique
    name = "books"
    # domains the spider may crawl
    # allowed_domains = ["samego.com"]
    # URLs the crawl starts from
    start_urls = [
        "http://172.16.4.188:8081/opac/index_hotll.jsp/"
    ]

    # entry point of the crawl | parse is a method of Spider
    def parse(self, response):
        # set up a Redis connection pool
        pool = redis.ConnectionPool(host='172.16.168.1', port=6379)
        presenter = redis.Redis(connection_pool=pool)
        # the table that holds the book list
        book_table = response.xpath('//table[@class="bordermain"][1]')
        # each <tr> is one book
        book_elements = book_table.xpath(".//tr")
        # drop the header row
        del book_elements[0]
        # process the rows one by one
        for book_tr in book_elements:
            # a fresh item for this book
            book_item = BookItem()
            # the <td> cells of this row
            book = book_tr.xpath(".//td")
            book_item["bid"] = book[1].xpath("./text()").extract()
            book_item["name"] = book[2].xpath("./text()").extract()
            book_item["author"] = book[3].xpath("./text()").extract()
            book_item["kid"] = book[4].xpath("./text()").extract()
            book_item["isbn"] = book[5].xpath("./text()").extract()
            book_item["public_location"] = book[6].xpath("./text()").extract()
            book_item["public"] = book[7].xpath("./text()").extract()
            book_item["clicked"] = book[9].xpath("./text()").extract()
            book_item["type"] = book[10].xpath("./text()").extract()
            # key of this book's sorted set
            z_key = '%s_%s' % ('book', book_item["bid"][0])
            # save each field to Redis (redis-py 2.x style: zadd(key, member, score));
            # members must be strings, hence the [0] on the one-element lists
            presenter.zadd(z_key, book_item["name"][0], 1)
            presenter.zadd(z_key, book_item["author"][0], 2)
            presenter.zadd(z_key, book_item["kid"][0], 3)
            presenter.zadd(z_key, book_item["isbn"][0], 4)
            presenter.zadd(z_key, book_item["public_location"][0], 5)
            presenter.zadd(z_key, book_item["public"][0], 6)
            presenter.zadd(z_key, book_item["clicked"][0], 7)
            presenter.zadd(z_key, book_item["type"][0], 8)
            yield book_item
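A portability note on the Redis calls, assuming the spider is ever run against a newer client: redis-py changed zadd's signature in version 3.0. The spider uses the pre-3.0 form zadd(key, member, score); with redis-py >= 3.0, members and their scores must be passed as a single mapping. A minimal sketch of the newer call shape (the member value is a sample from the crawl output; the score only fixes the field's slot in the sorted set):

```python
# redis-py >= 3.0 expects zadd(key, {member: score, ...}) — build the
# mapping first, then one call per book instead of one per field.
member = '群言出版社'   # sample publisher value from the crawl output
mapping = {member: 6}   # score = the field's slot in the sorted set
print(mapping)
# with a live connection this would be: presenter.zadd(z_key, mapping)
```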
- Run the Scrapy program
➜  ~ scrapy crawl books
# or: also dump the items as JSON into books.json
➜  ~ scrapy crawl books -o books.json
- Terminal output
Part of the output:
➜  demo01 scrapy crawl books
2017-04-19 21:45:45 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: demo01)
... ...
2017-04-19 21:45:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://172.16.4.188:8081/opac/index_hotll.jsp/> (referer: None)
2017-04-19 21:45:47 [scrapy.core.scraper] DEBUG: Scraped from <200 http://172.16.4.188:8081/opac/index_hotll.jsp/>
{'author': ['楊萃先 ... [等]著\xa0'],
'bid': ['1'],
'clicked': ['14987\xa0'],
'isbn': ['978-7-80080-752-7\xa0'],
'kid': ['C913.2/258\xa0'],
'name': ['\xa0'],
'public': ['群言出版社\xa0'],
'public_location': ['北京\xa0'],
'type': ['中文圖書\xa0']}
2017-04-19 21:45:47 [scrapy.core.scraper] DEBUG: Scraped from <200 http://172.16.4.188:8081/opac/index_hotll.jsp/>
{'author': ['(英)靄理士(Havelock Ellis)著\xa0'],
'bid': ['2'],
'clicked': ['1219\xa0'],
'isbn': ['7-108-00161-6\xa0'],
'kid': ['R167/20\xa0'],
'name': ['\xa0'],
'public': ['三聯(lián)書店\xa0'],
'public_location': ['北京\xa0'],
'type': ['中文圖書\xa0']}
... ...
Note: one problem remains — the fields contain unicode-encoded characters. Cleaning them up is not hard, but the library is about to close. Gotta run O(∩_∩)O haha~
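For the record, the stray characters are \xa0 non-breaking spaces (the page's HTML &nbsp; entities). The cleanup is one line per field — a sketch using a sample value from the output above:

```python
# Replace non-breaking spaces and trim the result.
raw = '978-7-80080-752-7\xa0'   # sample ISBN value from the crawl output
clean = raw.replace('\xa0', ' ').strip()
print(clean)
```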
Alic says: **value comes from technology, contribution comes from sharing**