[Python] Scraping the Complete Digimon Encyclopedia with the CrawlSpider Framework

Note up front: the example site crawled in this article is the Digimon Database, http://digimons.net/digimon/chn.html

The details mentioned here are covered in three other posts of mine:
Local MySQL helper-class configuration:
[Python] 爬取bioinfo帖子相關(guān)信息 (requests + openpyxl + pymysql)
Detailed configuration of each Scrapy component:
[Python] 爬蟲 Scrapy框架各組件詳細(xì)設(shè)置
SQL command basics:
[SQL] MySQL基礎(chǔ)+Python交互

For reposting, please credit: 陳熹 chenx6542@foxmail.com (Jianshu: 半為花間酒)
For reposting on a WeChat public account, please contact: 早起Python

Requirements Analysis

- Main page analysis

First open http://digimons.net/digimon/chn.html to reach the Chinese index page, then view the page source.

Two things stand out:

  1. The data is not loaded via Ajax
  2. No pagination logic is needed to get everything: the url suffix of every Digimon is already in the current source, and the href format is <English name>/index.html

Next, examine a few Digimon detail-page urls:
http://digimons.net/digimon/agumon_yuki_kizuna/index.html
http://digimons.net/digimon/herakle_kabuterimon/index.html
http://digimons.net/digimon/mugendramon/index.html
http://digimons.net/digimon/king_etemon/index.html

Given this, one approach is to use a regex or another (hyper)text parser to grab every href in the source, join each with the parent path http://digimons.net/digimon/ via urllib.parse.urljoin to form a complete url, and then visit the detail pages.
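
As a rough sketch of that alternative route (illustrative only, assuming the requests library and that every wanted href ends in /index.html, as seen in the source):

import re
import requests
from urllib.parse import urljoin

base = 'http://digimons.net/digimon/'
html = requests.get(base + 'chn.html').text
# pull every href of the form '<english_name>/index.html' out of the source
hrefs = re.findall(r'href="([^"]+/index\.html)"', html)
detail_urls = [urljoin(base, h) for h in set(hrefs)]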

This article, however, takes a different route.

Much of the data-collection work in crawlers is similar, so Scrapy ships several generic spider classes with a higher level of encapsulation.

They can be listed with:

# list the generic spider templates that Scrapy provides
scrapy genspider -l
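
On a recent Scrapy installation this prints something like (the exact list can vary by version):

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed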

CrawlSpider is the most commonly used of these generic spiders.

Through a set of rules, it automatically discovers and follows page links, which covers (but is not limited to) collecting detail pages and following category/pagination urls. By default its traversal is essentially depth-first.

This site's layout is very simple, so the convenient whole-site crawling approach fits well. Its precondition: irrelevant urls and wanted urls must differ visibly, so they can be told apart with a regex or by other means.

In short, picture giving the spider a single start url: it keeps visiting the new urls reachable from there, checking each against the preset rules to decide whether it is wanted. If it is, the page is parsed first and its links followed onward; if not, the links are followed anyway; a branch ends when it yields no new urls.
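
That traversal can be captured in a toy sketch. This only illustrates the logic just described, not Scrapy's internals; is_wanted, get_links and parse are hypothetical stand-ins:

from collections import deque

def crawl(start_url, is_wanted, get_links, parse):
    seen, stack = {start_url}, deque([start_url])
    while stack:
        url = stack.pop()           # LIFO pop, i.e. depth-first order
        if is_wanted(url):          # does the url match the preset rules?
            parse(url)              # wanted pages are parsed first...
        for new in get_links(url):  # ...then outgoing links are followed either way
            if new not in seen:
                seen.add(new)
                stack.append(new)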

Most external links on these pages go to Wikipedia, whose urls look quite different. Comparing them with the Digimon detail-page urls, the wanted url format can be summarized as:

http://digimons.net/digimon/.*/index.html

with a regex wildcard standing in for the English name in the middle.
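
A quick sanity check of the pattern (the dots are left unescaped, just as in the Rule further below; that is harmless here because an unescaped . also matches a literal dot):

import re

pattern = re.compile(r'http://digimons.net/digimon/.*/index.html')
print(bool(pattern.search('http://digimons.net/digimon/mugendramon/index.html')))  # True
print(bool(pattern.search('https://en.wikipedia.org/wiki/Digimon')))               # False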

- Detail page analysis

The scraping requirements are as follows:

Essentially a Digimon's entire profile is scraped, but note that not every profile is complete, so the code must allow for missing fields.

That wraps up the requirements analysis; time to write the code.

Hands-On Code

- Create the project

# scrapy startproject <Project_name>
scrapy startproject Digimons

# scrapy genspider <spider_name> <domains>
scrapy genspider -t crawl digimons digimons.net

- spiders.py

Open the project folders Digimons → Digimons → spiders in turn and create digimons.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DigimonsItem


class DigimonsSpider(CrawlSpider):
    name = 'digimons'
    allowed_domains = ['digimons.net']
    start_urls = ['http://digimons.net/digimon/chn.html']

    # The crawl rules are the key part: put the url pattern from the analysis above into allow
    # callback='parse_item': urls matching the rule are handed to this function
    # follow=False: do not keep following links from matched pages
    rules = (
        Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'), callback='parse_item', follow=False),
    )

    # Parse field by field as required; some fields can be absent and must be checked
    def parse_item(self, response):
        # name list
        name_lst = response.xpath('//*[@id="main"]/article/h2[1]//text()').extract()
        name_lstn = [i.replace('/', '').strip() for i in name_lst if i.strip() != '']
        # Chinese name
        Cname = name_lstn[0].replace(' ', '-')
        # Japanese name
        Jname = name_lstn[1]
        # English name
        Ename = name_lstn[2]
        # level
        digit_grade = response.xpath("//article/div[@class='data'][1]/table/tr[1]/td/text()").extract()
        digit_grade = '-' if digit_grade == [] else ''.join(digit_grade)
        # type
        digit_type = response.xpath("//article/div[@class='data'][1]/table/tr[2]/td/text()").extract()
        digit_type = '-' if digit_type == [] else ''.join(digit_type)
        # attribute
        digit_attribute = response.xpath("//article/div[@class='data'][1]/table/tr[3]/td/text()").extract()
        digit_attribute = '-' if digit_attribute == [] else ''.join(digit_attribute)
        # affiliation
        belongs = response.xpath("//article/div[@class='data'][1]/table/tr[4]/td/text()").extract()
        belongs = '-' if belongs == [] else ''.join(belongs)
        # field of adaptation
        adaptation_field = response.xpath("//article/div[@class='data'][1]/table/tr[5]/td/text()").extract()
        adaptation_field = '-' if adaptation_field == [] else ''.join(adaptation_field)
        # debut
        debut = response.xpath("//article/div[@class='data'][1]/table/tr[6]/td/text()").extract()
        debut = '-' if debut == [] else ''.join(debut)
        # name origin
        name_source = response.xpath("//article/div[@class='data'][1]/table/tr[7]/td/text()").extract()
        name_source = '-' if name_source == [] else '/'.join(name_source).strip('/')
        # special moves
        nirvana = response.xpath("//article/div[@class='data'][2]/table/tr/td[1]/text()").extract()
        nirvana = '-' if nirvana == [] else '/'.join(nirvana).strip('/')
        # profile text
        info_lst = response.xpath("//*[@id='cn']/p/text()").extract()
        info = ''.join([i.replace('/', '').strip() for i in info_lst if i.strip() != ''])
        # image url; response.url[:-10] strips the trailing 'index.html' (10 characters)
        img_url = response.xpath('//*[@id="main"]/article/div[1]/a/img/@src').extract()
        img_url = response.url[:-10] + img_url[0] if img_url != [] else '-'

        # a simple progress print, personal habit
        print(Cname, Jname, Ename)

        # hand off to items for persistent storage
        item = DigimonsItem()
        item['Cname'] = Cname
        item['Jname'] = Jname
        item['Ename'] = Ename
        item['digit_grade'] = digit_grade
        item['digit_type'] = digit_type
        item['digit_attribute'] = digit_attribute
        item['belongs'] = belongs
        item['adaptation_field'] = adaptation_field
        item['debut'] = debut
        item['name_source'] = name_source
        item['nirvana'] = nirvana
        item['info'] = info
        item['img_url'] = img_url
        yield item

- items.py

For details of the MySQL storage, see my other article:
[Python] 爬取生信坑論壇 bioinfo帖子相關(guān)信息 (requests + openpyxl + pymysql)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

# Fields correspond one-to-one with what spiders.py fills in
class DigimonsItem(scrapy.Item):
    Cname = scrapy.Field()
    Jname = scrapy.Field()
    Ename = scrapy.Field()
    digit_grade = scrapy.Field()
    digit_type = scrapy.Field()
    digit_attribute = scrapy.Field()
    belongs = scrapy.Field()
    adaptation_field = scrapy.Field()
    debut = scrapy.Field()
    name_source = scrapy.Field()
    nirvana = scrapy.Field()
    info = scrapy.Field()
    img_url = scrapy.Field()

    # The commented block below is the SQL DDL; run it in the MySQL client first
    def get_insert_sql_and_data(self):
    # CREATE TABLE digimons(
    # id int not null auto_increment primary key,
    # Chinese_name text, Japanese_name text, English_name text,
    # digit_grade text, digit_type text, digit_attribute text,
    # belongs text, adaptation_field text, debut text, name_source text,
    # nirvana text, info text, img_url text)ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        insert_sql = 'insert into digimons(Chinese_name,Japanese_name,English_name,digit_grade,digit_type,' \
                     'digit_attribute,belongs,adaptation_field,debut,name_source,nirvana,info,img_url)' \
                     'values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        data = (self['Cname'],self['Jname'],self['Ename'],self['digit_grade'],self['digit_type'],
                self['digit_attribute'],self['belongs'],self['adaptation_field'],self['debut'],
                self['name_source'],self['nirvana'],self['info'],self['img_url'])
        return (insert_sql, data)

- pipelines.py

Storage is completed with the help of items.py and the Mysqlhelper class.

# -*- coding: utf-8 -*-
from mysqlhelper import Mysqlhelper
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DigimonsPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
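
The Mysqlhelper class lives in a separate local module (see the article referenced earlier for the full version). A minimal sketch based on pymysql could look like this; the connection parameters (database name, credentials) are placeholders to adapt:

# mysqlhelper.py -- minimal sketch, assuming pymysql is installed
import pymysql


class Mysqlhelper(object):
    def __init__(self):
        # adjust host/user/password/db to your own MySQL setup
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    password='******', db='spiders',
                                    charset='utf8mb4')

    def execute_sql(self, sql, data):
        # execute one parameterized statement and commit right away
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()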

- settings.py

Nothing special was changed; the following block is commented out by default and needs to be uncommented:

ITEM_PIPELINES = {
   'Digimons.pipelines.DigimonsPipeline': 300,
}

- extends.py

This custom-extension part can be skipped. The feature implemented here automatically pushes a WeChat notification when the spider finishes (via the 喵提醒 / Miaotixing service).

Miaotixing requires registering an account to obtain your own id; the official API can then be wrapped locally.

from urllib import request, parse
import json

class Message(object):
    def __init__(self,text):
        self.text = text
    def push(self):
        # Important: fill in your own bound id here
        page = request.urlopen("http://miaotixing.com/trigger?" + parse.urlencode({"id": "xxxxxx", "text": self.text, "type": "json"}))
        result = page.read()
        jsonObj = json.loads(result)
        if jsonObj["code"] == 0:
            print("\nReminder message was sent successfully")
        else:
            print("\nReminder message failed to be sent, wrong code: " + str(jsonObj["code"]) + ", describe: " + jsonObj["msg"])

Save this in a separate file named message.py.
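
A quick standalone test (assuming a valid id has been filled in):

from message import Message

Message('test push').push()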

Next comes extends.py, which you create yourself at the same level as pipelines.py and items.py:

from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        # 'MMMM' is just a placeholder setting; getint returns 0 when it is absent
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        # The pushed text is customizable, e.g. it could also report how many items were scraped
        text = 'DigimonsSpider finished running'
        message = Message(text)
        message.push()
        print('spider closed')

Importantly, if a custom extension is added, it must also be enabled in settings.
The default is:

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

Modify it and uncomment:

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
   'Digimons.extends.MyExtension': 500,
}

- running.py

All that is left is to run the project. You can cd to the project directory in the command line and run:

scrapy crawl digimons

I prefer launching from a py file: create running.py (at the same level as items.py):

from scrapy.cmdline import execute

execute('scrapy crawl digimons'.split())

Run running.py and the project starts.

Running the Project

A screenshot of the run in progress:

The finished run yielded 1,179 records in total. Opening Navicat (a GUI client for MySQL and other databases):

The data is stored cleanly, and you can see that not every Digimon has an affiliation.

Some simple queries follow naturally, for example: who is in the 七大魔王 (Seven Great Demon Lords) faction?
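
A sketch of such a query via pymysql (the database name and the exact faction string stored, assumed here to be 七大魔王, depend on your own setup and the site's data):

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='******',
                       db='spiders', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("SELECT Chinese_name, English_name FROM digimons WHERE belongs LIKE %s",
                ('%七大魔王%',))
    for row in cur.fetchall():
        print(row)
conn.close()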

The results confirm there are indeed only 7; the repeated entries are form changes or variants:

Querying one of the duplicated Digimon again shows that their profiles do differ.

If browsing in MySQL is inconvenient, the table can be exported to Excel from Navicat.

For exporting from the command line, see: [SQL] MySQL基礎(chǔ)+Python交互


The scraped data is available for download as an Excel file:
https://pan.baidu.com/s/1oKFsw3at4cF5p4WpW7ZRcA
Extraction code: 1wvs

Closing Notes
The Digimon profiles contain little numeric information, so there is limited material for mining or modeling.
Scraping the Pokémon data next will be a lot more fun :)

The data-analysis side will be covered later; stay tuned.
