
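Changelog:
2019.4.28 Added the search module

Ever since I finished watching 約定夢幻島 (The Promised Neverland) I haven't been able to let it go, so I wanted to catch up on the manga. Reading on a computer is inconvenient, though, and on a phone there are too many ads and page turning is clumsy. Hence today's hands-on crawler.
The target site for this crawl is: http://www.1kkk.com/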

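1. Before we start

Likes appreciated, likes appreciated, likes appreciated~ (whispering)
(If this feels long-winded, skip straight to the main text.)
(If it still feels long-winded, skip to the complete integrated code.)
Requirements for the manga downloader:

The user enters a manga title; the program searches automatically and prints the result, including title, author, serialization status, and summary.
Detect whether the manga is paid and print a notice; the user chooses to download only the free chapters or to quit.
Detect whether the manga is age-restricted, print a notice, and pass the verification automatically to continue.
The user confirms whether the search result is correct: download if yes, quit if no.
Download every page of every chapter as a high-resolution image.
Package by chapter, with each folder named after the chapter title.
Name the pages within each chapter in reading order.
Make downloading as efficient as possible.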

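First, here is the result: the scraped manga viewed in the local reader app "Perfect Viewer":

On a phone you can tap to turn pages and jump between chapters, and the app remembers where you left off. Pretty satisfying~

[Image: view inside the local app]

[Image: high-resolution manga page]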

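Summary up front

This crawler uses the following modules:

Selector: Scrapy's parsing library
To extract content from a Selector, this article uses getall() and get() in place of the older extract() and extract_first()
requests: HTTP request library
selenium: browser automation tool
time: time module
re: regular expressions
multiprocessing: multiple processes
pymongo: MongoDB driver

Tools used:

Google Chrome
Parsing extension: XPath Helper
Postman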

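Problems worth highlighting that came up:

Processes in multiprocessing do not share global variables, and input() cannot be used in worker processes.
The image links are rendered by JS and cannot be read directly from the page source.
Image requests must carry the Referer of the corresponding chapter, or no data comes back.
Some manga are missing resources, so a check is needed.
Some manga are paid, so a check is needed.
Some manga are age-restricted; a simulated click through the verification is needed before data is returned.
A directory must be created per chapter, after checking whether it already exists.
Images must be named in order.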

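Main text:

2. Analyzing the target page structure (approach only; the code comes in the next chapter)

When writing a crawler, I suggest analyzing the site in reverse: start from the page that holds the data you ultimately want, analyze how it loads, what kind of request it uses, and how its parameters are built, then work backwards step by step to find where those parameters come from. Once the analysis is done, start from the first page and, with the goal of collecting those parameters, issue requests step by step until you can build the final data URL and fetch what you need.
So this crawl starts by analyzing the page that holds the manga image.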

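Finding the manga image link
Pick any manga and open any page. I'll stick with 約定夢幻島 as the example and click into one chapter at random (ch103 in the URL below):
http://www.1kkk.com/ch103-778911/#ipg1
The usual routine: in Chrome, press F12 to open DevTools, use the element picker, and click the manga image. DevTools jumps to the image address in the source, as shown:

[Image: locating the manga image link in DevTools]
Determining which request returns the link
Once the image link is found, the next step is to determine which request returns it.
The simplest case is that the image is not loaded asynchronously and the link comes back with the page itself. How do you tell whether it is loaded asynchronously?
Two checks: 1. in Preview, see whether the rendered page contains the manga image; 2. search the Response body directly for the image link. An example:

[Image: checking for asynchronous loading]

From the checks above we conclude that the image link is loaded asynchronously, so we need to find its data source. After searching for a while, I gave up: I could not find a structured, explicit endpoint that returns the link, and confirmed that it is produced by JS rendering. Since this is not a large-scale crawl, I decided to use selenium to drive a browser and read the image link from the rendered page. That settles how to get the image link for a single page.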
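The second check (searching the raw response for the image) can also be scripted. Below is a minimal sketch, assuming the image sits in the element with id `cp_img` (the id used by the XPath later in this article); the helper name `static_image_links` is made up for illustration. It uses only the standard library so it runs without Scrapy installed.

```python
from html.parser import HTMLParser


class _ImgFinder(HTMLParser):
    """Collects the src of every <img> found inside <div id=container_id>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # > 0 while the parser is inside the container div
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "img" and attrs.get("src"):
                self.found.append(attrs["src"])
            elif tag == "div":
                self.depth += 1
        elif tag == "div" and attrs.get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1


def static_image_links(html, container_id="cp_img"):
    """Return the image links already present in the raw (unrendered) HTML."""
    finder = _ImgFinder(container_id)
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    # An empty container in the raw response suggests the image is filled in by JS.
    print(static_image_links('<div id="cp_img"></div><script>load()</script>'))
```

Feeding it `requests.get(chapter_url, headers=header).text` and getting an empty list back would point to the same conclusion as the manual check: the link is not in the initial response.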
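Paging within a chapter to collect all image links
Since selenium is already driving the browser to get the image link, the structural analysis of pagination can be skipped too: just locate the "next page" node and have the browser click it repeatedly.

[Image: the next-page node]

That settles how to get the images for every page in a chapter. Next we need the links to all chapters.
Getting the chapter links
Some might think: chapters could be handled with selenium too, by simulating clicks from chapter to chapter. That would work, but selenium is really slow. Avoid it whenever you can, or downloading one manga could take a very long time.
Looking at the manga's main page, it is immediately clear that getting the chapter links and info is easy.

[Image: chapter link structure]

A single requests call plus parsing the page is enough to get every chapter's link and title.
At this point we can already build a simple version of a crawler for one specified manga. Why "simple"? Because many checks and the automated search feature are still missing.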

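3. Writing the simple version of the manga crawler

What does "simple version" mean?

There is no search feature, so it cannot find and download a manga automatically.
The manga title and main-page link have to be supplied by hand.
The manga must not be a paid or age-restricted one.

Everything else, including multiprocess downloading, is included.
The code modules are explained one by one below:

Getting all chapter info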

from scrapy.selector import Selector
import requests

# Main page of 約定夢幻島
start_url = "http://www.1kkk.com/manhua31328/"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
def get_chapter_list(start_url):
    res = requests.get(start_url, headers=header)
    selector = Selector(text=res.text)
    items = selector.xpath("//ul[@id='detail-list-select-1']//li")
    # Holds the info for every chapter
    chapter_list = []
    for item in items:
        # Build the absolute link
        chapter_url = "http://www.1kkk.com" + item.xpath("./a/@href").get()
        # get() returns None when nothing matches, so fall back to "" before stripping
        title = (item.xpath("./a/text()").get() or "").rstrip()
        # If no title was found there, try the alternative location
        if not title:
            title = (item.xpath("./a//p/text()").get() or "").rstrip()
        dic = {
            "chapter_url": chapter_url,
            "title": title
        }
        chapter_list.append(dic)
    # Put the chapters in ascending order
    chapter_list.reverse()
    total_len = len(chapter_list)
    print("\n[Found {} chapters in total]:\n{}".format(total_len, chapter_list))
    return chapter_list

if __name__ == '__main__':
    get_chapter_list(start_url)

Output:
A list with every chapter's link and title.

[Found 103 chapters in total]:
[{'chapter_url': 'http://www.1kkk.com/ch1-399698/', 'title': '第1話 GFhouse'}, {'chapter_url': 'http://www.1kkk.com/ch2-400199/', 'title': '第2話 出口'}, {'chapter_url': 'http://www.1kkk.com/ch3-402720/', 'title': '第3話 鐵之女'}, {'chapter_url': 'http://www.1kkk.com/ch4-404029/', 'title': '第4話 最好'}, {'chapter_url': 'http://www.1kkk.com/ch5-405506/', 'title': '第5話 被算計了！'}, {'chapter_url': 'http://www.1kkk.com/ch6-406812/', 'title': '第6話 卡羅露和克洛涅'}, {'chapter_url': 'http://www.1kkk.com/ch7-407657/', 'title': '第7話 全靠你了'}, {'chapter_url': 'http://www.1kkk.com/ch8-409649/', 'title': '第8話 我有個主意'}, {'chapter_url': 'http://www.1kkk.com/ch9-411128/', 'title': '第9話 一起來玩捉迷藏吧'}, {'chapter_url': 'http://www.1kkk.com/ch10-418782/', 'title': '第10話 掌控'}, {'chapter_url': 'http://www.1kkk.com/ch11-421753/', 'title': '第11話 內鬼①'}, {'chapter_url': 'http://www.1kkk.com/ch12-422720/', 'title': '第12話 內鬼?'}, {'chapter_url': 'http://www.1kkk.com/ch13-424435/', 'title': '第13話 內鬼3'}, {'chapter_url': 'http://www.1kkk.com/ch14-425751/', 'title': '第14話 殺手锏'}, {'chapter_url': 'http://www.1kkk.com/ch15-427433/', 'title': '第15話 不要有下次了'}, {'chapter_url': 'http://www.1kkk.com/ch16-428613/', 'title': '第16話 秘密的房間和W.密涅爾巴'}, {'chapter_url': 'http://www.1kkk.com/ch17-429698/', 'title': '第17話 秘密的房間和W.密涅瓦 ?'}, {'chapter_url': 'http://www.1kkk.com/ch18-430916/', 'title': '第18話 覺悟'}, {'chapter_url': 'http://www.1kkk.com/ch19-432001/', 'title': '第19話 廚具'}, {'chapter_url': 'http://www.1kkk.com/ch20-452160/', 'title': '第20話 “攜手共戰(zhàn)”'}, {'chapter_url': 'http://www.1kkk.com/ch21-452161/', 'title': '第21話 被看穿的策略'}, {'chapter_url': 'http://www.1kkk.com/ch22-453011/', 'title': '第22話 誘餌'}, {'chapter_url': 'http://www.1kkk.com/ch23-453852/', 'title': '第23話 砸個粉碎!!'}, {'chapter_url': 'http://www.1kkk.com/ch24-454970/', 'title': '第24話 預先調查①'}, {'chapter_url': 'http://www.1kkk.com/ch25-455408/', 'title': '第25話 預先調查②'}, {'chapter_url': 'http://www.1kkk.com/ch26-456937/', 'title': '第26話 想活下去'}, {'chapter_url': 
'http://www.1kkk.com/ch27-459192/', 'title': '第27話 不會讓你死'}, {'chapter_url': 'http://www.1kkk.com/ch28-463002/', 'title': '第28話 潛伏'}, {'chapter_url': 'http://www.1kkk.com/ch29-469845/', 'title': '第29話 潛伏②'}, {'chapter_url': 'http://www.1kkk.com/ch30-470068/', 'title': '第30話 抵抗'}, {'chapter_url': 'http://www.1kkk.com/ch31-471022/', 'title': '第31話 空虛'}, {'chapter_url': 'http://www.1kkk.com/ch32-471987/', 'title': '第32話 決行①'}, {'chapter_url': 'http://www.1kkk.com/ch33-475979/', 'title': '第33話 決行②'}, {'chapter_url': 'http://www.1kkk.com/ch34-477581/', 'title': '第34話 決行③'}, {'chapter_url': 'http://www.1kkk.com/ch35-478788/', 'title': '第35話 決行④'}, {'chapter_url': 'http://www.1kkk.com/ch36-480532/', 'title': '第36話 決行⑤'}, {'chapter_url': 'http://www.1kkk.com/ch37-484169/', 'title': '第37話 逃脫'}, {'chapter_url': 'http://www.1kkk.com/ch38-487071/', 'title': '第38話 誓言之森'}, {'chapter_url': 'http://www.1kkk.com/ch39-489256/', 'title': '第39話 意料之外'}, {'chapter_url': 'http://www.1kkk.com/ch40-491112/', 'title': '第40話 阿爾巴比涅拉之蛇'}, {'chapter_url': 'http://www.1kkk.com/ch41-492519/', 'title': '第41話 襲來'}, {'chapter_url': 'http://www.1kkk.com/ch42-495364/', 'title': '第42話 怎么可能讓你吃掉'}, {'chapter_url': 'http://www.1kkk.com/ch43-497162/', 'title': '第43話 81194'}, {'chapter_url': 'http://www.1kkk.com/ch44-498952/', 'title': '第44話 戴兜帽的少女'}, {'chapter_url': 'http://www.1kkk.com/ch45-500306/', 'title': '第45話 救援'}, {'chapter_url': 'http://www.1kkk.com/ch46-501983/', 'title': '第46話 頌施與繆西卡'}, {'chapter_url': 'http://www.1kkk.com/ch47-503551/', 'title': '第47話 昔話'}, {'chapter_url': 'http://www.1kkk.com/ch48-505288/', 'title': '第48話 兩個世界'}, {'chapter_url': 'http://www.1kkk.com/ch49-508300/', 'title': '第49話 請教教我'}, {'chapter_url': 'http://www.1kkk.com/ch50-514639/', 'title': '第50話 朋友'}, {'chapter_url': 'http://www.1kkk.com/ch51-521408/', 'title': '第51話 B06-32①'}, {'chapter_url': 'http://www.1kkk.com/ch52-523467/', 'title': '第52話 B06-32②'}, {'chapter_url': 'http://www.1kkk.com/ch53-525733/', 'title': '第53話 
B06-32③'}, {'chapter_url': 'http://www.1kkk.com/ch54-527909/', 'title': '第54話 B06-32④'}, {'chapter_url': 'http://www.1kkk.com/ch55-540686/', 'title': '第55話 B06-32⑤'}, {'chapter_url': 'http://www.1kkk.com/ch56-542516/', 'title': '第56話 交易①'}, {'chapter_url': 'http://www.1kkk.com/ch57-544193/', 'title': '第57話 交易②'}, {'chapter_url': 'http://www.1kkk.com/ch58-545650/', 'title': '第58話 判斷'}, {'chapter_url': 'http://www.1kkk.com/ch59-547841/', 'title': '第59話 任你挑選'}, {'chapter_url': 'http://www.1kkk.com/ch60-551884/', 'title': '第60話 金色池塘'}, {'chapter_url': 'http://www.1kkk.com/ch61-552877/', 'title': '第61話 活下去看看呀'}, {'chapter_url': 'http://www.1kkk.com/ch62-558935/', 'title': '第62話 不死之身的怪物'}, {'chapter_url': 'http://www.1kkk.com/ch63-559580/', 'title': '第63話 HELP'}, {'chapter_url': 'http://www.1kkk.com/ch64-559739/', 'title': '第64話 如果是我的話'}, {'chapter_url': 'http://www.1kkk.com/ch65-560418/', 'title': '第65話 SECRET.GARDEN'}, {'chapter_url': 'http://www.1kkk.com/ch66-563262/', 'title': '第66話 被禁止的游戲①'}, {'chapter_url': 'http://www.1kkk.com/ch67-563263/', 'title': '第67話 被禁止的游戲②'}, {'chapter_url': 'http://www.1kkk.com/ch68-566491/', 'title': '第68話 就是這么回事'}, {'chapter_url': 'http://www.1kkk.com/ch69-567669/', 'title': '第69話 想讓你見的人'}, {'chapter_url': 'http://www.1kkk.com/ch70-573812/', 'title': '第70話 試看版'}, {'chapter_url': 'http://www.1kkk.com/ch71-573813/', 'title': '第71話 試看版'}, {'chapter_url': 'http://www.1kkk.com/ch72-575487/', 'title': '第72話 試看版'}, {'chapter_url': 'http://www.1kkk.com/ch73-626152/', 'title': '第73話 頑起'}, {'chapter_url': 'http://www.1kkk.com/ch74-629319/', 'title': '第74話 特別的孩子'}, {'chapter_url': 'http://www.1kkk.com/ch75-629320/', 'title': '第75話 倔強的華麗'}, {'chapter_url': 'http://www.1kkk.com/ch76-629321/', 'title': '第76話 開戰(zhàn)'}, {'chapter_url': 'http://www.1kkk.com/ch77-629322/', 'title': '第77話 無知的雜魚們'}, {'chapter_url': 'http://www.1kkk.com/ch78-629323/', 'title': '第78話 新解決一雙'}, {'chapter_url': 'http://www.1kkk.com/ch79-629324/', 'title': '第79話 一箭必定'}, 
{'chapter_url': 'http://www.1kkk.com/ch80-629219/', 'title': '第80話 來玩游戲吧，大公！'}, {'chapter_url': 'http://www.1kkk.com/ch81-633406/', 'title': '第81話 死守'}, {'chapter_url': 'http://www.1kkk.com/ch82-633407/', 'title': '第82話 獵場的主人'}, {'chapter_url': 'http://www.1kkk.com/ch83-633409/', 'title': '第83話 穿越13年的答復'}, {'chapter_url': 'http://www.1kkk.com/ch84-633410/', 'title': '第84話 停'}, {'chapter_url': 'http://www.1kkk.com/ch85-633411/', 'title': '第85話 怎么辦'}, {'chapter_url': 'http://www.1kkk.com/ch86-633290/', 'title': '第86話 戰(zhàn)力'}, {'chapter_url': 'http://www.1kkk.com/ch87-633867/', 'title': '第87話 境界'}, {'chapter_url': 'http://www.1kkk.com/ch88-708386/', 'title': '第88話 一雪前恥'}, {'chapter_url': 'http://www.1kkk.com/ch89-709622/', 'title': '第89話 匯合'}, {'chapter_url': 'http://www.1kkk.com/ch90-710879/', 'title': '第90話 贏吧'}, {'chapter_url': 'http://www.1kkk.com/ch91-711639/', 'title': '第91話 把一切都'}, {'chapter_url': 'http://www.1kkk.com/ch92-715647/', 'title': '第92話'}, {'chapter_url': 'http://www.1kkk.com/ch93-720622/', 'title': '第93話 了斷'}, {'chapter_url': 'http://www.1kkk.com/ch94-739797/', 'title': '第94話 大家活下去'}, {'chapter_url': 'http://www.1kkk.com/ch95-750533/', 'title': '第95話 回去吧'}, {'chapter_url': 'http://www.1kkk.com/ch96-754954/', 'title': '第96話 歡迎回來'}, {'chapter_url': 'http://www.1kkk.com/ch97-755431/', 'title': '第97話 所期望的未來'}, {'chapter_url': 'http://www.1kkk.com/ch98-758827/', 'title': '第98話 開始的聲音'}, {'chapter_url': 'http://www.1kkk.com/ch99-764478/', 'title': '第99話 Khacitidala'}, {'chapter_url': 'http://www.1kkk.com/ch100-769132/', 'title': '第100話 到達'}, {'chapter_url': 'http://www.1kkk.com/ch101-774024/', 'title': '第101話 過來吧'}, {'chapter_url': 'http://www.1kkk.com/ch102-776372/', 'title': '第102話 找到寺廟！'}, {'chapter_url': 'http://www.1kkk.com/ch103-778911/', 'title': '第103話 差一步'}]

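Getting the image links with selenium
Define a function that collects the image info for every page in a chapter. Its argument is one dict from the list returned by get_chapter_list.
From the analysis above we know selenium is needed here, so before defining the function we initialize selenium, with options to skip image loading and to run headless for speed.
Before this, besides installing the required modules with pip, you also need the chromedriver that matches your Chrome version. A 64-bit system is backward compatible, so the 32-bit build works fine. Download: http://chromedriver.storage.googleapis.com/index.html.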

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images
chrome_options.add_argument('--headless')  # run without a visible window
browser = webdriver.Chrome(options=chrome_options)

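For efficiency, my current plan is not to download each image right after getting its link, but to persist the link info in a database. It can then be downloaded again at any time, without using selenium to collect the links from scratch for the same manga.
So I use MongoDB here. It also needs initializing before use; the title of the manga to download is held in a variable: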

import pymongo

CARTOON_NAME = "約定夢幻島"
client = pymongo.MongoClient("localhost", 27017)
db = client["1kkk_cartoon"]
collection = db[CARTOON_NAME]

With selenium and the database initialized, write the function that collects the page info:

import time
from selenium.webdriver.common.by import By

def get_page(chapter_dic):
    chapter_title = chapter_dic.get("title")
    chapter_url = chapter_dic.get("chapter_url")
    image_info = []
    browser.get(chapter_url)
    time.sleep(2)
    source = browser.page_source
    selector = Selector(text=source)
    # Total number of pages in the chapter
    total_page = selector.xpath("//div[@id='chapterpager']/a[last()]/text()").get()
    print(" ", chapter_title, "-- total pages:", total_page)
    # Click "next page" once per page
    for index in range(1, int(total_page) + 1):
        page_source = browser.page_source
        selector2 = Selector(text=page_source)
        image_url = selector2.xpath("//div[@id='cp_img']/img/@src").get()
        # If the network is unstable and image info goes missing, uncomment the
        # loop below to keep waiting until the link is available
        # while image_url is None:
        #     time.sleep(1)
        #     page_source = browser.page_source
        #     selector2 = Selector(text=page_source)
        #     image_url = selector2.xpath("//div[@id='cp_img']/img/@src").get()

        # Name the image by its index within the chapter
        f_name = str(index)
        # The "next page" link (find_element_by_xpath was removed in newer selenium releases)
        next_page = browser.find_element(By.XPATH, "//div[@class='container']/a[contains(text(),'下一頁')]")
        # Simulate a click on "next page"
        next_page.click()
        time.sleep(2)
        # Store the key info for this image in a dict for the later batch download.
        # Important: the chapter link is saved as the Referer. As the download step
        # shows, it is required; without it the request is treated as abnormal and
        # no image data comes back.
        page_info = {
            "chapter_title": chapter_title,
            'Referer': chapter_url,
            'img_url': image_url,
            'img_index': f_name
        }
        image_info.append(page_info)
        print(page_info)
        print("----- Fetched {}, page {} -----".format(chapter_title, index))
        # Save to the database
        collection.insert_one(page_info)
    # Everything is already in the database, so the return is optional; returning
    # lets a full run use the image info without reading it back from the database.
    return image_info
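The commented-out `while image_url is None` loop in get_page would wait forever if the image never appears. A bounded variant is sketched below; `wait_for` and its `timeout` and `interval` parameters are hypothetical names introduced here, not part of the original crawler.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() until it returns a non-None value or `timeout` seconds pass.

    Returns the value, or None if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for the selenium lookup: the "link" only appears on the third try.
    attempts = iter([None, None, "http://example.invalid/1.jpg"])
    print(wait_for(lambda: next(attempts), timeout=2.0, interval=0.01))
```

Inside get_page the `fetch` argument could be a lambda that re-reads `browser.page_source` and runs the `cp_img` XPath, so a page that never loads is skipped or reported instead of hanging the worker.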

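Running get_page() in multiple processes
Running the two functions above, get_chapter_list() and get_page(), together collects the details of every page in every chapter and stores them in the database.
To speed up the crawl I use a process pool, multiprocessing.Pool(). If multiprocessing is new to you, see my earlier articles or read up online; I won't go into it here.
Call get_chapter_list(start_url) to get the chapter info, then run get_page(chapter_dic) across the pool: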

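from multiprocessing import Pool

if __name__ == '__main__':
    # Run get_chapter_list(start_url) to get the list of chapter info
    chapter_list = get_chapter_list(start_url)
    # Create the pool; with no argument the number of processes defaults to the
    # CPU core count of the machine, e.g. Pool(4) on mine
    p = Pool()
    for chapter_dic in chapter_list:
        # Submit the task without blocking
        p.apply_async(get_page,(chapter_dic,)) # don't drop the comma: the arguments must be a tuple
    p.close()
    p.join()
    # Close the browser and release its resources
    browser.close()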
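One thing to watch with apply_async: an exception raised inside a worker is stored on the returned result object and only re-raised when get() is called on it, so a failing task otherwise passes silently. The sketch below keeps the result handles and collects errors per item. `run_all` is a hypothetical helper; it defaults to `ThreadPool` only so that the example runs anywhere, and `multiprocessing.Pool`, which exposes the same `apply_async`/`get` interface, can be passed as `pool_cls` instead (the function must then be a picklable top-level function).

```python
from multiprocessing.pool import ThreadPool


def run_all(func, items, pool_cls=ThreadPool):
    """Submit func(item) for every item and return (item, result, error) triples."""
    results = []
    with pool_cls() as pool:
        # Keep every AsyncResult handle instead of discarding it
        pending = [(item, pool.apply_async(func, (item,))) for item in items]
        for item, handle in pending:
            try:
                results.append((item, handle.get(), None))
            except Exception as exc:  # raised in the worker, re-raised by get()
                results.append((item, None, exc))
    return results


def flaky(n):
    """Toy task that fails for one input, standing in for a failing page fetch."""
    if n == 2:
        raise ValueError("boom")
    return n * 10


if __name__ == "__main__":
    for item, result, error in run_all(flaky, [1, 2, 3]):
        print(item, result, repr(error))
```

With the crawler, the failed items would be the chapters to retry or at least to log.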

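This yields dicts containing each image URL, the corresponding chapter link (Referer), the chapter title, and the page's index within the chapter, all stored in the database as well.
The run prints:

[Image: collecting the manga image info]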
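Because every stored record carries chapter_title, a re-run could skip chapters that are already in the database rather than driving selenium through them again. A minimal sketch follows; `chapters_to_fetch` is a hypothetical helper, and it assumes a chapter counts as done as soon as any of its pages is stored, which would miss chapters interrupted halfway.

```python
def chapters_to_fetch(chapter_list, stored_titles):
    """Return the chapter dicts whose title is not among the stored titles."""
    done = set(stored_titles)
    return [chapter for chapter in chapter_list if chapter["title"] not in done]


if __name__ == "__main__":
    chapters = [
        {"title": "ch1", "chapter_url": "http://example.invalid/ch1/"},
        {"title": "ch2", "chapter_url": "http://example.invalid/ch2/"},
    ]
    # With pymongo, the stored titles could come from
    # collection.distinct("chapter_title")
    print(chapters_to_fetch(chapters, ["ch1"]))
```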

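Downloading and saving the images
All the image info has been collected, so the download-and-save logic is simple:
1. Read the image info from the database, or use the return value of get_page directly.
2. Build the request with requests; it must carry the Referer. Save the image data.
  
  [Image: opening the image link directly in a browser does not return the correct image]
3. Create a top-level folder named after the manga and a subfolder per chapter title; name each downloaded image by its index and put it in its chapter folder.
4. For efficiency, the image download also uses multiple processes.
The implementation follows; here the image info is read from the database: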

import os
import requests
import pymongo
from multiprocessing import Pool

# Defined at module level so that the worker processes see the same value
CARTOON_NAME = "賢者之孫"

# Takes one page-info dict
def save_img(info_dict):
    chapter_title = info_dict.get('chapter_title')
    referer = info_dict.get('Referer')
    img_url = info_dict.get('img_url')
    f_name = info_dict.get('img_index')
    # Rebuild the headers: the Referer is required, otherwise the anti-scraping
    # check blocks the request and no data comes back
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        "Referer": referer
    }
    res = requests.get(img_url, headers=headers)
    if res.status_code == 200:
        img = res.content
        # ./ is the current directory
        path1 = "./%s" % CARTOON_NAME
        # Create the folder if it does not exist yet; exist_ok avoids an error
        # when another process creates it between the check and the call
        if not os.path.exists(path1):
            os.makedirs(path1, exist_ok=True)
            print("Created manga folder -- %s" % CARTOON_NAME)
        path2 = "./%s/%s" % (CARTOON_NAME, chapter_title)
        if not os.path.exists(path2):
            os.makedirs(path2, exist_ok=True)
            print("Created chapter folder -- %s" % chapter_title)
        # Save the image, named by its index
        with open("./%s/%s/%s.jpg" % (CARTOON_NAME, chapter_title, f_name), 'wb') as f:
            f.write(img)
        print("%s -- page %s saved" % (chapter_title, f_name))
    else:
        print("Failed to download this page")

if __name__ == '__main__':
    client = pymongo.MongoClient("localhost", 27017)
    db = client["1kkk_cartoon"]
    collection = db[CARTOON_NAME]
    # Read the page info from the database and convert it to a list
    infos = list(collection.find())
    p = Pool()
    for info in infos:
        p.apply_async(save_img, (info,))
    p.close()
    # Wait for all downloads to finish before the main process exits
    p.join()
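Two details of the naming scheme above may be worth hardening. Chapter titles such as '第12話 內鬼?' contain characters that Windows does not allow in directory names, and index names like 1, 2, 10 sort as 1, 10, 2 wherever names are compared as text. The sketch below is one possible approach, not the author's: `safe_name` and `page_path` are hypothetical helpers, and `total_pages` is an assumption, since the stored records do not include the page count.

```python
import os
import re

# Characters that Windows rejects in file and directory names
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_name(name):
    """Replace invalid characters with "_" and trim surrounding/trailing dots and spaces."""
    return _INVALID.sub("_", name).strip().rstrip(". ")


def page_path(cartoon, chapter_title, index, total_pages):
    """Build ./cartoon/chapter/NNN.jpg with the index zero-padded to a fixed width."""
    width = len(str(total_pages))
    filename = str(index).zfill(width) + ".jpg"
    return os.path.join(".", safe_name(cartoon), safe_name(chapter_title), filename)


if __name__ == "__main__":
    print(page_path("約定夢幻島", "第12話 內鬼?", 3, 120))
```

Before writing the file, `os.makedirs(os.path.dirname(path), exist_ok=True)` creates both folder levels in one call.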

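Running the download code above saves the images quickly in a structured layout. With that, the core of the downloader is complete.
We only need to refactor slightly and merge everything into one file. After merging, the two multiprocess stages stay the same:

Use multiple processes to save the image info to the database.

Once that is stored, read it back from the database automatically and download the images with multiple processes, saving them in a structured layout.

The complete refactored code: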

from scrapy.selector import Selector
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import re
from multiprocessing import Pool
import pymongo
import os

# Manga title, also used as the folder name and the collection name
CARTOON_NAME = "約定夢幻島"
client = pymongo.MongoClient("localhost", 27017)
db = client["1kkk_cartoon"]
collection = db[CARTOON_NAME]

chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images
chrome_options.add_argument('--headless')  # run without a visible window
browser = webdriver.Chrome(options=chrome_options)

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
# Main page of 約定夢幻島
start_url = "http://www.1kkk.com/manhua31328/"

def get_chapter_list(start_url):
    res = requests.get(start_url, headers=header)
    selector = Selector(text=res.text)
    items = selector.xpath("//ul[@id='detail-list-select-1']//li")
    # Holds the info for every chapter
    chapter_list = []
    for item in items:
        # Build the absolute link
        chapter_url = "http://www.1kkk.com" + item.xpath("./a/@href").get()
        # get() returns None when nothing matches, so fall back to "" before stripping
        title = (item.xpath("./a/text()").get() or "").rstrip()
        # If no title was found there, try the alternative location
        if not title:
            title = (item.xpath("./a//p/text()").get() or "").rstrip()
        dic = {
            "title": title,
            "chapter_url": chapter_url,
        }
        chapter_list.append(dic)
    # Put the chapters in ascending order
    chapter_list.reverse()
    total_len = len(chapter_list)
    print("\n[Found {} chapters in total]:\n{}".format(total_len, chapter_list))
    return chapter_list

def get_page(chapter_dic):
    chapter_title = chapter_dic.get("title")
    chapter_url = chapter_dic.get("chapter_url")
    image_info = []
    browser.get(chapter_url)
    time.sleep(2)
    source = browser.page_source
    selector = Selector(text=source)
    # Total number of pages in the chapter
    total_page = selector.xpath("//div[@id='chapterpager']/a[last()]/text()").get()
    print(" ", chapter_title, "-- total pages:", total_page)
    # Click "next page" once per page
    for index in range(1, int(total_page) + 1):
        page_source = browser.page_source
        selector2 = Selector(text=page_source)
        image_url = selector2.xpath("//div[@id='cp_img']/img/@src").get()
        # Wait longer when the page loads slowly
        # while image_url is None:
        #     time.sleep(1)
        #     page_source = browser.page_source
        #     selector2 = Selector(text=page_source)
        #     image_url = selector2.xpath("//div[@id='cp_img']/img/@src").get()
        # Name the image by its index within the chapter
        f_name = str(index)
        # The "next page" link
        next_page = browser.find_element(By.XPATH, "//div[@class='container']/a[contains(text(),'下一頁')]")
        # Simulate the click
        next_page.click()
        time.sleep(2)
        # Store the key info for this image in a dict for the later batch download
        page_info = {
            "chapter_title": chapter_title,
            'Referer': chapter_url,
            'img_url': image_url,
            'img_index': f_name
        }
        image_info.append(page_info)
        print("----- Fetched {}, page {} -----".format(chapter_title, index))
        # Save to the database
        collection.insert_one(page_info)
    return image_info

def save_img(info_dict):
    chapter_title = info_dict.get('chapter_title')
    referer = info_dict.get('Referer')
    img_url = info_dict.get('img_url')
    f_name = info_dict.get('img_index')
    # Rebuild the headers: the Referer is required, otherwise the anti-scraping
    # check blocks the request and no data comes back
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        "Referer": referer
    }
    res = requests.get(img_url, headers=headers)
    if res.status_code == 200:
        img = res.content
        # ./ is the current directory
        path1 = "./%s" % CARTOON_NAME
        # Create the folder if it does not exist yet; exist_ok avoids an error
        # when another process creates it between the check and the call
        if not os.path.exists(path1):
            os.makedirs(path1, exist_ok=True)
            print("Created manga folder -- %s" % CARTOON_NAME)
        path2 = "./%s/%s" % (CARTOON_NAME, chapter_title)
        if not os.path.exists(path2):
            os.makedirs(path2, exist_ok=True)
            print("Created chapter folder -- %s" % chapter_title)
        # Save the image, named by its index
        with open("./%s/%s/%s.jpg" % (CARTOON_NAME, chapter_title, f_name), 'wb') as f:
            f.write(img)
        print("%s -- page %s saved" % (chapter_title, f_name))
    else:
        print("Failed to download this page")

def main_info_to_database():
    chapter_list = get_chapter_list(start_url)
    # Create the pool; with no argument the number of processes defaults to the
    # CPU core count of the machine, e.g. Pool(4) on mine
    p = Pool()
    for chapter_dic in chapter_list:
        p.apply_async(get_page,(chapter_dic,))
    p.close()
    p.join()
    browser.close()

def main_download_from_database():
    collection = db[CARTOON_NAME]
    # Read the page info from the database and convert it to a list
    infos = list(collection.find())
    p = Pool()
    for info in infos:
        p.apply_async(save_img, (info,))
    p.close()
    # Wait for all downloads to finish before the main process exits
    p.join()

if __name__ == '__main__':
    main_info_to_database()
    main_download_from_database()

Result of the run:

[Image: manga title folder]

Directory structure after download:

[Image: chapter directories]


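[Image: pages inside a chapter directory]

With a local reader app, reading manga becomes a pleasure!
This crawler is not hard, but it involves several indispensable analysis steps and common scraping techniques: selenium for JS-rendered content, the Selector parsing library, the multiprocessing library, and reading and writing MongoDB.
Of course, if you don't want to study the code you can copy it and use it directly (provided the libraries and the webdriver are installed).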
—————————————————————————————————————————————————————
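That's it for now, I'm worn out. The search module and the handling of paid and age-restricted manga are placeholders for now; I'll fill them in after a rest~
Writing a crawler doesn't take long, but writing it up really does. Sob!
If this article helped, please leave a like or bookmark it, I beg you!!!
If anything is unclear or you have suggestions, feel free to leave a comment.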

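4. Adding the search module

The previous code only works when given a manga title and the full manga link.
A better approach is to mimic the site's own search: the user enters a manga title, and the program prints the manga details and downloads it automatically.

Analyzing the search request:

Searching from the home page and following the request makes it easy to find the search URL and its parameters:

[Image: search info]

[Image: request parameters]

In testing, removing the language: 1 parameter did not affect the returned data, so only the manga title needs to be passed, as title.
The final search URL is: http://www.1kkk.com/search?title={}
All that remains is to build the request and parse the response.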

# Search feature
def search(name):
    # Build the search URL from the given title
    search_url = "http://www.1kkk.com/search?title={}".format(name)
    print("Searching the site for the manga you entered: [{}], please wait...".format(name))
    res = requests.get(search_url, headers=header)
    if res.status_code == 200:
        # Parse the response, extract the manga info, and print it
        selector = Selector(text=res.text)
        title = selector.xpath("//div[@class='info']/p[@class='title']/a/text()").get()
        link = "http://www.1kkk.com" + selector.xpath("//div[@class='info']/p[@class='title']/a/@href").get()
        author = "|".join(selector.xpath("//div[@class='info']/p[@class='subtitle']/a/text()").getall())
        types = "|".join(selector.xpath("//div[@class='info']/p[@class='tip']/span[2]/a//text()").getall())
        block = selector.xpath("//div[@class='info']/p[@class='tip']/span[1]/span//text()").get()
        content = selector.xpath("//div[@class='info']/p[@class='content']/text()").get().strip()
        print("[Search complete]")
        print("Please confirm that the result below is correct:")
        print("-------------------------------------------------------------------------------------------------")
        print("Title:", title)
        print("Author:", author)
        print("Genre:", types)
        print("Status:", block)
        print("Summary:", content)
        print("-------------------------------------------------------------------------------------------------")
        print("Link for manga [%s]: %s" % (title,link))
        # The user checks the result and confirms whether to continue
        conf = input("Confirm download? Y/N: ")
        if conf.lower() != "y":
            print("Exiting. Thanks for using this tool, goodbye!")
            return None
        else:
            print("About to download: %s" % title)
            # Return the link to this manga
            return link
    else:
        print("Request failed")

Checking it on its own:

search("約定夢幻島")

Output:

Searching the site for the manga you entered: [約定夢幻島], please wait...
[Search complete]
Please confirm that the result below is correct:
-------------------------------------------------------------------------------------------------
Title: 約定的夢幻島
Author: 白井カイウ|出水ぽすか
Genre: 冒險|科幻|懸疑
Status: 連載中
Summary: 約定的夢幻島漫畫，媽媽說外面的世界好可怕，我不信；但是那一天、我深深地體會到了媽媽說的是真的！因為不僅外面的世界、就連媽媽也好可怕……
-------------------------------------------------------------------------------------------------
Link for manga [約定的夢幻島]: http://www.1kkk.com/manhua31328/
Confirm download? Y/N:

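Entering y or Y returns the manga link correctly, as required.
Now just replace the hard-coded start_url in the earlier code with a call to this function, i.e.
replace start_url = "http://www.1kkk.com/manhua31328/" with: start_url = search(CARTOON_NAME)
To download a different manga, only CARTOON_NAME needs to change; the search, the download, the directory names, and the database collection name all follow automatically.
That completes the basic search module.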
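One detail of the search URL: formatting the title straight into the query string works for plain titles, but a title containing characters such as & or = would change the query. A small sketch of building the URL with explicit encoding follows; `build_search_url` and `SEARCH_ENDPOINT` are names introduced here for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "http://www.1kkk.com/search"


def build_search_url(name):
    """Return the search URL with the title percent-encoded as a query parameter."""
    return "{}?{}".format(SEARCH_ENDPOINT, urlencode({"title": name}))


if __name__ == "__main__":
    print(build_search_url("one piece"))
    print(build_search_url("約定夢幻島"))
```

With requests, passing `params={"title": name}` to `requests.get(SEARCH_ENDPOINT, ...)` should achieve the same encoding without building the string by hand.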

Placeholder 2: handling paid manga

pass

Placeholder 3: handling age-restricted manga

pass