Requirements

Scrape a novel's title and the text of every chapter from a web page, and save them to a txt file.

The example used here is https://www.hongxiu.com/book/14088898305054404#Catalog

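Approach

1. Request the page with requests to get its source code.
2. Parse it with bs4 and extract two things from the source: the title of the work, and the name and link of each chapter.
3. Request each chapter link in turn with requests, parse it with bs4, and extract the content of each chapter.
4. Save the extracted data to a txt file.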

Hands-on code

1. Because there is a fair amount of code, it is split into blocks to make it easier to write and read. The four functions below are defined for this.


def get_url_info(url):  # request the page with requests and parse it with bs4
    ...

def get_name(url):  # get the novel's title
    ...

def get_catalog(url):  # get the chapter catalog: 1. chapter name, 2. chapter link
    ...

def get_text(url):  # get the text of a chapter
    ...

2. get_url_info() requests a page and parses it. Because pages are requested many times, this step gets its own function. The code was introduced in "Python Crawler Series (3) - Web Page Data Parsing (bs4, lxml, and the json library)".

import requests
from bs4 import BeautifulSoup as bs

def get_url_info(url):  # request the page with requests and parse it with bs4
    res = requests.get(url)
    html = res.text
    soup = bs(html, "lxml")
    return soup
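
Because get_url_info() runs once per chapter, a single transient network error can interrupt a long download. The sketch below shows one possible way to wrap any fetch function with a few retries. The names fetch_with_retry and flaky_fetch, and the parameters attempts and delay, are hypothetical and introduced here only for illustration; they are not part of the original program.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times.

    `fetch` is any callable that takes a URL, for example get_url_info.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # pause before the next try
    raise last_error


# Stand-in for a network call (hypothetical): fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page for " + url


result = fetch_with_retry(flaky_fetch, "http://example.com", delay=0)
print(result)
```

In the downloader itself the call would look like fetch_with_retry(get_url_info, url).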

3. get_name(url) gets the novel's title.

In the page source, the title is wrapped in an <em> tag inside an <h1> tag, so .find_all("h1")[1].find("em").text extracts it.

[Figure: page source for the title]

def get_name(url):  # get the novel's title
    soup = get_url_info(url)
    name = soup.find_all("h1")[1].find("em").text
    # find_all() returns a list-like result; taking the second <h1> and reading .text gives a str.
    # At this point the title includes the author, e.g. "一個人的外貿江湖 Tessly 著".
    return name

因?yàn)樯厦娑x了 get_url_info()函數(shù)，所以可以直接調(diào)用而不需要再寫request那些代碼。

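4. get_catalog() gets the novel's chapter catalog, that is, each chapter's name and link.
Using the same method as above, the source for the chapter names and links can be inspected:

[Figure: page source for the chapter names and links]

They are wrapped in <li> tags inside a <ul> tag. .find_all('ul')[3].find_all('li') first selects the information for all chapters, and a for loop then extracts each chapter's name and link and stores them in the catalog list.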

def get_catalog(url):  # get the chapter catalog: 1. chapter name, 2. chapter link
    soup = get_url_info(url)
    catalog_data = soup.find_all('ul')[3].find_all('li')  # information for all chapters
    catalog = []
    for i in catalog_data:
        data = {
            "title": i.text,  # chapter name
            "link": "http:" + i.find('a').get('href'),  # chapter link
        }
        catalog.append(data)
    return catalog
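
Prefixing "http:" to the href presumably works because the links on this page start with "//". As a hedged alternative, the standard library's urllib.parse.urljoin resolves such links, and other link forms, against the page URL. The helper name absolute_link and the chapter paths below are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed base page; the hrefs below are made-up examples of the forms a link can take.
BASE_URL = "https://www.hongxiu.com/book/14088898305054404"


def absolute_link(href, base=BASE_URL):
    """Resolve href against base, whatever form href takes."""
    return urljoin(base, href)


protocol_relative = absolute_link("//www.hongxiu.com/chapter/1/2")
root_relative = absolute_link("/chapter/1/2")
already_absolute = absolute_link("https://www.hongxiu.com/chapter/1/2")

print(protocol_relative)
print(root_relative)
print(already_absolute)
```

All three calls resolve to the same absolute URL, so the catalog code would not need to know which form the site uses.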

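5. get_text() gets the text of a chapter. With each chapter's link obtained in step 4, the chapter content can now be extracted.

The source for the chapter text looks like this:

[Figure: page source for the chapter text]

The text is wrapped in a <div> tag with class_="read-content j_readContent", and .find_all('div',class_="read-content j_readContent")[0].text retrieves it directly.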

def get_text(url):  # get the text of a chapter
    soup = get_url_info(url)
    novel = soup.find_all('div', class_="read-content j_readContent")[0].text
    novel = '\n    ' + '\n\n    '.join(novel.split())  # indent and space out paragraphs for readability
    return novel
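
The split() and join() line deserves a closer look: split() with no arguments cuts on every run of whitespace, which suits Chinese text that has no spaces inside a paragraph but would break an English paragraph into single words. The sketch below shows a line-based variant, assuming each paragraph sits on its own line in the extracted text. The function name format_paragraphs and the sample strings are illustrative only.

```python
def format_paragraphs(raw_text, indent="    "):
    """Indent each non-empty line of raw_text and separate paragraphs with a blank line."""
    paragraphs = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "\n" + indent + ("\n\n" + indent).join(paragraphs)


# Chinese sample: both approaches give the same result.
sample = "\u3000\u3000第一段。\n\u3000\u3000第二段。\n"
original_style = "\n    " + "\n\n    ".join(sample.split())
print(original_style == format_paragraphs(sample))

# English sample: the line-based variant keeps each sentence intact.
english = "First paragraph here.\nSecond one."
print(format_paragraphs(english))
```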

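6. Finally, the main program calls the functions above to extract the required content and writes it straight to a txt file.

The file is written with with open(...) and .write().

try...except keeps the program from stopping when an error occurs.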

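if __name__ == '__main__':

    name = get_name(start_url)  # novel title
    print("Please wait, preparing the download...")
    print("Saving to the program's directory: " + name + '.txt')
    for chapter in get_catalog(start_url):
        try:
            # append to a local txt file named after the novel
            with open(name + '.txt', 'a+', encoding='utf-8') as f:
                print("Downloading: " + chapter["title"])
                f.write(name + '\n\n')  # write the novel title (reuses name instead of requesting the page again)
                f.write(chapter["title"] + '\n')  # write the chapter name
                f.write(get_text(chapter["link"]) + '\n')  # write the chapter content
                f.write('\n=====================divider====================\n\n')
        except Exception as e:
            # report the failure and carry on with the next chapter
            print("Failed: " + chapter["title"] + " (" + str(e) + ")")
    print("Download complete")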
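
Two things can trip up the write step: a title may contain characters that are not allowed in file names, and the platform's default encoding may not be able to represent Chinese text. The sketch below illustrates one way to guard against both. The helpers safe_filename and append_chapter, and the sample data, are hypothetical names used only for this example.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that Windows does not allow in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def append_chapter(path, title, body):
    """Append one chapter to the file at path, always encoded as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")
        f.write(body + "\n\n")


# Demo with made-up data, written to a temporary directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, safe_filename("My: Novel?") + ".txt")
append_chapter(out_path, "Chapter 1", "First chapter text.")
append_chapter(out_path, "Chapter 2", "Second chapter text.")

with open(out_path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

Opening the file in append mode for each chapter means that chapters already written stay on disk if a later chapter fails.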

Combining steps 1-6 above gives a one-click novel download:

[Figure: final result]

Key takeaways

1. Requesting pages with requests, mainly the requests.get() method.
2. Parsing pages and extracting data with bs4, mainly the .find_all() method.
3. The basic structure of a web page, such as tags and attributes.
4. Python basics such as defining functions, for loops, lists, and dictionaries.

Full code

import requests
from bs4 import BeautifulSoup as bs

start_url = input("Enter the link of the novel you want to download: ") + "#Catalog"


def get_url_info(url):  # request the page with requests and parse it with bs4
    res = requests.get(url)
    html = res.text
    soup = bs(html, "lxml")
    return soup


def get_name(url):  # get the novel's title
    soup = get_url_info(url)
    name = soup.find_all("h1")[1].find("em").text
    # find_all() returns a list-like result; taking the second <h1> and reading .text gives a str.
    # At this point the title includes the author, e.g. "一個人的外貿江湖 Tessly 著".
    return name


def get_catalog(url):  # get the chapter catalog: 1. chapter name, 2. chapter link
    soup = get_url_info(url)
    catalog_data = soup.find_all('ul')[3].find_all('li')  # information for all chapters
    catalog = []
    for i in catalog_data:
        data = {
            "title": i.text,  # chapter name
            "link": "http:" + i.find('a').get('href'),  # chapter link
        }
        catalog.append(data)
    return catalog


def get_text(url):  # get the text of a chapter
    soup = get_url_info(url)
    novel = soup.find_all('div', class_="read-content j_readContent")[0].text
    novel = '\n    ' + '\n\n    '.join(novel.split())  # indent and space out paragraphs for readability
    return novel


if __name__ == '__main__':

    name = get_name(start_url)  # novel title
    print("Please wait, preparing the download...")
    print("Saving to the program's directory: " + name + '.txt')
    for chapter in get_catalog(start_url):
        try:
            # append to a local txt file named after the novel
            with open(name + '.txt', 'a+', encoding='utf-8') as f:
                print("Downloading: " + chapter["title"])
                f.write(name + '\n\n')  # write the novel title (reuses name instead of requesting the page again)
                f.write(chapter["title"] + '\n')  # write the chapter name
                f.write(get_text(chapter["link"]) + '\n')  # write the chapter content
                f.write('\n=====================divider====================\n\n')
        except Exception as e:
            # report the failure and carry on with the next chapter
            print("Failed: " + chapter["title"] + " (" + str(e) + ")")
    print("Download complete")

input("Press Enter to exit")

