Downloading a Biquge (筆趣閣) novel with urllib and pyquery


author: sunny
title: Downloading a Biquge (筆趣閣) novel with urllib and pyquery
date: 2018-09-25 14:28:04
categories: Programming
tags: python


I. Getting the novel's chapter paths

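1. The novel being scraped is 摸金天師, and its index page is http://www.biquge.com.tw/18_18128/. urllib.request.urlopen returns an HTTPResponse object for the page, and that object's read() method then returns the page content.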

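import urllib.error
import urllib.request

def request(url, headers):
    # Build the request with custom headers (e.g. a User-Agent)
    req = urllib.request.Request(url, headers=headers)
    try:
        content = urllib.request.urlopen(req)
        # The pages are decoded as GBK
        text = str(content.read(), encoding='gbk')
        content.close()
        return text
    except urllib.error.URLError as e:
        # Report the failure and return an empty page
        print(e.reason)
        return ''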
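
The decoding step can be tried without touching the network. The following is a minimal sketch, assuming a hypothetical helper named decode_page (not part of the original script) and assuming that replacing undecodable bytes is preferable to raising an exception:

```python
def decode_page(raw: bytes, encoding: str = 'gbk') -> str:
    """Decode raw page bytes; undecodable bytes are replaced instead of raising."""
    return raw.decode(encoding, errors='replace')


# Simulate the bytes that read() would return for a GBK-encoded page.
sample = '摸金天師'.encode('gbk')
print(decode_page(sample))
```

With errors='replace', a single malformed byte in one chapter no longer aborts the whole download, at the cost of a replacement character appearing in the text.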

2. Inspect the chapter elements to get the chapter paths

[Image: chapters]
def get_all_chapter(self):
    html = self.request(self.url)
    doc = pq(html)
    all_chapters = doc('#list a').items()
    for a in all_chapters:
        text = a.text()
        href = self.domain + a.attr('href')
        self.chapter_titles.append(text)
        self.chapter_urls.append(href)
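
Concatenating self.domain and the href works as long as every link is a root-relative path. As a hedged alternative, urllib.parse.urljoin from the standard library resolves both root-relative and page-relative links. This is a sketch only: the helper name chapter_url and the file name 1.html are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin


def chapter_url(index_url: str, href: str) -> str:
    """Resolve a chapter link against the URL of the index page."""
    return urljoin(index_url, href)


index_url = 'http://www.biquge.com.tw/18_18128/'
# Root-relative link (hypothetical file name)
print(chapter_url(index_url, '/18_18128/1.html'))
# Page-relative link (hypothetical file name)
print(chapter_url(index_url, '1.html'))
```

Both calls resolve to the same absolute URL, so the code does not need to know which style of link the page uses.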

3. Inspect the page elements to get the content of each chapter

[Image: content]
def get_content(self, url):
    html = self.request(url)
    doc = pq(html)
    content = doc('#content').text()
    content = content.replace('\xa0'*4, '\n\n')
    return content
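
The replace call above turns every run of four non-breaking spaces (\xa0) into a blank line, presumably because the site uses them as a paragraph indent. A small sketch of that step on a made-up sample string (the function name clean_content and the sample text are illustrative assumptions):

```python
def clean_content(text: str) -> str:
    """Replace each run of four non-breaking spaces with a blank line."""
    return text.replace('\xa0' * 4, '\n\n')


# Made-up sample: two paragraphs, each introduced by four non-breaking spaces.
sample = '\xa0' * 4 + 'First paragraph.' + '\xa0' * 4 + 'Second paragraph.'
print(repr(clean_content(sample)))
```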

4. Write the chapter to a txt file

def write(self, name, path, txt):
    # Append the chapter title and body to the output file
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.write(txt)
        f.write('\n\n')
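
Because the file is opened in append mode ('a'), each call adds one chapter to the end of the same file. The following sketch shows this with a temporary directory; the standalone function append_chapter and the sample titles are assumptions made for illustration, not part of the original class:

```python
import os
import tempfile


def append_chapter(path: str, title: str, text: str) -> None:
    """Append one chapter (title, body, blank line) to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(text)
        f.write('\n\n')


# Write two sample chapters to a throwaway file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'novel.txt')
    append_chapter(path, 'Chapter 1', 'First body')
    append_chapter(path, 'Chapter 2', 'Second body')
    with open(path, encoding='utf-8') as f:
        result = f.read()

print(repr(result))
```

One consequence of append mode is that running the downloader twice against the same path duplicates every chapter, so the output file should be removed before a fresh run.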

II. Complete code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# author: sunny

import urllib.error
import urllib.request
from pyquery import PyQuery as pq
import sys, random, time

class DownloadNovel():
    def __init__(self, url):
        self.url = url
        self.chapter_urls = []
        self.chapter_titles = []
        self.domain = 'http://www.biquge.com.tw'
        self.sleep_download_time = 5
    def get_all_chapter(self):
        html = self.request(self.url)
        doc = pq(html)
        all_chapters = doc('#list a').items()
        for a in all_chapters:
            text = a.text()
            href = self.domain + a.attr('href')
            self.chapter_titles.append(text)
            self.chapter_urls.append(href)
    def get_content(self, url):
        html = self.request(url)
        doc = pq(html)
        content = doc('#content').text()
        content = content.replace('\xa0'*4, '\n\n')
        return content
    def write(self, name, path, txt):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.write(txt)
            f.write('\n\n')
    def request(self, url):
        time.sleep(self.sleep_download_time)
        # Pick a random User-Agent for each request
        user_agent_list = [ \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", \
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        ua = random.choice(user_agent_list)
        headers = {
            'User-Agent': ua
        }
        request = urllib.request.Request(url, headers=headers)
        try:
            content = urllib.request.urlopen(request)
            text = str(content.read(), encoding = 'gbk')
            content.close()
            return text
        except urllib.error.URLError as e:
            print(e.reason)
            return ''

if __name__ == '__main__':
    dl = DownloadNovel('http://www.biquge.com.tw/18_18128/')
    dl.get_all_chapter()
    total = len(dl.chapter_titles)
    for i in range(total):
        print('url=%s, title=%s' % (dl.chapter_urls[i], dl.chapter_titles[i]))
        txt = dl.get_content(dl.chapter_urls[i])
        dl.write(dl.chapter_titles[i], '摸金天師.txt', txt)
        # Show progress as a percentage of chapters finished
        sys.stdout.write('  Downloaded: %.3f%%' % (100 * (i + 1) / total) + '\r')
        sys.stdout.flush()
    print('Download complete')
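
The request method in the listing above chooses a random User-Agent for every call. That selection can be isolated and checked on its own. This is a minimal sketch, assuming a hypothetical helper build_headers and a list shortened to two of the entries from the listing:

```python
import random

# Two entries taken from the listing above; the full list works the same way.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
]


def build_headers(user_agents, rng=random):
    """Return request headers with one randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(user_agents)}


# A seeded generator makes the choice repeatable, which helps when testing.
headers = build_headers(USER_AGENTS, random.Random(0))
print(headers['User-Agent'] in USER_AGENTS)
```

Passing the random generator as a parameter is a design choice for testability; the original script simply calls random.choice directly.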

III. Result

[Image: result]

IV. Source code

Source code link
