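Note: "all of a site's images" here means the useful images on the site; ads, recommendation slots, and the like are excluded.
I'm a beginner finding my way; old hands, feel free to skip this.

Step 1: find the pattern in the site's paginated URLs

Pick the category you want to scrape (if you want every image you can skip this, since the default listing shows all of them; adapt the exact steps to the site you're working with).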

QQ截圖20190620144258.png

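The URL shown in the address bar:

QQ截圖20190620144349.png

Now look at how the URL changes when you page through the listing:

QQ截圖20190620144417.png

The URL shown in the address bar:

From this we can see that the pagination parameter is page/<page number>.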
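As a small sketch of that pattern, listing URLs can be generated from a template. The template string is the one used in the code later in this post; the helper name `listing_url` is made up for illustration.

```python
# URL template taken from the code later in this post.
LISTING = 'http://www.jitaotu.com/tag/meitui/page/{}'

def listing_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return LISTING.format(page)

first_three = [listing_url(n) for n in range(1, 4)]
for url in first_three:
    print(url)
```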

Step 2: get the total page count and request each URL

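There are several ways to work out how many pages there are:
1. The most direct: read it off the page in the browser.
2. Count how many galleries a full page holds. Say it's 15; any page with fewer than 15 must be the last one (though the last page may also hold exactly 15).
3. Much like the first, except that a program reads the total page count.
4. Ignore the page count and catch HTTP errors: once a GET request comes back 404, everything has been scraped.
5. Look for a "next page" link on each page; when there isn't one, scraping is done (but listings with little data may never have that link at all, which is awkward).
This post uses the third method.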
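For comparison, the fourth approach (keep going until the first 404) might look roughly like this. It is only a sketch: `pages_until_404` and `fake_status` are hypothetical names, and the status lookup is passed in as a function so the example runs without touching the network.

```python
from typing import Callable, Iterator

def pages_until_404(get_status: Callable[[int], int],
                    max_pages: int = 10000) -> Iterator[int]:
    """Yield page numbers 1, 2, 3, ... and stop at the first page that answers 404."""
    for page in range(1, max_pages + 1):
        if get_status(page) == 404:
            return
        yield page

# Stand-in for a real site: pages 1..5 exist, everything after that is a 404.
def fake_status(page: int) -> int:
    return 200 if page <= 5 else 404

found = list(pages_until_404(fake_status))
print(found)
```

Against a real site, `get_status` could be something like `lambda p: requests.get(url.format(p)).status_code`. The `max_pages` cap is there so a site that never returns 404 cannot keep the loop running forever.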
The total page count can be read off the screenshot.

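Reading the page count in code
As the screenshot above shows, the pagination links sit inside a single div. Normally it holds "previous page", 1 2 3 4 5 ... 101, and "next page", nine items in all, so the second-to-last item is the total page count.
Use XPath to locate the element.
Right-click the element and choose Copy > Copy XPath.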

QQ截圖20190620150800.png

Then open an XPath extension for Chrome and paste in the XPath you just copied.

QQ截圖20190620150845.png

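You can see it returns the total page count, 101. We said the second-to-last link is the total, so what we fetch is the whole list of links rather than one fixed position: the number of pagination links changes from page to page, but the total is always the last page number in the list (this is based on the site I tested; always go by what the real site does).
Fetch the list of pagination links: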

QQ截圖20190620150859.png

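Call the method that looks up the total page count; the second-to-last value in the list is the total.

QQ截圖20190620152157.png

Then add a check that breaks out of the loop once the page number exceeds the total.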
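A minimal sketch of that check, using made-up link texts in place of a real XPath result. Two things differ from the post's code and are my own assumptions: the total is taken as the largest numeric label (instead of indexing `[-2]`), and it is converted to `int`, because XPath returns strings and Python 3 refuses to compare an `int` with a `str`.

```python
def total_pages(labels):
    """Return the largest numeric label among the pagination link texts."""
    numbers = [int(text) for text in labels if text.strip().isdigit()]
    if not numbers:
        raise ValueError('no numeric page labels found')
    return max(numbers)

# Sample link texts as the pagination XPath might return them (not scraped data).
labels = ['Previous', '1', '2', '3', '4', '5', '...', '101', 'Next']
total = total_pages(labels)

visited = []
page = 1
while True:
    visited.append(page)      # a real crawler would request the page here
    page += 1
    if page > total:          # int compared with int
        break

print(total, len(visited))
```

Taking the maximum keeps working when the "previous page" or "next page" link is missing, which is exactly the case where a fixed index breaks.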

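With the page count sorted out, on to scraping the images.

Each listing page holds 10 galleries; use XPath to collect the links to those 10 galleries and request each one.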

QQ截圖20190620152830.png

There are four steps in all.
Step one: request a listing page and read how many pages there are.

QQ截圖20190620183133.png

Step two: collect the detail-page links for the 10 galleries on the page.

QQ截圖20190620183139.png

Step three: request the first gallery's detail page to find out how many images it has.

![QQ截圖20190620183154.png](https://upload-images.jianshu.io/upload_images/18295040-d11db768fd21ecf0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

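Step four: request every page of the gallery's detail view and save the image.

QQ截圖20190620183154.png

These could really be merged into two steps; written this way I make two extra requests. I was too lazy to change it, so rework it yourself if you're interested.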
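The detail-page URLs requested in step four follow a fixed pattern (gallery key, underscore, page index), so they can be generated up front. Here is a rough sketch of that construction. The template comes from the post's code; the helper names and the sample link are hypothetical, and parsing the key from the file name is my substitute for the post's `imgde[-10:-5]` slice, which only works for five-character keys.

```python
import posixpath
from urllib.parse import urlparse

# Template taken from the post's code (attribute self.deatil).
DETAIL = 'http://www.jitaotu.com/xinggan/{}'

def gallery_key(href):
    """Extract the gallery key from a link, e.g. '.../68913.html' -> '68913'."""
    filename = posixpath.basename(urlparse(href).path)
    return filename.rsplit('.', 1)[0]

def detail_urls(href, count):
    """Build the URLs of all `count` detail pages of one gallery."""
    key = gallery_key(href)
    return [DETAIL.format('{}_{}.html'.format(key, i)) for i in range(1, count + 1)]

# Hypothetical gallery link, used only to show the shape of the result.
urls = detail_urls('http://www.jitaotu.com/xinggan/68913.html', 3)
for url in urls:
    print(url)
```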

QQ截圖20190620183420.png

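I've rambled on a lot, and none of that was the useful part. The useful part starts now: multithreading.

Depending on your connection, downloading one image is no problem, but what about 100, or a million?
In the whole HTTP process the slowest step is requesting the image URL and saving the file, because image files are large (stating the obvious).
Say a gallery has 70 images and saving one takes 3 seconds: done one after another, that's 70 × 3 = 210 seconds. With one thread per image the downloads overlap, so saving 100 takes about as long as saving one, roughly 3 seconds, as long as your bandwidth and the server keep up (I won't go into the theory, everyone knows it; on to the code).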
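To see the effect without downloading anything, here is a small timing sketch. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library rather than the hand-rolled threads used later in this post, and `fake_save` merely sleeps to stand in for one slow download; both are choices made for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_save(index, delay=0.05):
    """Stand-in for downloading and saving one image: sleeps instead of doing I/O."""
    time.sleep(delay)
    return index

def timed(func):
    """Run func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

items = range(20)

# One after another: roughly 20 * 0.05 = 1 second.
serial_result, serial_time = timed(lambda: [fake_save(i) for i in items])

# One thread per item: the sleeps overlap, so it takes roughly one delay.
with ThreadPoolExecutor(max_workers=20) as pool:
    threaded_result, threaded_time = timed(lambda: list(pool.map(fake_save, items)))

print('serial %.2fs, threaded %.2fs' % (serial_time, threaded_time))
```

Real downloads won't scale this cleanly, since bandwidth and the server's own limits cap the gain.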

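Create the queue

QQ截圖20190620184013.png

Build the detail-page URLs, put them on the queue, and start the threads (a single for loop could do both, but when I tried, it kept adding too many items, so I fill the queue and start the threads separately; feel free to try it with one loop)

QQ截圖20190620184041.png

When each thread finishes, it marks its queue item as done with task_done()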

QQ截圖20190620184121.png
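The queue pattern described above can be sketched on its own, without any network code. This is one possible variant, not the code from the screenshots: it uses a fixed number of worker threads instead of one thread per URL, and `run_workers` and `fake_download` are hypothetical names. The point it illustrates is that `task_done()` sits in a `finally` block, so `queue.join()` returns even when a download fails.

```python
import threading
from queue import Queue

def run_workers(urls, handle, num_workers=4):
    """Process every URL with a fixed pool of threads fed from a Queue."""
    queue = Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = queue.get()
            try:
                if url is None:          # sentinel: no more work for this thread
                    return
                value = handle(url)
                with lock:
                    results.append(value)
            except Exception as exc:
                print('skipping', url, exc)
            finally:
                queue.task_done()        # runs on success, failure, and sentinel

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        queue.put(url)
    for _ in threads:
        queue.put(None)                  # one sentinel per worker
    queue.join()                         # wait until every item is marked done
    for t in threads:
        t.join()
    return results

def fake_download(url):
    """Stand-in for a real request; fails for one URL to exercise the error path."""
    if url.endswith('bad'):
        raise IOError('simulated timeout')
    return len(url)

done = run_workers(['a_1.html', 'a_2.html', 'bad', 'a_3.html'], fake_download)
print(sorted(done))
```

A fixed pool also caps how many requests hit the site at once, which one thread per image does not.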

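(Because Mac and Windows write paths differently, I didn't add code that creates the folder automatically when it's missing, so before running, create an imgs folder in the same directory as the script.)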
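If you would rather have the folder created automatically, `pathlib` handles path separators the same way on both systems. A minimal sketch, assuming a hypothetical helper `save_bytes` and random `uuid4` file names in place of the random letters used in the post's code:

```python
import os
import tempfile
import uuid
from pathlib import Path

def save_bytes(data, folder='imgs', suffix='.jpg'):
    """Write data to a uniquely named file, creating the folder if needed."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    path = target_dir / (uuid.uuid4().hex + suffix)
    path.write_bytes(data)
    return path

# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b'not a real image', folder=os.path.join(tmp, 'imgs'))
    ok = saved.exists() and saved.read_bytes() == b'not a real image'
    print(saved.name, ok)
```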
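Now compare the speed.

Multithreading only pays off when the site doesn't limit how often you can request it. Small sites and shady sites generally have no such limit, so you know what that means!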

QQ截圖20190620190027.png

Both runs scrape the same gallery.

QQ截圖20190620190256.png

Plain version

import requests
from lxml import etree
import random
import threading
from  time import sleep
from queue import Queue
class ImgSpider() :
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.deatil = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                        }
        self.url_queue = Queue()

    def pages(self):
        # Total number of listing pages
        response = requests.get(self.urls.format(2))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        # xpath() returns strings; convert so run() can compare it with an int
        return int(page[-2])

    def html_list(self, page):
        # Links to the galleries on one listing page
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')
        return page

    def detail_page(self, imgde):
        # Number of images in the gallery
        response = requests.get(self.deatil.format(imgde))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]
    def detail_list(self, imgde, page):
        # Build the URL of each image detail page
        # Slice the gallery key out of the link
        urls = imgde[-10:-5]
        print('Requesting the image pages, extracting the image URLs and saving them')
        for i in range(int(page)):
            print(self.deatil.format(urls+'_'+str(i+1)+'.html'))
            response = requests.get(self.deatil.format(urls+'_'+str(i+1)+'.html'))
            strs = response.content.decode()
            html = etree.HTML(strs)
            imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
            # Save the image
            self.save_img(imgs)

    def save_img(self, imgs):
        print(imgs[0])
        response = requests.get(imgs[0], headers=self.headers)
        strs = response.content
        # Join the sampled characters; str() on the list would put brackets and quotes in the file name
        name = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 10))
        with open("./imgs/" + name + ".jpg", "wb") as f:
            f.write(strs)
        print("Image saved")
        return
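    def run(self):
        page = 1
        # Get the total number of pages
        pageall = self.pages()
        print('Total pages: ' + str(pageall))
        while True:
            print('Requesting page ' + str(page))
            # Request the listing page and collect the detail-page links of its 10 galleries
            html_list = self.html_list(page)
            # Visit each gallery's detail page
            s = 1
            for htmls in html_list:
                print('Page ' + str(page) + ', gallery ' + str(s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch each image URL
                print('Page ' + str(page) + ', gallery ' + str(s) + ' has ' + str(imgdetalpage) + ' images')
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            # int() guards against comparing an int with the string that XPath returns
            if page > int(pageall):
                print('Scraping finished, leaving the loop')
                return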

if __name__ == '__main__':
    Imgs = ImgSpider()
    Imgs.run()

Multithreaded version

import requests
from lxml import etree
import random
import threading
from  time import sleep
from queue import Queue
class ImgSpider() :
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.deatil = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                        }
        self.url_queue = Queue()

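    def pages(self):
        # Total number of listing pages
        response = requests.get(self.urls.format(2))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        # xpath() returns strings; convert so run() can compare it with an int
        return int(page[-2])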

    def html_list(self, page):
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')
        return page

    def detail_page(self, imgde):
        # Number of images in the gallery
        response = requests.get(self.deatil.format(imgde))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]
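    def detail_list(self, imgde, page):
        # Slice the gallery key out of the link
        urls = imgde[-10:-5]
        print('Requesting the image pages, extracting the image URLs and saving them')
        # First put every detail-page URL on the queue
        for i in range(int(page)):
            print(self.deatil.format(urls + '_' + str(i + 1) + '.html'))
            urlss = self.deatil.format(urls + '_' + str(i + 1) + '.html')
            self.url_queue.put(urlss)
        # Then start one worker thread per queued URL
        for i in range(int(page)):
            t_url = threading.Thread(target=self.More_list)
            # t_url.setDaemon(True)
            t_url.start()
        # Block until every queued URL has been marked done
        self.url_queue.join()
        print('Gallery finished, moving on to the next one')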
    def More_list(self):
        # Worker: take one detail-page URL off the queue and save its image
        urls = self.url_queue.get()
        try:
            response = requests.get(urls, timeout=10)
            strs = response.content.decode()
            html = etree.HTML(strs)
            imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
            # Save the image
            self.save_img(imgs)
        except Exception as e:
            print('Request failed or timed out, skipping:', e)
        finally:
            # Always mark the item done, otherwise url_queue.join() blocks forever
            self.url_queue.task_done()
    def save_img(self, imgs):
        print(imgs[0])
        response = requests.get(imgs[0], headers=self.headers, timeout=10)
        strs = response.content
        # Join the sampled characters; str() on the list would put brackets and quotes in the file name
        name = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 10))
        # Same imgs folder the post asks you to create before running
        with open("./imgs/" + name + ".jpg", "wb") as f:
            f.write(strs)
        print("Image saved")
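    def run(self):
        page = 1
        # Get the total number of pages
        pageall = self.pages()
        print('Total pages: ' + str(pageall))
        while True:
            print('Requesting page ' + str(page))
            # Request the listing page and collect the detail-page links of its 10 galleries
            html_list = self.html_list(page)
            # Visit each gallery's detail page
            s = 1
            for htmls in html_list:
                print('Page ' + str(page) + ', gallery ' + str(s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch each image URL
                print('Page ' + str(page) + ', gallery ' + str(s) + ' has ' + str(imgdetalpage) + ' images')
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            # int() guards against comparing an int with the string that XPath returns
            if page > int(pageall):
                print('Scraping finished, leaving the loop')
                return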

if __name__ == '__main__':
    Imgs = ImgSpider()
    Imgs.run()

If anything is unclear, feel free to ask me; I'm a beginner too, so let's compare notes. QQ: 1341485724
