Scraping Images with Scrapy

Goal: scrape the images from the image site http://hunter-its.com

1. Create the project beauty

scrapy startproject beauty

2. cd into the project directory and create a new spider from the basic template

cd beauty

scrapy genspider hunter hunter-its.com


3. Open the project in PyCharm and write the item first

Open items.py and define the name and address fields

import scrapy

class BeautyItem(scrapy.Item):

    name = scrapy.Field()
    address = scrapy.Field()


4. Write the spider file

Import the BeautyItem class defined earlier, along with the Request class

from beauty.items import BeautyItem
from scrapy.http import Request

Use XPath to select all of the image nodes
pics = response.xpath('//div[@class="pic"]/ul/li')
Loop over the li nodes and extract each image's name and address

        for pic in pics:
            item = BeautyItem()
            name = pic.xpath('./a/img/@alt').extract()[0]
            address = pic.xpath('./a/img/@src').extract()[0]

            item['name'] = name
            item['address'] = address

            yield item
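
The src attribute on a page is not always an absolute URL. A minimal sketch, assuming the site may emit relative paths (the example paths below are made up for illustration), shows how the standard library's urljoin resolves them against the page URL. Inside a spider, response.urljoin(address) does the same job against the response's base URL.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    return urljoin(page_url, src)


page = 'http://hunter-its.com/m/1.html'
# Hypothetical src values: root-relative, page-relative, and already absolute
print(absolute_image_url(page, '/uploads/a.jpg'))
print(absolute_image_url(page, 'pics/b.jpg'))
print(absolute_image_url(page, 'http://img.example.com/c.jpg'))
```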

Call parse recursively to crawl the remaining pages. Queue these requests once per page, after the item loop, not once per item:

        for i in range(2, 8):
            url = 'http://hunter-its.com/m/' + str(i) + '.html'
            print(url)
            yield Request(url, callback=self.parse)
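
The page URLs follow a simple numeric pattern, so they can also be built up front. A small sketch, assuming pages 1 through 7 exist as the range(2, 8) loop implies (page_urls is a hypothetical helper, not part of Scrapy):

```python
BASE = 'http://hunter-its.com/m/{}.html'


def page_urls(first, last):
    """Return the listing-page URLs from `first` to `last`, inclusive."""
    return [BASE.format(i) for i in range(first, last + 1)]


urls = page_urls(1, 7)
print(urls[0])
print(len(urls))
```

A list like this could be assigned to start_urls in the spider, which would remove the need to yield follow-up requests from parse.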

Complete code

# -*- coding: utf-8 -*-
import scrapy
from beauty.items import BeautyItem
from scrapy.http import Request


class HunterSpider(scrapy.Spider):
    name = 'hunter'
    allowed_domains = ['hunter-its.com']
    start_urls = ['http://hunter-its.com/m/1.html']

    def parse(self, response):
        # Select all of the image nodes
        pics = response.xpath('//div[@class="pic"]/ul/li')

        for pic in pics:
            item = BeautyItem()
            name = pic.xpath('./a/img/@alt').extract()[0]
            address = pic.xpath('./a/img/@src').extract()[0]

            item['name'] = name
            item['address'] = address

            yield item

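        # Queue the remaining pages once per response, not once per item;
        # Scrapy's duplicate filter drops URLs that were already requested.
        for i in range(2, 8):
            url = 'http://hunter-its.com/m/' + str(i) + '.html'
            print(url)
            yield Request(url, callback=self.parse)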


5. Write the item-processing script pipelines.py, which imports the requests module

import requests

class BeautyPipeline(object):
    def process_item(self, item, spider):

        # Send a browser-like User-Agent header
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
        # Fetch the image with a GET request via the requests module
        r = requests.get(url=item['address'], headers=headers, timeout=4)

        print(item['address'])
        # Save the image to a local directory
        with open(r'/Users/vincentwen/Downloads/hunter/'+ item['name'] + '.jpg', 'wb') as f:
            f.write(r.content)

        # Return the item so that any later pipeline stages still receive it
        return item
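
Building the file name straight from the alt text can fail when the text contains characters such as /, and not every image is a JPEG. A minimal sketch, under the assumption that replacing unsafe characters with _ is acceptable and that the URL path carries the real extension (image_path is a hypothetical helper, and the example title and URL are made up):

```python
import os
import re
from urllib.parse import urlparse


def image_path(directory, name, url, default_ext='.jpg'):
    """Build a local file path from an image title and its URL."""
    # Replace path separators, whitespace and other unsafe characters
    safe = re.sub(r'[\\/:*?"<>|\s]+', '_', name).strip('_') or 'image'
    # Take the extension from the URL path, ignoring any query string
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    return os.path.join(directory, safe + ext)


print(image_path('/Users/vincentwen/Downloads/hunter', 'a/b c',
                 'http://hunter-its.com/x/1.png?v=2'))
```

The returned path could then be passed to open() in process_item in place of the concatenated string.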


6. Edit settings.py and enable ITEM_PIPELINES

ITEM_PIPELINES = {
   'beauty.pipelines.BeautyPipeline': 100,
}

7. Run the spider

scrapy crawl hunter 
