1. Introduction
Scrape 20 pages of Taylor Swift images.

URL: http://weheartit.com/inspirations/taylorswift
2. Analysis
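- Page URL format: http://weheartit.com/inspirations/taylorswift?page=1
  The page number is the number at the end of the URL.
- Each listing page contains the addresses of 24 images.
- Choose a folder on the local system and save the downloaded images there.
- Before each download, check whether the image has already been saved.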
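
The page-number pattern above can be sketched as a small helper. This is a minimal sketch, not part of the script itself: the name build_page_urls and its parameters are hypothetical, and the figure of 24 images per page comes from the analysis above rather than from a check against the live site.

```python
# Hypothetical helper: builds the listing-page URLs described in the analysis.
BASE_URL = 'http://weheartit.com/inspirations/taylorswift'
IMAGES_PER_PAGE = 24  # taken from the analysis; the live site may differ


def build_page_urls(base_url, first_page, last_page):
    """Return one URL per listing page, first_page..last_page inclusive."""
    return ['{}?page={}'.format(base_url, n)
            for n in range(first_page, last_page + 1)]


page_urls = build_page_urls(BASE_URL, 1, 20)
# Upper bound on the number of images, assuming every page is full
expected_images = len(page_urls) * IMAGES_PER_PAGE

print(page_urls[0])
print(page_urls[-1])
print(expected_images)
```

With 20 pages of 24 images each, the crawl should yield at most 480 image addresses.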
3. Implementation
# vim spider_taylor.py  // create a new file with the code below
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import os
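headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'
}
# Image downloads are a bit slow, so route them through a local proxy
# (requests expects the proxy URL to include its scheme)
proxies = {"http": "http://127.0.0.1:8118"}
# List that collects the image URLs
pic_lists = []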
# Collect the image URLs found on one listing page
def get_pic_from(url):
    wb_data = requests.get(url, headers=headers)
    if wb_data.status_code != 200:
        print('%s Request Error!' % wb_data.status_code)
    else:
        soup = BeautifulSoup(wb_data.text, 'lxml')
        links = soup.select('a.js-entry-detail-link > img')
        for link in links:
            pic_link = link.get('src')
            # Append each URL to pic_lists as soon as it is found
            pic_lists.append(pic_link)
            # Print the image URL
            print(pic_link)
# Download a single image
def down_pic_from(pic_url):
    # Folder where the images are stored
    file_folder = '/home/wjh/taylor/'
    # Take the image ID and extension from the URL and join them to the folder path
    file_path = file_folder + pic_url.split('/')[-2] + '.' + pic_url.split('.')[-1]
    # If the image already exists, print its path and size instead of downloading
    if os.path.exists(file_path):
        print('"%s" File exists... : %d K' % (file_path, os.path.getsize(file_path)//1024))
    else:
        sub_data = requests.get(pic_url, proxies=proxies, headers=headers)
        if sub_data.status_code != 200:
            print('%s Request Error!' % sub_data.status_code)
            # Do not write an error response to disk
            return
        # Save the image
        with open(file_path, 'wb') as fs:
            fs.write(sub_data.content)
        # After saving, print the image URL, path, and size
        print("%s => %s - %d K" % (pic_url, file_path, os.path.getsize(file_path)//1024))
# Put every listing-page URL to request into a list
urls = ['http://weheartit.com/inspirations/taylorswift?page={}'.format(i) for i in range(1, 21)]
# Collect the image URLs
for url in urls:
    get_pic_from(url)
# Take each image URL from pic_lists and pass it to the download function
for url in pic_lists:
    down_pic_from(url)
# python3 spider_taylor.py  // running it prints the following
http://data.whicdn.com/images/247905262/superthumb.jpg
http://data.whicdn.com/images/247922201/superthumb.png
http://data.whicdn.com/images/247820953/superthumb.png
http://data.whicdn.com/images/120752508/superthumb.jpg
http://data.whicdn.com/images/127306344/superthumb.jpg
http://data.whicdn.com/images/247922493/superthumb.png
http://data.whicdn.com/images/247852730/superthumb.jpg
.
.
.
http://data.whicdn.com/images/247824232/superthumb.jpg
http://data.whicdn.com/images/246004790/superthumb.gif
http://data.whicdn.com/images/247923478/superthumb.jpg
.
.
.
"/home/wjh/taylor/189891880.gif" File exists... : 32 K
"/home/wjh/taylor/189891478.gif" File exists... : 60 K
"/home/wjh/taylor/222639313.gif" File exists... : 62 K
"/home/wjh/taylor/194658386.png" File exists... : 113 K
"/home/wjh/taylor/190061497.jpg" File exists... : 92 K
"/home/wjh/taylor/181241808.jpg" File exists... : 36 K
"/home/wjh/taylor/238329467.jpg" File exists... : 29 K
http://data.whicdn.com/images/247822005/superthumb.jpg => /home/wjh/taylor/247822005.jpg - 23 K
http://data.whicdn.com/images/247821899/superthumb.jpg => /home/wjh/taylor/247821899.jpg - 30 K
http://data.whicdn.com/images/247853275/superthumb.jpg => /home/wjh/taylor/247853275.jpg - 30 K
.
.
.
http://data.whicdn.com/images/247859540/superthumb.jpg => /home/wjh/taylor/247859540.jpg - 25 K
http://data.whicdn.com/images/247854181/superthumb.jpg => /home/wjh/taylor/247854181.jpg - 29 K
http://data.whicdn.com/images/247898534/superthumb.jpg => /home/wjh/taylor/247898534.jpg - 20 K
http://data.whicdn.com/images/247906728/superthumb.jpg => /home/wjh/taylor/247906728.jpg - 19 K
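
The log lines above show each image URL mapped to a file named after the numeric ID in the URL. A minimal sketch of that mapping follows, assuming every URL has the form .../images/<id>/superthumb.<ext>. The name local_path_for is hypothetical, and the sketch swaps the script's plain string splitting for urllib.parse and posixpath from the standard library, so that a query string on the URL would not end up in the file extension.

```python
import posixpath
from urllib.parse import urlparse


def local_path_for(pic_url, folder):
    """Map an image URL to a local path of the form <folder>/<id>.<ext>.

    Hypothetical helper; assumes the URL path looks like
    /images/<id>/superthumb.<ext>.
    """
    url_path = urlparse(pic_url).path          # drops scheme, host, query
    image_id = url_path.split('/')[-2]         # directory just above the file
    extension = posixpath.splitext(url_path)[1]  # includes the leading dot
    return posixpath.join(folder, image_id + extension)


example = local_path_for(
    'http://data.whicdn.com/images/247822005/superthumb.jpg',
    '/home/wjh/taylor/')
print(example)
```

Naming files by ID rather than by the final path component matters here because every image is served as superthumb.<ext>, so the last component alone would collide.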
4. Summary
- Common file modes for with open():
  - r: open for reading (text)
  - w: open for writing (text)
  - rb: open for reading in binary mode
  - wb: open for writing in binary mode
- os.path functions used:
  - exists(): checks whether a file exists
  - getsize(): returns the file size in bytes
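
The binary file modes and the two os.path functions can be tried together without touching the network. The sketch below is illustrative only: it writes made-up bytes into a temporary folder, and the file name example.png and the payload are assumptions, not data from the crawl.

```python
import os
import tempfile

# Stand-in for downloaded image bytes (6 header bytes plus 2048 zero bytes)
payload = b'\x89PNG\r\n' + bytes(2048)

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, 'example.png')
    existed_before = os.path.exists(file_path)

    # 'wb' writes the bytes exactly as given, with no text encoding step
    with open(file_path, 'wb') as fs:
        fs.write(payload)

    existed_after = os.path.exists(file_path)
    size_bytes = os.path.getsize(file_path)

    # 'rb' reads the same bytes back unchanged
    with open(file_path, 'rb') as fs:
        round_trip = fs.read()

print(existed_before, existed_after)
print('%d bytes = %d K' % (size_bytes, size_bytes // 1024))
```

Since getsize() reports bytes, the script divides by 1024 with // to print whole kilobytes.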