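The previous post outlined four modules; in this post we implement them:
- Request module
- uid parsing module
- Data crawling module
- Data saving module
1. Request module
Analysis:
- Random user-agent selection: keep a list of many user-agent strings and use the random library to pick one at random.
- Proxy settings: pass them straight through to the requests library via **kwargs.
- Preprocessing: drop the preprocessing step and return an object that can be queried with XPath directly.
Random ua selection: put the code below in a file of its own (there are just too many user-agents ╯︿╰):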
# file: random_user_agent.py
# -*- coding: utf-8 -*-
import random


def randomUserAgent():
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        # ... ...
    ]
    return random.choice(USER_AGENTS)
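The request module itself:
Parameters: url, **kwargs
Step 1: check whether kwargs contains headers; if it does not, use the default headers.
Step 2: send the request (GET by default) with url and kwargs, which by now always includes headers.
The proxy and any other settings are passed straight to requests through kwargs.
Step 3: parse the response from requests with etree.HTML().
Step 4: return the parsed result.
The code is below (the steps above cover it in detail, so I left out the comments):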
# -*- coding: utf-8 -*-
import requests
from lxml import etree
from random_user_agent import randomUserAgent


def getResponse(url, **kwargs):
    if 'headers' not in kwargs:
        kwargs['headers'] = {
            'User-Agent': randomUserAgent(),
        }
    r = requests.get(url, **kwargs)
    dom = etree.HTML(r.text)
    return dom
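The header check in step 1 can be tried without touching the network. The sketch below is only an offline illustration under stated assumptions: build_request_kwargs is a hypothetical helper (it is not in the original code), the two user-agent strings stand in for the full list, and the proxy address is a placeholder. It shows that options supplied by the caller, such as proxies, pass through untouched while a missing headers entry is filled in.

```python
import random

# Stand-in for the full list kept in random_user_agent.py
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
]


def build_request_kwargs(**kwargs):
    """Hypothetical helper: return kwargs with default headers filled in."""
    if 'headers' not in kwargs:
        kwargs['headers'] = {'User-Agent': random.choice(USER_AGENTS)}
    return kwargs


# Placeholder proxy address, for illustration only
with_proxy = build_request_kwargs(proxies={'http': 'http://127.0.0.1:8080'})
# Headers supplied by the caller are left exactly as given
custom = build_request_kwargs(headers={'User-Agent': 'my-crawler'})
```

The resulting dict is what getResponse() would hand on to requests.get(url, **kwargs).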
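2. uid parsing module
Analysis:
- Automatic deduplication: removed from the uid module. The uid module only returns uids; deduplication is left to the crawling module.
- uid generator: use yield.
- Endless crawling: recursively pass the users found in one round back into the uid parsing module as its argument.
Example argument:
start_users = [{'uid': 'a3ea268aeb60', 'follow_num': 525, 'fans_num': 2521, 'article_num': 118}]
- uid: the user's uid
- follow_num: number of users this user follows
- fans_num: number of followers this user has
- article_num: number of articles this user has written
These are the seed users for the start of the crawl. For convenience I picked only one; a real crawl should use a list of several users.
Example of a yielded value, which has the same shape as each element of the argument:
{'uid': 'a3ea268aeb60', 'follow_num': 525, 'fans_num': 2521, 'article_num': 118}
The code is as follows: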
def getUserUids(start_users):
    # Users found in this round of crawling
    next_users = []
    # Crawl the uids of everyone that each user in start_users follows
    for user in start_users:
        uid = user['uid']
        follow_num = user['follow_num']
        # Each request returns PER_NUM uids (PER_NUM is 9 here). If follow_num is divisible by PER_NUM,
        # max_page is int(follow_num / 9); otherwise it is int(follow_num / 9)+1
        max_page = int(follow_num / PER_NUM) if (follow_num % PER_NUM) == 0 else int(follow_num / PER_NUM)+1
        following_urls = ['http://www.itdecent.cn/users/{}/following?page={}'.format(uid, i)
                          for i in range(1, max_page+1)]
        for following_url in following_urls:
            dom = getResponse(following_url)
            items = dom.xpath('//ul/li//div[@class="info"]')
            for item in items:
                user = {}
                try:
                    user['uid'] = item.xpath('./a/@href')[0].split('/')[2]
                    user['follow_num'] = int(item.xpath('./div/span[1]/text()')[0].replace('關注','').strip())
                    user['fans_num'] = int(item.xpath('./div/span[2]/text()')[0].replace('粉絲', '').strip())
                    user['article_num'] = int(item.xpath('./div/span[3]/text()')[0].replace('文章','').strip())
                    next_users.append(user)
                    yield user
                except ValueError:
                    pass
    # Stop if this round found nobody; recursing on an empty list would never end
    if not next_users:
        return
    # Recursion: pass this round's results back into getUserUids()
    next_user_uids = getUserUids(next_users)
    # This is what keeps the crawl going without end
    for user in next_user_uids:
        yield user
Calling getUserUids() therefore gives us a generator that keeps producing uids without end. Use it like this:
start_users = [{'uid': 'a3ea268aeb60', 'follow_num': 525, 'fans_num': 2521, 'article_num': 118}]
uids = getUserUids(start_users)
for uid in uids:
    print(uid)
In theory, this code keeps printing uids to your console until it has printed the vast majority of Jianshu users or you stop it.
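The recursive generator pattern is easier to see without any network access. The sketch below is a toy version under stated assumptions: the FOLLOWING dict is a made-up in-memory follow graph, and walk_following is a hypothetical name. It uses yield from, which is equivalent to looping over the inner generator and yielding each item, and itertools.islice to take a fixed number of results from a generator that would otherwise never stop.

```python
from itertools import islice

# Made-up follow graph: uid -> uids that user follows
FOLLOWING = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],
    'c': [],
    'd': ['a'],
}


def walk_following(start_uids):
    """Yield followed uids round by round, recursing on each round's results."""
    next_uids = []
    for uid in start_uids:
        for followed in FOLLOWING.get(uid, []):
            next_uids.append(followed)
            yield followed
    # Recurse only when this round found someone
    if next_uids:
        yield from walk_following(next_uids)


# The graph contains cycles, so the generator never ends on its own
first_six = list(islice(walk_following(['a']), 6))
print(first_six)  # ['b', 'c', 'a', 'd', 'b', 'c']
```

The repeated uids in the output are why deduplication is still needed somewhere; here it is deliberately left out, just as in the uid module.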
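3. Data crawling module
The data crawling module can reuse the earlier code directly.
Analysis:
- Deduplication: keep the uids already crawled in a seen list, and check whether a uid is in seen before crawling it.
Combining the earlier code into one module: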
def getArticleInfo(user):
    uid = user['uid']
    article_num = user['article_num']
    # PER_NUM is 9 here
    max_page = int(article_num / PER_NUM) if (article_num % PER_NUM) == 0 else int(article_num / PER_NUM)+1
    article_urls = ['http://www.itdecent.cn/u/{}?order_by=shared_at&page={}'.format(uid, i) for i in
                    range(1, max_page+1)]
    details = []
    for article_url in article_urls:
        dom = getResponse(article_url)
        items = dom.xpath('//ul[@class="note-list"]/li')
        for item in items:
            # Extract the fields from each li element
            details_xpath = {
                'link': './div/a/@href',
                'title': './div/a/text()',
                'read_num': './/div[@class="meta"]/a[1]/text()',
                'comment_num': './/div[@class="meta"]/a[2]/text()',
                'heart_num': './/div[@class="meta"]/span[1]/text()',
            }
            key_and_path = details_xpath.items()
            detail = {}
            for key, path in key_and_path:
                detail[key] = ''.join(item.xpath(path)).strip()
            # Convert the numeric fields to integers
            for key in ['read_num', 'comment_num', 'heart_num']:
                detail[key] = int(detail[key])
            details.append(detail)
    # Return the crawled results
    return details
The statement:
int(article_num / PER_NUM) if (article_num % PER_NUM) == 0 else int(article_num / PER_NUM)+1
uses Python's conditional (ternary) expression, if ... else.
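That expression computes the number of pages by rounding the division up. As a small sketch, the same result can be obtained with integer arithmetic alone; the function names below are hypothetical and PER_NUM is 9 as in the code above.

```python
PER_NUM = 9  # items returned per page, as in the code above


def max_page_ternary(count):
    """Page count written with the conditional expression from the article."""
    return int(count / PER_NUM) if (count % PER_NUM) == 0 else int(count / PER_NUM) + 1


def max_page_ceil(count):
    """Same page count via integer ceiling division."""
    return (count + PER_NUM - 1) // PER_NUM


print(max_page_ternary(118), max_page_ceil(118))  # 14 14
print(max_page_ternary(525), max_page_ceil(525))  # 59 59
```

Both forms agree for every non-negative count; the integer version simply avoids the float division.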
Usage:
seen = []
start_users = [{'uid': 'a3ea268aeb60', 'follow_num': 525, 'fans_num': 2521, 'article_num': 118}]
users = getUserUids(start_users)
for user in users:
    if user['uid'] not in seen:
        seen.append(user['uid'])
        info = getArticleInfo(user)
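A list works for seen, but each in check scans the whole list, so lookups slow down as the crawl grows. A minimal sketch of the same check with a set, which has constant-time membership tests on average; unique_users is a hypothetical helper name, and every sample uid except the first is made up.

```python
def unique_users(users):
    """Yield each user the first time its uid appears, skipping repeats."""
    seen = set()
    for user in users:
        if user['uid'] in seen:
            continue
        seen.add(user['uid'])
        yield user


# Sample input with a repeated uid (only the first uid comes from the article)
sample = [
    {'uid': 'a3ea268aeb60'},
    {'uid': 'hypothetical01'},
    {'uid': 'a3ea268aeb60'},
]
unique_uids = [user['uid'] for user in unique_users(sample)]
print(unique_uids)  # ['a3ea268aeb60', 'hypothetical01']
```

Because unique_users is itself a generator, it could wrap getUserUids(start_users) without waiting for the endless stream to finish.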
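4. Data saving module
Analysis:
- Accept a list of dicts: use the DictWriter.writerows() method from the csv library.
- Detect whether the file already exists and open it in the appropriate mode: check with os.path.isfile(filepath) from the os library.
- Define the data saving module as a class, which makes the file easier to manage.
The code is as follows: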
import csv
import os


class simplifiedCsv:
    def __init__(self, filepath):
        self.file, self.writer = self.openFile(filepath)

    def __del__(self):
        self.file.close()

    def openFile(self, filepath):
        if os.path.isfile(filepath):
            # The file exists: append to it without writing the header again
            file = open(filepath, 'a', encoding='utf-8', newline='')
            writer = csv.DictWriter(file, ['link', 'title', 'read_num', 'comment_num', 'heart_num'])
            return file, writer
        else:
            # New file: create it and write the header row first
            file = open(filepath, 'w', encoding='utf-8', newline='')
            writer = csv.DictWriter(file, ['link', 'title', 'read_num', 'comment_num', 'heart_num'])
            writer.writeheader()
            return file, writer

    def writerows(self, data_list):
        self.writer.writerows(data_list)
Usage:
seen = []
start_users = [{'uid': 'a3ea268aeb60', 'follow_num': 525, 'fans_num': 2521, 'article_num': 118}]
users = getUserUids(start_users)
writer = simplifiedCsv('data.csv')
for user in users:
    if user['uid'] not in seen:
        seen.append(user['uid'])
        info = getArticleInfo(user)
        writer.writerows(info)
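To check the "write the header only once" behaviour in isolation, here is a small function-based sketch, not the class above. It assumes a hypothetical helper append_rows, a temporary directory, and a made-up sample row; it opens the file with a with block so the file is closed after each batch rather than in __del__.

```python
import csv
import os
import tempfile

FIELDS = ['link', 'title', 'read_num', 'comment_num', 'heart_num']


def append_rows(filepath, rows):
    """Append rows to a CSV file, writing the header only when creating it."""
    is_new = not os.path.isfile(filepath)
    with open(filepath, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Made-up row written twice to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), 'data.csv')
row = {'link': '/p/hypothetical', 'title': 'demo', 'read_num': 1, 'comment_num': 0, 'heart_num': 0}
append_rows(demo_path, [row])
append_rows(demo_path, [row])

with open(demo_path, encoding='utf-8', newline='') as f:
    lines = f.read().splitlines()
print(len(lines))  # 3: one header line plus two data rows
```

Reopening the file for every batch costs a little speed, but each batch is on disk as soon as the call returns, which matters if the crawl is stopped by hand.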
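That covers the implementation of every module described in the previous post. Resuming an interrupted crawl gets a post of its own next time.
This version of the code is v1.0.
The code on GitHub: version_1_simple_struct_all.py
After downloading it, you can run it directly with python (provided the required libraries are installed).
Once the program stops, you will find a data.csv file in the current directory.
I ran it for about ten minutes and it crawled roughly 14,000 records. Feel free to download the source and test it yourself. That completes our first small goal. Screenshot of the results:
Finally, if you found this useful, remember to follow, like, and comment!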