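I've been learning web scraping for half a month now, and crawling pages like this one doesn't feel very technically demanding.
Right now I'm stuck on multiprocessing. On top of that, crawling large amounts of data always turns up one bug or another, which is a real headache, and I'm still not comfortable with proxies. I'll keep learning!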
__author__ = 'Kalvin.Tse'
from bs4 import BeautifulSoup
import requests
import pymongo
import time

# Connect to the local MongoDB instance; jokes are stored in baike.joke_info
client = pymongo.MongoClient('localhost', 27017)
baike = client['baike']
joke_info = baike['joke_info']

def get_joke(pages):
    # range() excludes its end value, so pages + 1 is needed to include the last page
    for i in range(1, pages + 1):
        url = 'http://www.qiushibaike.com/8hr/page/{}'.format(i)
        wb_data = requests.get(url)
        time.sleep(1)  # pause between requests
        print('Parsing page ' + str(i))
        print('--'*50)
        if wb_data.status_code == 200:
            analyse = BeautifulSoup(wb_data.text, 'lxml')
            names = analyse.select('div.author.clearfix h2')
            contents = analyse.select('div.content')
            likes = analyse.select('div.stats span.stats-vote i.number')
            for name, content, like in zip(names, contents, likes):
                data = {
                    'username': name.get_text(),
                    'content': content.get_text().strip(),
                    'likes': like.get_text()
                }
                print(data)
                joke_info.insert_one(data)
        else:
            pass  # skip pages that do not return 200

get_joke(50)  # call the function to crawl the first 50 pages; pages with no content are simply skipped
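
Since multiprocessing is the part I'm stuck on, here is a minimal sketch of how the page loop might be spread across processes with the standard library's multiprocessing.Pool. This is only one possible approach, not a tested crawler: page_urls and crawl_all are hypothetical helper names, and worker is assumed to be a module-level function that fetches and parses a single page URL (for example, a per-page variant of get_joke).

```python
from multiprocessing import Pool

BASE_URL = 'http://www.qiushibaike.com/8hr/page/{}'


def page_urls(pages):
    """Return the URLs for pages 1..pages (inclusive)."""
    return [BASE_URL.format(i) for i in range(1, pages + 1)]


def crawl_all(worker, pages, processes=4):
    """Run worker(url) for every page URL across a pool of processes.

    worker is a hypothetical per-page function; it must be defined at
    module level so that it can be pickled and sent to the child processes.
    """
    with Pool(processes=processes) as pool:
        # map() blocks until every page is done and keeps the page order
        return pool.map(worker, page_urls(pages))
```

If this is tried, crawl_all should be called under an `if __name__ == '__main__':` guard, and each worker process should open its own pymongo.MongoClient rather than share the module-level client, because MongoClient instances are not fork-safe.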