I've been busy lately, so the Sina Weibo crawler I started last week dragged on until now. As for Weibo's anti-scraping measures, all I can say is that I'm genuinely impressed.
Preparation before scraping
向右奔跑, our group leader, said there would be no restriction on what to crawl this time, but gave a reference for anyone who wants to have a go:

When I saw that, it seemed interesting and worth digging into. My idea was to find someone with a lot of followers, parse that user's profile, then parse the followers of those followers, and so on (it actually seems better to start from the followers of the people the seed user follows: a user with that many followers is unlikely to follow accounts with small follower counts). Later I almost gave up on the idea, because the problems really did pile up. Enough rambling; here is the information I scraped:
- Information scraped:
- 1. Weibo page title
- 2. Weibo nickname
- 3. User id
- 4. Weibo level
- 5. Region
- 6. School attended
- 7. Following count + URL
- 8. Follower count + URL
- 9. Weibo (post) count + URL
That is roughly all I collect for now; many users have incomplete profiles, so these fields are enough for testing.
A basic approach
Once the target fields are decided, the next step is parsing the page (and here comes the big difficulty). As far as I have seen there are two usual ways to get at the data: 1. parse the HTML source, 2. capture the background requests (JSON). Sina Weibo is more annoying: the content is embedded in JavaScript and not yet rendered into the DOM, so the only options are regular expressions or driving a browser with selenium. After staring at it for a while I asked 羅羅攀 whether there was any other way, saying I would otherwise fall back to selenium. He still recommended regex, since it parses faster and selenium should be the last resort, so I gritted my teeth and wrote the regexes. When checking whether a regex is correct, use an online regex tester; there is no need to rerun the code over and over.


To find where these fields live I stared at the source until my head hurt. There is actually a faster way: Ctrl+F.

Now that we know where each piece of information sits, what remains is writing the regexes that match it. That part you have to work out yourself, and it is good regular-expression practice.
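For example, to sanity-check a pattern you can fetch the page once, keep the raw text, and run the pattern against it. The sketch below is illustrative only: the seed URL reuses an id that appears later in this post, the cookie value is a placeholder you would copy from a logged-in browser, and the nickname pattern is the one used in weibo_spider.py further down.
#-*- coding:utf-8 -*-
# Quick regex sanity check: fetch a profile page once and test patterns against
# the raw, JS-embedded source (quotes appear escaped as \" in the text).
# The cookie value is a placeholder -- copy your own from a logged-in browser.
import re
import requests

url = 'http://weibo.com/u/1497035431?topnav=1&wvr=6&retcode=6102'
cookies = {'SUB': 'your-SUB-cookie-here'}
data = requests.get(url, cookies=cookies, timeout=20).text
print(re.findall(r'<title>(.*?)</title>', data))
print(re.findall(r'class=\\"username\\">(.+?)<', data))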
URLs and follower-list pagination
Profile page URL
Let's start with an example: http://weibo.com/p/1005051497035431/home?from=page_100505&mod=TAB&is_hot=1#place
A word of warning about this URL: requesting it directly in code does not return the profile page. In the test response, though, there is a location redirect link that replaces everything after the # with &retcode=6102, so the URL should be: http://weibo.com/p/1005051497035431/home?from=page_100505&mod=TAB&is_hot=1&retcode=6102
I clicked the link to test it, and the content is the same as the first one. One more point: every link we extract later needs the same replacement of the part after the #. Here is an example:
urls = re.findall(r'class=\\"t_link S_txt1\\" href=\\"(.*?)\\"',data)
careUrl = urls[0].replace('\\','').replace('#place','&retcode=6102')
fansUrl = urls[1].replace('\\','').replace('#place','&retcode=6102')
wbUrl = urls[2].replace('\\','').replace('#place','&retcode=6102')
Without this replacement, the links we extract still will not return the source we need.
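The same rule can be wrapped in a small helper and applied to the urls list from the snippet above (the function name is my own; it only applies the replacement described here to a link pulled out of the escaped source):
# Hypothetical helper: make a link extracted from the escaped source requestable --
# strip the escaping backslashes and replace the #fragment with &retcode=6102.
def normalize_url(raw):
    url = raw.replace('\\', '')
    if '#' in url:
        url = url.split('#')[0] + '&retcode=6102'
    return url

careUrl = normalize_url(urls[0])
fansUrl = normalize_url(urls[1])
wbUrl = normalize_url(urls[2])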
Follower-list pagination
I originally thought that by parsing one user's followers I could get a lot of data, but I ran into the system limit (while crawling, pages after the fifth return no data).

Seeing that "system limit" message, what is this now... fine, we can only see the information of 100 followers, so there is nothing to do but keep going. In other words we only need to handle 5 pages: if the total number of pages is greater than 5, treat it as 5; if less, handle it normally. With that sorted out, the next thing is the pagination links. Compare these three URLs:
- 1.http://weibo.com/p/1005051110411735/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans#place
- 2.http://weibo.com/p/1005051110411735/follow?relate=fans&page=2#Pl_Official_HisRelation__60
- 3.http://weibo.com/p/1005051110411735/follow?relate=fans&page=3#Pl_Official_HisRelation__60
Comparing these URLs, the difference is in the latter part. Besides replacing everything after the # with &retcode=6102, as described earlier, the part after follow? also changes; with that change we can construct the URLs starting from page 2.
Sample code:
urls = ['http://weibo.com/p/1005051497035431/follow?relate=fans&page={}&retcode=6102'.format(i) for i in range(2,int(pages)+1)]
That settles the pagination problem, which counts as one hurdle cleared. If you think these are the only anti-scraping tricks Sina Weibo has, you are being naive; read on.
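Putting the 5-page cap and the URL construction together, a sketch of the whole pagination step might look like this (the page-count pattern is the one used in fansSpider.py below; the helper name is mine, and profile_id is the full 100505-prefixed page id):
import re

# Sketch: read the page total from a follower page, clamp it to the 5-page
# system limit, and build the follower-list URLs for pages 2..N.
def fan_page_urls(profile_id, data):
    try:
        totalpage = int(re.findall(r'Pl_Official_HisRelation__6\d+\\">(\d+)<', data)[-1])
    except IndexError:
        totalpage = 1
    pages = min(totalpage, 5)  # anything past page 5 returns no data
    return ['http://weibo.com/p/{}/follow?relate=fans&page={}&retcode=6102'.format(profile_id, i)
            for i in range(2, pages + 1)]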
A road full of thorns
The whole scraping process was one pit after another. So far I have covered how the data is obtained and the URL and follower-pagination issues; now let's look at some of Sina Weibo's anti-scraping measures.
First, every request must carry cookies for authentication. That is normal enough, but here cookies are not a cure-all, because they have a lifetime. This is not much of a problem when fetching profile information, but when fetching follower-list pages the cookies expire. How to solve it? After thinking about it for a long time, I ended up solving it by simulating login with selenium, which I will describe in detail later. In short, watch out for this.
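The workaround used in the code below boils down to refresh-and-retry: if the follower pattern matches nothing, assume the cookies have gone stale, get a fresh set from the selenium login helper (the COOKIE class, covered in the next article) and request the page again. A compact sketch of that pattern, where cookie_helper is a COOKIE instance:
import re
import requests

# Refresh-on-expiry pattern (mirrors fansSpider.getData below): if the follower
# pattern matches nothing, fetch fresh cookies via the login helper and retry.
def fetch_fans_page(url, cookie_helper, cookies):
    data = requests.get(url, cookies=cookies, timeout=20).text
    infos = re.findall(r'fnick=(.+?)&f=1\\', data)
    if not infos:  # re.findall returns [], never None, when nothing matches
        cookies = cookie_helper.getcookie()  # simulated login, returns a new cookie dict
        data = requests.get(url, cookies=cookies, timeout=20).text
        infos = re.findall(r'fnick=(.+?)&f=1\\', data)
    return infos, cookies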
Second, not everyone's page source is the same. The clearest way to see this is to compare for yourself: log in to Weibo and compare the pagination part of the source on your own follower page with the same part for a user you searched for; other parts of the source differ as well. I really just want to say: big companies are something else.


Look carefully and you should be able to spot the differences. Overall, Sina Weibo is quite hard to crawl.
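Handling these layout differences comes down to keeping a fallback pattern: weibo_spider.py below first tries the item_text W_fl block and, when that misses, falls back to icon_group S_line1 W_fl. Compressed into a helper (the name is mine; both patterns are taken from the listing below), the idea is roughly:
import re

# Rough sketch: try the primary profile layout first, then the alternate layout
# seen on some accounts.
def extract_level(data):
    for pattern in (r'class=\\"item_text W_fl\\">(.+?)<',
                    r'class=\\"icon_group S_line1 W_fl\\">(.+?)<'):
        blocks = re.findall(pattern, data)
        if blocks:
            hit = re.findall(r'title=\\"(.*?)\\"', blocks[0])
            if hit:
                return hit[0]
    return u''  # profile layout not recognised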
Code
The code is honestly not in great shape and still has quite a few problems, but I will paste the core parts for reference and discussion (it feels a bit messy).
A word on the structure: there are two main classes and three helper classes.
Two main classes: the first parses follower ids, the other parses the detailed profile (checking before parsing whether an id has already been handled).
Three helper classes: the first simulates login and returns cookies (during crawling it only seems to get called once, which is probably a bug in my code), the second returns a random proxy, and the third writes the profile information to MySQL.
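For reading the listings below, the only thing you need to know about the proxy helper is its interface; a stand-in like this is enough to run them in isolation (the real, pool-based version is in the next article):
# Stand-in for proxy.Proxy, only showing the interface the main classes rely on:
# popip() returns a random proxy in the dict format that requests expects.
class Proxy(object):
    def popip(self):
        return {'http': 'http://123.123.123.123:8080'}  # placeholder address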
Below I paste the source of the two main classes; if you strip out the helper-related parts they will still run.
1.fansSpider.py
#-*- coding:utf-8 -*-
import requests
import re
import random
from proxy import Proxy
from getCookie import COOKIE
from time import sleep
from store_mysql import Mysql
from weibo_spider import weiboSpider
class fansSpider(object):
headers = [
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"},
{"user-agent": "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1"},
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"}
]
def __init__(self):
self.wbspider = weiboSpider()
self.proxie = Proxy()
self.cookie = COOKIE()
self.cookies = self.cookie.getcookie()
field = ['id']
self.mysql = Mysql('sinaid', field, len(field) + 1)
self.key = 1
def getData(self,url):
self.url = url
proxies = self.proxie.popip()
print self.cookies
print proxies
r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
while r.status_code != requests.codes.ok:
proxies = self.proxie.popip()
r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
data = requests.get(self.url,headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,timeout=20).text
#print data
infos = re.findall(r'fnick=(.+?)&f=1\\',data)
if not infos:  # re.findall returns an empty list, never None, when nothing matches
self.cookies = self.cookie.getcookie()
data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
timeout=20).text
infos = re.findall(r'fnick=(.+?)&f=1\\', data)
fans = []
for info in infos:
fans.append(info.split('&')[0])
try:
totalpage = re.findall(r'Pl_Official_HisRelation__6\d+\\">(\d+)<',data)[-1]
print totalpage
except:
totalpage = 1
# totalpage = re.findall(r'Pl_Official_HisRelation__\d+\\">(\d+)<', data)[-1]
Id = [one for one in re.findall(r'usercard=\\"id=(\d+)&',data)]
self.totalid = [Id[i] for i in range(1,len(fans)*2+1,2)]
if int(totalpage) == 1:
for one in self.totalid:
self.wbspider.getUserData(one)
item = {}
for one in self.totalid:
item[1] = one
self.mysql.insert(item)
fansurl = 'http://weibo.com/p/100505' + one + '/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
# fansurl = 'http://weibo.com/p/100505' + one + '/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
fan.getData(fansurl)
elif int(totalpage) >= 5:
totalpage=5
self.mulpage(totalpage)
# if self.key == 1:
# self.mulpage(totalpage)
# else:
# self.carepage(totalpage)
# def carepage(self,pages):
# #self.key=1
# urls = ['http://weibo.com/p/1005051497035431/follow?page={}&retcode=6102'.format(i) for i in range(2, int(pages) + 1)]
# for url in urls:
# sleep(2)
# print url.split('&')[-2]
# proxies = self.proxie.popip()
# r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
# print r.status_code
# while r.status_code != requests.codes.ok:
# proxies = self.proxie.popip()
# r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
# data = requests.get(url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
# timeout=20).text
# # print data
# infos = re.findall(r'fnick=(.+?)&f=1\\', data)
# if infos is None:
# self.cookies = self.cookie.getcookie()
# data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies,
# proxies=proxies,
# timeout=20).text
# infos = re.findall(r'fnick=(.+?)&f=1\\', data)
# fans = []
# for info in infos:
# fans.append(info.split('&')[0])
# Id = [one for one in re.findall(r'usercard=\\"id=(\d+)&', data)]
# totalid = [Id[i] for i in range(1, len(fans) * 2 + 1, 2)]
# for one in totalid:
# # print one
# self.totalid.append(one)
# for one in self.totalid:
# sleep(1)
# self.wbspider.getUserData(one)
# item = {}
# for one in self.totalid:
# item[1] = one
# self.mysql.insert(item)
# fansurl = 'http://weibo.com/p/100505'+one+'/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
# #fansurl = 'http://weibo.com/p/100505' + one + '/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
# fan.getData(fansurl)
def mulpage(self,pages):
#self.key=2
urls = ['http://weibo.com/p/1005051497035431/follow?relate=fans&page={}&retcode=6102'.format(i) for i in range(2,int(pages)+1)]
for url in urls:
sleep(2)
print url.split('&')[-2]
proxies = self.proxie.popip()
r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
print r.status_code
while r.status_code != requests.codes.ok:
proxies = self.proxie.popip()
r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
data = requests.get(url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
timeout=20).text
# print data
infos = re.findall(r'fnick=(.+?)&f=1\\', data)
if not infos:  # re.findall returns an empty list, never None, when nothing matches
self.cookies = self.cookie.getcookie()
data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies,
proxies=proxies,
timeout=20).text
infos = re.findall(r'fnick=(.+?)&f=1\\', data)
fans = []
for info in infos:
fans.append(info.split('&')[0])
Id = [one for one in re.findall(r'usercard=\\"id=(\d+)&', data)]
totalid = [Id[i] for i in range(1, len(fans) * 2 + 1, 2)]
for one in totalid:
#print one
self.totalid.append(one)
for one in self.totalid:
sleep(1)
self.wbspider.getUserData(one)
item ={}
for one in self.totalid:
item[1]=one
self.mysql.insert(item)
#fansurl = 'http://weibo.com/p/1005055847228592/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
fansurl = 'http://weibo.com/p/100505'+one+'/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
fan.getData(fansurl)
if __name__ == "__main__":
url = 'http://weibo.com/p/1005051497035431/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
fan = fansSpider()
fan.getData(url)
<em>The middle section is commented out because the code is still being debugged; just use it as a reference for the regexes and the general handling.</em>
2.weibo_spider.py
# -*- coding:utf-8 -*-
import requests
import re
from store_mysql import Mysql
import MySQLdb
class weiboSpider(object):
headers = {
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
}
cookies = {
'TC-Page-G0':'1bbd8b9d418fd852a6ba73de929b3d0c',
'login_sid_t':'0554454a652ee2a19c672e92ecee3220',
'_s_tentry':'-',
'Apache':'8598167916889.414.1493773707704',
'SINAGLOBAL':'8598167916889.414.1493773707704',
'ULV':'1493773707718:1:1:1:8598167916889.414.1493773707704:',
'SCF':'An3a20Qu9caOfsjo36dVvRQh7tKzwKwWXX7CdmypYAwRoCoWM94zrQyZ-5QJPjjDRpp2fBxA_9d6-06C8vLD490.',
'SUB':'_2A250DV37DeThGeNO7FEX9i3IyziIHXVXe8gzrDV8PUNbmtAKLWbEkW8qBangfcJP4zc_n3aYnbcaf1aVNA..',
'SUBP':'0033WrSXqPxfM725Ws9jqgMF55529P9D9WhR6nHCyWoXhugM0PU8VZAu5JpX5K2hUgL.Fo-7S0ecSoeXehB2dJLoI7pX9PiEIgij9gpD9J-t',
'SUHB':'0jBY7fPNWFbwRJ',
'ALF':'1494378549',
'SSOLoginState':'1493773739',
'wvr':'6',
'UOR':',www.weibo.com,spr_sinamkt_buy_lhykj_weibo_t111',
'YF-Page-G0':'19f6802eb103b391998cb31325aed3bc',
'un':'fengshengjie5 @ live.com'
}
def __init__(self):
field = ['title', 'name', 'id', 'wblevel', 'addr', 'graduate', 'care', 'careurl', 'fans', 'fansurl', 'wbcount',
'wburl']
conn = MySQLdb.connect(user='root', passwd='123456', db='zhihu', charset='utf8')
conn.autocommit(True)
self.cursor = conn.cursor()
self.mysql = Mysql('sina', field, len(field) + 1)
def getUserData(self,id):
self.cursor.execute('select id from sina where id=%s',(id,))
data = self.cursor.fetchall()
if data:
pass
else:
item = {}
#test = [5321549625,1669879400,1497035431,1265189091,5705874800,5073663404,5850521726,1776845763]
url = 'http://weibo.com/u/'+id+'?topnav=1&wvr=6&retcode=6102'
data = requests.get(url,headers=self.headers,cookies=self.cookies).text
#print data
id = url.split('?')[0].split('/')[-1]
try:
title = re.findall(r'<title>(.*?)</title>',data)[0]
title = title.split('_')[0]
except:
title= u''
try:
name = re.findall(r'class=\\"username\\">(.+?)<',data)[0]
except:
name = u''
try:
totals = re.findall(r'class=\\"W_f\d+\\">(\d*)<',data)
care = totals[0]
fans = totals[1]
wbcount = totals[2]
except:
care = u''
fans = u''
wbcount = u''
try:
urls = re.findall(r'class=\\"t_link S_txt1\\" href=\\"(.*?)\\"',data)
careUrl = urls[0].replace('\\','').replace('#place','&retcode=6102')
fansUrl = urls[1].replace('\\','').replace('#place','&retcode=6102')
wbUrl = urls[2].replace('\\','').replace('#place','&retcode=6102')
except:
careUrl = u''
fansUrl = u''
wbUrl = u''
profile = re.findall(r'class=\\"item_text W_fl\\">(.+?)<',data)
try:
wblevel = re.findall(r'title=\\"(.*?)\\"',profile[0])[0]
addr = re.findall(u'[\u4e00-\u9fa5]+', profile[1])[0]  # region
except:
profile1 = re.findall(r'class=\\"icon_group S_line1 W_fl\\">(.+?)<',data)
try:
wblevel = re.findall(r'title=\\"(.*?)\\"', profile1[0])[0]
except:
wblevel = u''
try:
addr = re.findall(u'[\u4e00-\u9fa5]+', profile[0])[0]
except:
addr = u''
try:
graduate = re.findall(r'profile&wvr=6\\">(.*?)<',data)[0]
except:
graduate = u''
item[1] = title
item[2] =name
item[3] =id
item[4] =wblevel
item[5] =addr
item[6] =graduate
item[7] =care
item[8] =careUrl
item[9] =fans
item[10] =fansUrl
item[11] =wbcount
item[12] =wbUrl
self.mysql.insert(item)
<em>It is written rather messily, so bear with it; as I said, this is only a demo.</em>
One of the helper classes is the MySQL writer (you can refer to Mr_Cxy's small example of operating a MySQL database from Python); the other two, the random proxy and the cookie fetcher, will be explained in detail in the next article.
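For reference, the way the main classes call it — Mysql(table, field_list, len(field_list) + 1) and insert(item) with item keyed 1..n in field order — suggests something along these lines. This is my own minimal reconstruction, not the original store_mysql helper:
#-*- coding:utf-8 -*-
# Minimal reconstruction of store_mysql.Mysql based on how it is called above;
# not the original implementation. insert() expects a dict keyed 1..n in field order.
import MySQLdb

class Mysql(object):
    def __init__(self, table, field, count):
        self.table = table
        self.field = field
        conn = MySQLdb.connect(user='root', passwd='123456', db='zhihu', charset='utf8')
        conn.autocommit(True)
        self.cursor = conn.cursor()

    def insert(self, item):
        sql = 'insert into {0} ({1}) values ({2})'.format(
            self.table, ','.join(self.field), ','.join(['%s'] * len(self.field)))
        values = [item[i + 1] for i in range(len(self.field))]
        self.cursor.execute(sql, values)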
Run results and data


<em>200 is the status code, meaning the request succeeded</em>

Summary
So far Sina Weibo is the site where I have hit the most problems, but I have also learned a lot from it: regular expressions, random proxies and so on. The more problems you run into while learning, the more you accumulate and the faster you improve, so hitting problems and making mistakes is a blessing in its own way. Here are the issues the code still has when running (happy to discuss and solve them together):
- 1. Two ids keep looping endlessly; the problem is probably in the loop logic. Happy to work it out together; I will update the article once it is solved.
- 2. Parsing speed (single-threaded is slow; a scrapy version will follow).
- 3. Deduplication (currently parsed ids are written to the database and checked before parsing).
That is more or less the rough idea. There are still some problems, so treat it as a reference; if you run into issues we can work them out together (full source available on request).