Crawling Zhihu User Information with Scrapy

This post documents a modular crawl of Zhihu user information using Scrapy, an open-source project; readers unfamiliar with it should read up on it first.

Zhihu is a very active community today, with a stream of high-quality Q&A and column articles on life, wisdom, careers, technology, and more. There are always some negative or one-sided opinions, but it is undeniably a positive, open community.

As learners, what we must always do is keep an open mind, discarding the dross and taking the essence. Zhihu may be virtual, but it closely mirrors reality.

My initial idea was to simulate a login to Zhihu with Scrapy and obtain the cookie assigned by the server, borrowing from the GitHub project fuck-login, whose source uses requests to simulate the Zhihu login.

After the first successful login, once the cookie has been set, a requests session manages the cookies for us, which saves a lot of low-level work; later crawls can simply issue GET/POST requests through the session.

Of course, you could also log in to Zhihu in a browser first, persist the cookie locally, and then attach it to the headers of subsequent requests.
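For reference, a minimal sketch of that idea with requests; LWPCookieJar is from the standard library, but the file name cookies.txt and the surrounding flow are illustrative assumptions rather than the fuck-login code itself:

# Minimal sketch: persist cookies across runs with a requests session.
# The file name 'cookies.txt' is an assumption for illustration.
import requests
from http.cookiejar import LWPCookieJar

session = requests.session()
session.cookies = LWPCookieJar('cookies.txt')
try:
    session.cookies.load(ignore_discard=True)   # reuse an earlier login
except FileNotFoundError:
    pass                                        # first run: no cookie file yet

# ... perform the login POST here once if no valid cookie was loaded ...

session.cookies.save(ignore_discard=True)       # persist for the next run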

How it works

Simulating the login

要將這個項(xiàng)目集成到Scrapy當(dāng)中,需要使用ScrapyRequest重寫下,實(shí)現(xiàn)模擬登錄

To log in to Zhihu, we need to set the User-Agent in the request headers and POST some data to the login page. Note that this rewrite logs in with a phone number as the account; for email login, refer to the original project.

If you don't know what to send, first use the Chrome developer tools to see what a normal request carries, then cosplay it; fuck-login has already done this homework for us.

headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0'
}

postdata = {
    '_xsrf': '',
    'password': 'xxxxxx',
    'phone_num': 'xxxxxx',
    'captcha': ''
}
  • _xsrf is a code hidden in the login page, so after requesting the home page we must extract this value in the callback function
  • phone_num and password need no explanation: they are our login name and password
  • captcha is the verification code we would otherwise type in on the login page

There are several ways to handle the captcha: programmatic recognition, e.g. tesseract-ocr (open-sourced by Google), which I don't recommend; online captcha-solving platforms such as Yundama (云打碼), with passable accuracy at low cost; and human captcha-solving services, the most accurate but also the most expensive.

Here I follow fuck-login's approach: download the captcha image linked from the login page to disk and simply type it in by hand.

The source code is as follows:

# -*- coding: utf-8 -*-
import scrapy
import re, json
import time
try:
    from PIL import Image
except ImportError:
    pass

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/people/zhu-xiao-fei-47-24/following']

    headers = {
        "Host": "www.zhihu.com",
        "Referer": "https://www.zhihu.com/",
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0'
    }

    postdata = {
        '_xsrf': '',
        'password': 'xxxxxx',    # fill in your own credentials
        'phone_num': 'xxxxxx',
        'captcha': ''
    }

    def parse(self, response):
        with open('index.html','w') as f:
            f.write(response.text)
        print('over')

    def start_requests(self):
        return [scrapy.Request('https://www.zhihu.com/#login',
                          headers=self.headers,callback=self.get_xsrf)]

    def get_xsrf(self,response):
        response_text = response.text
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
        if match_obj:
            self.postdata['_xsrf'] = match_obj.group(1)
            t = str(int(time.time() * 1000))
            captcha_url = 'https://www.zhihu.com/captcha.gif?r='+t+"&type=login"
            return [scrapy.Request(
                    captcha_url,headers=self.headers,callback=self.get_captcha)]

    def get_captcha(self,response):
        with open('captcha.jpg', 'wb') as f:
            f.write(response.body)   # the with block closes the file for us
        try:
            im = Image.open('captcha.jpg')
            im.show()
            #im.close()
        except:
            print('find the captcha yourself')
        self.postdata['captcha'] = input("please input the captcha\n>").strip()
        if self.postdata['_xsrf'] and self.postdata['captcha']:
            post_url = 'https://www.zhihu.com/login/phone_num'
            return [scrapy.FormRequest(
                url=post_url,
                formdata=self.postdata,
                headers=self.headers,
                callback=self.check_login
            )]

    def check_login(self,response):
        json_text = json.loads(response.text)
        if 'msg' in json_text and json_text['msg'] == '登錄成功':
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)  # no callback given: the response goes to parse()
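With the login flow in place, the spider can be run the usual way from the project root (assuming a standard Scrapy project layout):

scrapy crawl zhihu

If the login succeeds, check_login sees 登錄成功 ("login successful") in the response JSON and the crawl proceeds to start_urls.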

Extracting the information

Approach

Once Scrapy has the cookie, it is logged in to Zhihu; what remains is the crawl logic and the concrete implementation of the information extraction.

The concrete logic was inspired by Aljun: start from a well-followed user (me, for example, haha~), fetch everyone they follow, then fetch those users' profiles, filter out throwaway accounts, and write the rest to the database; then fetch the followees of those followees, record their profiles too, and keep going without pause. Scrapy itself crawls depth-first by default.

Watching the request flow on a Zhihu page shows that the user module's front end and back end are decoupled: the back-end API looks quite clean, the front end fetches data from it via Ajax, and work such as template rendering is left to React.

Since every page refresh fires Ajax requests at the back end (to keep the data fresh), we can debug the page with the developer tools, refresh once, and look through all the request URLs; after examining every request that was not served from cache, the Ajax endpoint is fairly easy to spot.

First we design the database, here in MySQL; the schema can be driven by whichever obtainable user fields we find interesting. Zhihu provides an API endpoint for this data (shown first for reference; I did not end up using this exact call).

Zhihu's public API for a single user's detailed profile:
https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24. Encoding some query parameters into the URL yields the corresponding user information:

https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24?include=locations%2Cemployments%2Cgender%2Ceducations%2Cbusiness%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Ccover_url%2Cfollowing_topic_count%2Cfollowing_question_count%2Cfollowing_favlists_count%2Cfollowing_columns_count%2Cavatar_hue%2Canswer_count%2Carticles_count%2Cpins_count%2Cquestion_count%2Ccolumns_count%2Ccommercial_question_count%2Cfavorite_count%2Cfavorited_count%2Clogs_count%2Cmarked_answers_count%2Cmarked_answers_text%2Cmessage_thread_token%2Caccount_status%2Cis_active%2Cis_force_renamed%2Cis_bind_sina%2Csina_weibo_url%2Csina_weibo_name%2Cshow_sina_weibo%2Cis_blocking%2Cis_blocked%2Cis_following%2Cis_followed%2Cmutual_followees_count%2Cvote_to_count%2Cvote_from_count%2Cthank_to_count%2Cthank_from_count%2Cthanked_count%2Cdescription%2Chosted_live_count%2Cparticipated_live_count%2Callow_message%2Cindustry_category%2Corg_name%2Corg_homepage%2Cbadge[%3F(type%3Dbest_answerer)].topics

向這個接口請求發(fā)起一個GET請求,就可以獲得后臺發(fā)送來的JSON數(shù)據(jù),這個信息是比較完善的,當(dāng)我們知道可以獲取哪些信息,找出自己關(guān)注的信息,就可以設(shè)計(jì)我們的數(shù)據(jù)庫了,

Note, though, that this payload is clearly too large; we should encode into the URL only the parameters our needs require, which shrinks the returned JSON and saves bandwidth.
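For instance, a small sketch of assembling such a URL with only the fields we need; the field list below is just the subset I care about, and urlencode does the percent-escaping:

# Sketch: build a member-info URL with a trimmed-down include list.
from urllib.parse import urlencode

USER_API = 'https://www.zhihu.com/api/v4/members/{}'

FIELDS = ('gender', 'voteup_count', 'thanked_count', 'follower_count',
          'following_count', 'answer_count', 'articles_count',
          'favorited_count', 'badge[?(type=best_answerer)].topics')

def member_info_url(url_token):
    # urlencode percent-escapes the commas and brackets for us
    return USER_API.format(url_token) + '?' + urlencode({'include': ','.join(FIELDS)})

print(member_info_url('zhu-xiao-fei-47-24'))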

Say our Item has the following structure (each field corresponds to a column in the MySQL table):

class ZhihuUserItem(Item):
    name = scrapy.Field()
    id = scrapy.Field()
    url_token = scrapy.Field()
    headline = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    gender = scrapy.Field()
    avatar_url = scrapy.Field()
    user_type = scrapy.Field()
    following_count = scrapy.Field()
    follower_count = scrapy.Field()
    thxd_count = scrapy.Field()
    agreed_count = scrapy.Field()
    collected_count = scrapy.Field()
    badge = scrapy.Field()
    craw_time = scrapy.Field()

And here is the JSON I would like to get back:

{
  "following_count": 35,
  "user_type": "people",
  "id": "113d0e23a9ada1a61faf0272b4acf6c4",
  "favorited_count": 38,
  "voteup_count": 31,
  "headline": "學(xué)生",
  "url_token": "zhu-xiao-fei-47-24",
  "follower_count": 22,
  "avatar_url_template": "https://pic2.zhimg.com/20108b43c7b928229ba5cfafccca1235_{size}.jpg",
  "name": "朱曉飛",
  "thanked_count": 17,
  "gender": 1,
  "articles_count": 0,
  "badge": [ ],
  "answer_count": 32,
}

To get it, we can encode the request like this:

https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24?include=gender%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Canswer_count%2Carticles_count%2Cfavorite_count%2Cfavorited_count%2Cthanked_count%2Cbadge[%3F(type%3Dbest_answerer)].topics

Similarly:

Zhihu's back end also exposes an API for the people a user follows:
https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees. The endpoint is clearly keyed to the user, here zhu-xiao-fei-47-24, which is the user's url_token attribute; we can splice any user's url_token into this pattern to build their followees URL.

Since the HTTP method of the query is GET, its parameters are encoded into the URL, so we can likewise urlencode some request parameters to select the data we want, e.g.:

https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data[*].answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge[%3F(type%3Dbest_answerer)].topics&offset=20&limit=20

向這個請求發(fā)起一個GET請求,就可以獲得后臺發(fā)送來的Json數(shù)據(jù),截取部分實(shí)例如下:

{

    "paging": {
        "is_end": true,
        "totals": 35,
        "previous": "http://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0",
        "is_start": false,
        "next": "http://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=40"
    },
    "data": [
        {
            "is_followed": false,
            "avatar_url_template": "https://pic1.zhimg.com/85fa1149f2d02e930498508dc71e6790_{size}.jpg",
            "user_type": "people",
            "answer_count": 391,
            "is_following": true,
            "url": "http://www.zhihu.com/api/v4/people/29d476a5746f5dd5ae8a296354e817de",
            "type": "people",
            "url_token": "chexiaopang",
            "id": "29d476a5746f5dd5ae8a296354e817de",
            "articles_count": 28,
            "name": "車小胖",
            "headline": "網(wǎng)絡(luò)主治大夫,專治疑難雜癥",
            "gender": 1,
            "is_advertiser": false,
            "avatar_url": "https://pic1.zhimg.com/85fa1149f2d02e930498508dc71e6790_is.jpg",
            "is_org": false,
            "follower_count": 25580,
            "badge": [
                {
                    "topics": [
                        {
                            "url": "http://www.zhihu.com/api/v4/topics/19572894",
                            "avatar_url": "https://pic3.zhimg.com/e0bd139b2_is.jpg",
                            "name": "計(jì)算機(jī)網(wǎng)絡(luò)",
                            "introduction": "計(jì)算機(jī)網(wǎng)絡(luò)( <a href=\"http://www.wikiwand.com/en/Computer_Networks\" data-editable=\"true\" data-title=\"Computer Networks\">Computer Networks</a> )指將地理位置不同的多臺計(jì)算機(jī)及其外部設(shè)備,通過通信線路連接起來,在網(wǎng)絡(luò)操作系統(tǒng)及網(wǎng)絡(luò)通信協(xié)議的管理和協(xié)調(diào)下,實(shí)現(xiàn)資源共享和信息傳遞的計(jì)算機(jī)系統(tǒng)。",
                            "type": "topic",
                            "excerpt": "計(jì)算機(jī)網(wǎng)絡(luò)( Computer Networks )指將地理位置不同的多臺計(jì)算機(jī)及其外部設(shè)備,通過通信線路連接起來,在網(wǎng)絡(luò)操作系統(tǒng)及網(wǎng)絡(luò)通信協(xié)議的管理和協(xié)調(diào)下,實(shí)現(xiàn)資源共享和信息傳遞的計(jì)算機(jī)系統(tǒng)。",
                            "id": "19572894"
                        }
                    ],
                    "type": "best_answerer",
                    "description": "優(yōu)秀回答者"
                }
            ]
        },
        //.... 19 more user records follow
    ]
}

As you can see, this API also returns each followee's profile, one per entry in the data array. This is exactly the endpoint we need!

We can construct the URL to fetch exactly the data we want; the endpoint takes three parameters.

第一個include就是我們可以請求到的用戶的信息,第二個offset是偏移量表征當(dāng)前返回的第一個記錄相對第一個following person的數(shù)量,第三個limit是返回限制數(shù)量,后面兩個貌似起不到控制作用,所以可以無視,但是Spider對于一個沒有提取過following person的時(shí)候,需要將offset設(shè)置為0。

而第一個參數(shù)include就是關(guān)注人的信息,我們可以將用戶的屬性如感謝數(shù)使用thanked_Count%2C拼接起來:所以根據(jù)上面的需求,我們可以這么編碼

https://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data[*].gender%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Canswer_count%2Carticles_count%2Cfavorite_count%2Cfavorited_count%2Cthanked_count%2Cbadge[%3F(type%3Dbest_answerer)].topics&offset=20&limit=20

請求這個接口,就可以獲得我們數(shù)據(jù)庫所需要的信息,并且可以不傳輸大量的數(shù)據(jù),如下:

{

    "paging": {
        "is_end": true,
        "totals": 35,
        "previous": "http://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data%5B%2A%5D.gender%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Canswer_count%2Carticles_count%2Cfavorite_count%2Cfavorited_count%2Cthanked_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0",
        "is_start": false,
        "next": "http://www.zhihu.com/api/v4/members/zhu-xiao-fei-47-24/followees?include=data%5B%2A%5D.gender%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Canswer_count%2Carticles_count%2Cfavorite_count%2Cfavorited_count%2Cthanked_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=40"
    },
    "data": [
        {
            "avatar_url_template": "https://pic1.zhimg.com/85fa1149f2d02e930498508dc71e6790_{size}.jpg",
            "following_count": 118,
            "user_type": "people",
            "answer_count": 391,
            "headline": "網(wǎng)絡(luò)主治大夫,專治疑難雜癥",
            "url_token": "chexiaopang",
            "id": "29d476a5746f5dd5ae8a296354e817de",
            "favorite_count": 0,
            "articles_count": 28,
            "type": "people",
            "name": "車小胖",
            "url": "http://www.zhihu.com/api/v4/people/29d476a5746f5dd5ae8a296354e817de",
            "gender": 1,
            "favorited_count": 27051,
            "is_advertiser": false,
            "avatar_url": "https://pic1.zhimg.com/85fa1149f2d02e930498508dc71e6790_is.jpg",
            "is_org": false,
            "thanked_count": 7161,
            "follower_count": 25588,
            "voteup_count": 30449,
            "badge": [
                {
                    "topics": [
                        {
                            "introduction": "計(jì)算機(jī)網(wǎng)絡(luò)( <a href=\"http://www.wikiwand.com/en/Computer_Networks\" data-editable=\"true\" data-title=\"Computer Networks\">Computer Networks</a> )指將地理位置不同的多臺計(jì)算機(jī)及其外部設(shè)備,通過通信線路連接起來,在網(wǎng)絡(luò)操作系統(tǒng)及網(wǎng)絡(luò)通信協(xié)議的管理和協(xié)調(diào)下,實(shí)現(xiàn)資源共享和信息傳遞的計(jì)算機(jī)系統(tǒng)。",
                            "avatar_url": "https://pic3.zhimg.com/e0bd139b2_is.jpg",
                            "name": "計(jì)算機(jī)網(wǎng)絡(luò)",
                            "url": "http://www.zhihu.com/api/v4/topics/19572894",
                            "type": "topic",
                            "excerpt": "計(jì)算機(jī)網(wǎng)絡(luò)( Computer Networks )指將地理位置不同的多臺計(jì)算機(jī)及其外部設(shè)備,通過通信線路連接起來,在網(wǎng)絡(luò)操作系統(tǒng)及網(wǎng)絡(luò)通信協(xié)議的管理和協(xié)調(diào)下,實(shí)現(xiàn)資源共享和信息傳遞的計(jì)算機(jī)系統(tǒng)。",
                            "id": "19572894"
                        }
                    ],
                    "type": "best_answerer",
                    "description": "優(yōu)秀回答者"
                }
            ]
        },
        //.... 19 more user records follow
    ]
}

Note
While fetching the data we want, our crawler should obey one principle:

make as few HTTP requests as possible

After adjusting the request URL this way, one HTTP request yields 20 items, instead of one request to list url_tokens plus one additional request per user for each profile. This single change boosts the crawler's throughput roughly 20-fold, an exaggeration perhaps; in any case, the bottleneck gradually stops being data retrieval and is more likely to be our database writes.

Implementation

At this point, building on the simulated login, we can complete our spider, mainly by adding the parse instance method:

following_api = "https://www.zhihu.com/api/v4/members/{}/followees?include=data[*].gender%2Cvoteup_count%2Cthanked_Count%2Cfollower_count%2Cfollowing_count%2Canswer_count%2Carticles_count%2Cfavorite_count%2Cfavorited_count%2Cthanked_count%2Cbadge[%3F(type%3Dbest_answerer)].topics&offset=0&limit=20"


class ZhihuSpider(scrapy.Spider):
    start_urls = [following_api.format('teng-xun-ke-ji')]

    '''
    the simulated-login code from the section above
    '''

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())

        if not jsonresponse['paging']['is_end']:
            yield scrapy.Request(url=jsonresponse['paging']['next'])

        if jsonresponse['data']:
            for data in jsonresponse['data']:
                url_token = data.get('url_token')
                if url_token:
                    yield scrapy.Request(url=following_api.format(url_token))

                    agreed_count = data['voteup_count']
                    thxd_count = data['thanked_count']
                    collected_count = data['favorited_count']
                    if thxd_count or collected_count:
                        item_loader = ZhihuUserItemLoader(item=ZhihuUserItem(), response=response)
                        item_loader.add_value('name',data['name'])
                        item_loader.add_value('id',data['id'])
                        item_loader.add_value('url_token',data['url_token'])
                        item_loader.add_value('headline',data['headline']
                                                            if data['headline'] else "無")
                        item_loader.add_value('answer_count',data['answer_count'])
                        item_loader.add_value('articles_count',data['articles_count'])
                        item_loader.add_value('gender',data['gender']
                                                            if data['gender'] else 0)
                        item_loader.add_value('avatar_url',data['avatar_url_template'].format(size='xl'))
                        item_loader.add_value('user_type',data['user_type'])
                        item_loader.add_value('badge',','.join([badge.get('description') for badge in data['badge']])
                                                            if data.get('badge') else "無")
                        item_loader.add_value('follower_count',data['follower_count'])
                        item_loader.add_value('following_count',data['following_count'])
                        item_loader.add_value('agreed_count',agreed_count)
                        item_loader.add_value('thxd_count',thxd_count)
                        item_loader.add_value('collected_count',collected_count)
                        item_loader.add_value('craw_time',datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
                        zhihu_item = item_loader.load_item()
                        yield zhihu_item
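One note on the fragment above: ZhihuUserItemLoader is not defined in it, and it also assumes from datetime import datetime at the top of the module. A minimal loader definition consistent with this usage (my assumption, not the original code) just collapses each field's collected list into a scalar, since ItemLoader.add_value accumulates lists by default:

# Assumed loader: TakeFirst() reduces each field from a list of
# collected values to a single scalar before load_item().
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class ZhihuUserItemLoader(ItemLoader):
    default_output_processor = TakeFirst()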

Writing to the database
With the data in hand, the item can be inserted into MySQL by adding a pipeline:

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self,dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls,settings):
        dbpara=dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_NAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASS'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True
        )
        dbpool = adbapi.ConnectionPool("MySQLdb",**dbpara)
        return cls(dbpool)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert,item)
        query.addErrback(self.handle_error)
        return item

    def handle_error(self,failure):
        print(failure)

    def do_insert(self,cursor,item):
        insert_sql = item.get_insert_sql()
        cursor.execute(insert_sql)

The complete Item:

class ZhihuUserItem(Item):
    name = scrapy.Field()
    id = scrapy.Field()
    url_token = scrapy.Field()
    headline = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    gender = scrapy.Field()
    avatar_url = scrapy.Field()
    user_type = scrapy.Field()
    following_count = scrapy.Field()
    follower_count = scrapy.Field()
    thxd_count = scrapy.Field()
    agreed_count = scrapy.Field()
    collected_count = scrapy.Field()
    badge = scrapy.Field()
    craw_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = '''
            replace into zhihu_user1 values('{}','{}','{}','{}',{},{},{},'{}','{}',{},{},{},{},{},'{}','{}')
            '''.format(self['name'], self['id'], self['url_token'], self['headline'], self['answer_count'],
                       self['articles_count'], self['gender'], self['avatar_url'], self['user_type'],
                       self['following_count'], self['follower_count'], self['thxd_count'],
                       self['agreed_count'], self['collected_count'], self['badge'], self['craw_time'])

        return insert_sql
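Building the SQL by string formatting like this breaks as soon as a field contains a quote character, and it is injection-prone. A safer variant (a sketch with the same column order) returns a parameterized statement plus a value tuple, with do_insert then calling cursor.execute(sql, params):

# Sketch: parameterized alternative to get_insert_sql; the MySQLdb
# driver handles quoting, so stray quotes in names cannot break the SQL.
def get_insert_sql(self):
    insert_sql = '''
        replace into zhihu_user1
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
    params = (self['name'], self['id'], self['url_token'], self['headline'],
              self['answer_count'], self['articles_count'], self['gender'],
              self['avatar_url'], self['user_type'], self['following_count'],
              self['follower_count'], self['thxd_count'], self['agreed_count'],
              self['collected_count'], self['badge'], self['craw_time'])
    return insert_sql, params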

Set a breakpoint inside the pipeline's handle_error function (the error-handling spot), run under the debugger, and confirm that rows are landing in the database.

Running the project, however, we scrape barely a few hundred users before HTTP 429 and even HTTP 403 responses start to appear.

Dealing with HTTP 429
HTTP 429 means requests are being sent too fast; Zhihu's public API evidently imposes call limits, for example per IP or per calling user.

For Scrapy I haven't yet found a silver-bullet fix; the most convenient measure I know of is a download delay, set via the DOWNLOAD_DELAY variable in settings.py, as sketched below.
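The values here are illustrative; RANDOMIZE_DOWNLOAD_DELAY is a stock Scrapy setting that adds jitter to the delay:

# settings.py -- throttle the request rate (illustrative values)
DOWNLOAD_DELAY = 0.4              # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter each delay between 0.5x and 1.5x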

Another measure is to rotate proxies and User-Agents at random, by writing a Scrapy downloader middleware like this:

import random


class ProxyMiddleware(object):
    # overwrite process_request

    def process_request(self, request, spider):
        # pick a proxy at random; these lists could also live at class level

        proxy_ip_list = [
            'http://127.0.0.1:9999',
            'http://120.xxx.x.x:xxx',
            #....
        ]

        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"
        ]

        proxy_ip = random.choice(proxy_ip_list)
        ua = random.choice(user_agent_list)
        request.headers.setdefault('User-Agent', ua)

        request.meta['proxy'] = proxy_ip
        print('the current proxy address is', proxy_ip)

Then enable this downloader middleware in settings.py:
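A sketch of the registration; the zhihu_crawl.middlewares module path is an assumption, adjust it to your project layout:

# settings.py -- register the custom downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'zhihu_crawl.middlewares.ProxyMiddleware': 543,
}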

At this point the crawler is mostly complete. Running it again, it starts returning HTTP 403 after a certain (not very large) number of users; pasting the failing request URL into a browser shows that a captcha has to be filled in again. Before long, Zhihu restricted my account, and even the mobile client could no longer browse normally.

My guess is that Zhihu uses the cookie to identify, and then block, accounts that trip the request threshold... so we should also set COOKIES_ENABLED = False in settings.py.

Anonymous crawling with OAuth

A few days ago I came across 靜謐's approach.

It does not simulate login at all; but to be able to access the data, you need to sniff out the OAuth value and add it to the headers.

這個Oauth相當(dāng)一個令牌,我對其目前還不太了解,先不做闡述。

Note that our ProxyMiddleware above rewrites the headers, so this header needs to be added inside ProxyMiddleware:

request.headers.setdefault('authorization',
                            'oauth c3cef7c66a1843f8b3a9e6a1e3160e20')

With that, we only need to look after good, stable proxies and a sensible download delay, and the crawl can run.

Crawling anonymously removes many of the restrictions of the simulated-login approach, but the delay, proxies, and random User-Agents are still necessary.

On a single machine with DOWNLOAD_DELAY = 0.4 and two proxies, it grabs roughly 20k+ user records per hour, which is already a fairly high rate for this kind of crawler.
