Getting status code "521" when scraping Babbitt (8btc) flash news

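I have recently been scraping blockchain-related flash news. After Babbitt's redesign last week I rewrote the crawler, and it broke after running for one day. It turned out that the site uses the Jiasule service, and every request from the crawler came back with status code 521.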

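When a browser visits the site:
First request: the server returns status code 521 together with a piece of JS code. The JS generates a cookie and requests the page again.
Second request: the browser sends the cookie obtained from the first response, and the server returns status code 200 as expected.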

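A crawler cannot execute JS the way a browser does, so it keeps getting 521.
Solution:

Make the crawler imitate the browser's behavior:
Put the returned JS code in a string, run it with execjs to decode it, and put the resulting cookie in the headers of the next request.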

Detailed steps:

Request the page directly

Tidy up the JS code that comes back:

<html>
<body>
<script language="javascript"> window.onload=setTimeout("ar(75)", 200); 
function ar(YH) {
    var qo, mo="", no="", oo = [0xc2,0x0c,0x22,0xa2,0x68,0x21,0xe8,0x3d,0x1e,0xbb,0x94,0x15,0x16,0x17,0x95,0x17,0x58,0x18,0xce,0xc6,0xc1,0xd6,0x16,0xb5,0x36,0xd6,0x96,0xd6,0xd0,0x2f,0x6f,0x50,0xd0,0x90,0x47,0x18,0xcd,0xa3,0x39,0x57,0x37,0x77,0x89,0x49,0x47,0x9d,0xdd,0x14,0x6a,0xab,0x8b,0x81,0x3f,0x15,0x4c,0xc2,0x49,0x68,0x08,0x1f,0x36,0xb6,0xec,0xaa,0x63,0x39,0x57,0xd7,0x6d,0x26,0x08,0x9d,0x1e,0x74,0x8b,0x44,0x84,0xb1,0x8f,0xe5,0x1d,0xd5,0xec,0x8c,0xa0,0xe0,0x18,0xd7,0x0f,0x46,0xe5,0x23,0x00,0xb6,0x37,0xb7,0x70,0xa6,0x4e,0x04,0x7a,0x18,0x0e,0xc3,0x79,0x4a,0x68,0xbe,0x74,0xeb,0x04,0xc3,0x67,0x86,0xa4,0xe5,0x44,0x04,0x82,0xcb,0x82,0x47,0x48,0x21,0xb9,0xd1,0xfa,0x51,0x6f,0x28,0x64,0x22,0x22,0xc0,0x71,0xaf,0xc6,0xde,0xf4,0x0c,0xd4,0x2c,0xe1,0xff,0x57,0xad,0x63,0x8c,0xa4,0xa8,0x65,0x07,0x7e,0x96,0xa7,0x47,0x48,0x01,0x41,0x82,0x63,0x33,0xe9,0xc2,0xd9,0x3a,0xdf,0x60,0x73,0x4c,0xcc,0xcd,0x8e,0x06,0x1e,0x1b,0x39,0x79,0x1f,0x40,0xf6,0xef,0xa3,0x9b,0x13,0x2b,0x29,0x6a,0x4b,0x6b,0x0b,0x0c,0x0a,0xe2,0x82,0x83,0x27,0xa7,0x65,0x26,0xe5,0xc6,0x64,0xef,0xc8,0x61,0x62,0xe2,0x23,0xc8,0xd0,0x0a,0x0b,0xeb,0xa2,0x42,0x43,0xee,0x6f,0x2d,0xed,0xad,0x8e,0x2c,0xfc,0xd5,0x97,0xf1,0xf0,0x3b];
    qo = "qo=228; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-251)&0xff;} while(--qo>=2);"; 
    eval(qo);
    qo = 227; 
    do { 
        oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff; 
    } while (-- qo >= 3 );
    qo = 1; 
    for (;;) { 
        if (qo > 227) break; 
        oo[qo] = ((((((oo[qo] + 28) & 0xff) + 148) & 0xff) << 6) & 0xff) | (((((oo[qo] + 28) & 0xff) + 148) & 0xff) >> 2); 
        qo++;
    }
    po = ""; 
    for (qo = 1; qo < oo.length - 1; qo++) if (qo % 7) po += String.fromCharCode(oo[qo] ^ YH);
    eval("qo=eval;qo(po);");
} 
</script> 
</body>
</html>
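
The last loop of ar() is the step that actually builds po: it walks the decoded array, skips index 0, the final index, and every index divisible by 7, and XORs each remaining byte with the argument YH. Below is a minimal Python sketch of just that step. The names decode_payload and encode_payload are hypothetical and introduced here for illustration; encode_payload only builds a small demo array and is not something the site provides.

```python
def decode_payload(oo, key):
    """Mirror the final loop of ar(): XOR the kept bytes with key."""
    return ''.join(
        chr(oo[qo] ^ key)
        for qo in range(1, len(oo) - 1)  # skip the first and last element
        if qo % 7                        # skip every index divisible by 7
    )


def encode_payload(text, key, filler=0):
    """Hypothetical inverse of decode_payload, used only to build demo input."""
    oo = [filler]                        # index 0 is never read by the decoder
    chars = iter(text)
    pending = next(chars, None)
    while pending is not None:
        if len(oo) % 7 == 0:
            oo.append(filler)            # this slot is ignored by the decoder
        else:
            oo.append(ord(pending) ^ key)
            pending = next(chars, None)
    oo.append(filler)                    # the last element is never read either
    return oo


demo = encode_payload("document.cookie", 75)
print(decode_payload(demo, 75))  # prints: document.cookie
```

The earlier loops in ar() (negation, bit rotation, running subtraction) are not reproduced here; they only transform the array before this final step.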

Then save it as an HTML file, open it in Chrome for debugging, and set a breakpoint at the eval. You can then see the value of the variable po: "document.cookie='_ydclearance=5640fae72a12f756938d88c1-60c4-4c28-a629-8da9e99d65cc-1534755025; expires=Mon, 20-Aug-18 08:50:25 GMT; domain=.8btc.com; path=/'; window.document.location=document.URL"
The first half of the string po sets a cookie in the browser; the second half, window.document.location=document.URL, reloads the current page.
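
In the po value captured above, the number at the end of _ydclearance (1534755025) is the Unix timestamp for 20 Aug 2018 08:50:25 UTC, the same moment as the expires field. Assuming that pattern holds for other responses (it is only observed in this one sample), a crawler could read the expiry from the cookie value and solve the challenge again before it lapses. The helper name clearance_expiry is hypothetical.

```python
from datetime import datetime, timezone


def clearance_expiry(value):
    """Read the trailing Unix timestamp from a _ydclearance value.

    Assumption: the part after the last '-' is the expiry time in seconds,
    as it is in the sample cookie shown in this article.
    """
    timestamp = int(value.rsplit('-', 1)[1])
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


sample = '5640fae72a12f756938d88c1-60c4-4c28-a629-8da9e99d65cc-1534755025'
print(clearance_expiry(sample).isoformat())  # prints: 2018-08-20T08:50:25+00:00
```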


So the key point is obtaining the cookie. In Python, execjs can be used to execute the JS:

import requests
import re
import execjs

headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36',
    }

def get_html(url):
    first_html = requests.get(url=url,headers=headers).content.decode('utf-8')
    return first_html


def executejs(first_html):
    # Extract the JS obfuscation function (re.S lets '.' span line breaks)
    js_string = ''.join(re.findall(r'(function .*?)</script>', first_html, re.S))

    # Extract the argument that is passed to the JS function
    js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
    js_name = re.findall(r'function (\w+)',js_string)[0]

    # Patch the JS function so that it returns the cookie script instead of eval-ing it
    js_string = js_string.replace('eval("qo=eval;qo(po);")', 'return po')

    func = execjs.compile(js_string)
    return func.call(js_name, int(js_arg))

def parse_cookie(string):
    string = string.replace("document.cookie='", "")
    clearance = string.split(';')[0]
    # Split on the first '=' only, so a value containing '=' is kept whole
    name, value = clearance.split('=', 1)
    return {name: value}



def return_cookie(url):
    first_html = get_html(url)
    # Run the JS to obtain the cookie script
    cookie_str = executejs(first_html)

    # Convert the cookie into a dict
    cookie = parse_cookie(cookie_str)
    print('cookies = ',cookie)
    return cookie


return_cookie(url='https://www.8btc.com/flash')

# Result:
cookies =  {'_ydclearance': '8c83e7fe9d6bd359e1eedc40-b55a-4ab5-98e2-22eb9b2ea9a7-1534917111'}
[Finished in 2.0s]
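
The script above stops once the cookie is printed. A possible sketch of the remaining step, the second request that carries the cookie, is shown below. It assumes a session object that behaves like requests.Session (a get() method and a cookies.update() method) and a solve_challenge callable that turns the 521 response body into a cookie dict; the function name fetch_with_clearance and both parameter names are hypothetical.

```python
def fetch_with_clearance(session, url, solve_challenge, headers=None):
    """Fetch url, solving the 521 JS challenge once if the server sends one.

    session         -- object with .get(url, headers=...) and .cookies.update(dict),
                       for example a requests.Session (assumption)
    solve_challenge -- callable taking the 521 response body and returning
                       a dict such as {'_ydclearance': '...'}
    """
    resp = session.get(url, headers=headers)
    if resp.status_code != 521:
        return resp                      # no challenge, nothing to do

    # Store the computed cookie on the session, then repeat the request.
    session.cookies.update(solve_challenge(resp.text))
    return session.get(url, headers=headers)
```

With the functions from this article, a call could look like fetch_with_clearance(requests.Session(), 'https://www.8btc.com/flash', lambda html: parse_cookie(executejs(html)), headers). Keeping the cookie on a session means later requests reuse it until it expires.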