Python Crawler: Cracking a JS-Encrypted Cookie

Preface

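I maintain a proxy pool project on GitHub whose proxies are scraped from free proxy listing sites. This morning someone told me that one of the proxy scraping interfaces had stopped working and was returning status 521. Wanting to help, I ran the code myself and found that this was indeed the case.

  Comparing packet captures in Fiddler, I could be fairly sure that an encrypted Cookie generated by JavaScript was what made the original request return 521.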

Finding the Problem

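Open Fiddler and load the target site (http://www.kuaidaili.com/proxylist/2/) in a browser. You can see that the browser loads the page twice: the first request returns 521, and only the second returns the data normally. Readers who have never built a website, or who have little crawling experience, may find this strange. Why does the browser get the data while the code does not?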

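Looking closely at the two responses, you can see that:

1. The Cookie of the second request has one extra entry compared with the first: _ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971

2. The body of the first response is a block of complicated, unreadable JS code, while the second response contains the correct content.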

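  This is in fact a common anti-crawler technique. The process is roughly as follows: on the first request, the server returns dynamically generated, obfuscated JS with status code 521. The job of that JS is to add a new entry to the Cookie, which the server uses for verification. The browser then requests the page again with the new Cookie, the server validates it, and the data is returned. This is also why the code could not get any data.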
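The handshake described above can be recognized in code before trying to solve it. The following is a minimal sketch, assuming the challenge always arrives with status 521 and an inline script that schedules the obfuscated function through setTimeout; the helper name looks_like_cookie_challenge and the shortened sample body are made up for illustration.

```python
def looks_like_cookie_challenge(status_code, body):
    """Heuristic check: does this response look like the JS cookie challenge?"""
    # Assumption: the challenge comes with status 521 and an inline script
    # that schedules an obfuscated function via setTimeout.
    return status_code == 521 and "setTimeout(" in body and "function " in body


# Shortened stand-in for the first response (the real body is much longer).
challenge = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lu(158)", 200); '
    'function lu(OE) {/* obfuscated body */} </script> </body></html>'
)

print(looks_like_cookie_challenge(521, challenge))                   # True
print(looks_like_cookie_challenge(200, "<html>proxy list</html>"))   # False
```

A check like this keeps the cookie-solving path from running on ordinary error pages.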


Solving the Problem

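  The first time I ran into this, my initial idea was simple: since the Cookie is generated by JS, I could translate the JS function into Python and run it. That turned out to be naive, because JS these days is routinely obfuscated. The original JS looks like this: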

function lq(VA) {
    var qo, mo = "", no = "", oo = [0x8c,0xcd,0x4c,0xf9,0xd7,0x4d,0x25,0xba,0x3c,0x16,0x96,0x44,0x8d,0x0b,0x90,0x1e,0xa3,0x39,0xc9,0x86,0x23,0x61,0x2f,0xc8,0x30,0xdd,0x57,0xec,0x92,0x84,0xc4,0x6a,0xeb,0x99,0x37,0xeb,0x25,0x0e,0xbb,0xb0,0x95,0x76,0x45,0xde,0x80,0x59,0xf6,0x9c,0x58,0x39,0x12,0xc7,0x9c,0x8d,0x18,0xe0,0xc5,0x77,0x50,0x39,0x01,0xed,0x93,0x39,0x02,0x7e,0x72,0x4f,0x24,0x01,0xe9,0x66,0x75,0x4e,0x2b,0xd8,0x6e,0xe2,0xfa,0xc7,0xa4,0x85,0x4e,0xc2,0xa5,0x96,0x6b,0x58,0x39,0xd2,0x7f,0x44,0xe5,0x7b,0x48,0x2d,0xf6,0xdf,0xbc,0x31,0x1e,0xf6,0xbf,0x84,0x6d,0x5e,0x33,0x0c,0x97,0x5c,0x39,0x26,0xf2,0x9b,0x77,0x0d,0xd6,0xc0,0x46,0x38,0x5f,0xf4,0xe2,0x9f,0xf1,0x7b,0xe8,0xbe,0x37,0xdf,0xd0,0xbd,0xb9,0x36,0x2c,0xd1,0xc3,0x40,0xe7,0xcc,0xa9,0x52,0x3b,0x20,0x40,0x09,0xe1,0xd2,0xa3,0x80,0x25,0x0a,0xb2,0xd8,0xce,0x21,0x69,0x3e,0xe6,0x80,0xfd,0x73,0xab,0x51,0xde,0x60,0x15,0x95,0x07,0x94,0x6a,0x18,0x9d,0x37,0x31,0xde,0x64,0xdd,0x63,0xe3,0x57,0x05,0x82,0xff,0xcc,0x75,0x79,0x63,0x09,0xe2,0x6c,0x21,0x5c,0xe0,0x7d,0x4a,0xf2,0xd8,0x9c,0x22,0xa3,0x3d,0xba,0xa0,0xaf,0x30,0xc1,0x47,0xf4,0xca,0xee,0x64,0xf9,0x7b,0x55,0xd5,0xd2,0x4c,0xc9,0x7f,0x25,0xfe,0x48,0xcd,0x4b,0xcc,0x81,0x1b,0x05,0x82,0x38,0x0e,0x83,0x19,0xe3,0x65,0x3f,0xbf,0x16,0x88,0x93,0xdd,0x3b];
    qo = "qo=241; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>3)|((oo[qo]<<5)&0xff))-70)&0xff;} while(--qo>=2);";
    eval(qo);
    qo = 240;
    do {
        oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff;
    } while (--qo >= 3);
    qo = 1;
    for (;;) {
        if (qo > 240) break;
        oo[qo] = ((((((oo[qo] + 2) & 0xff) + 76) & 0xff) << 1) & 0xff) | (((((oo[qo] + 2) & 0xff) + 76) & 0xff) >> 7);
        qo++;
    }
    po = "";
    for (qo = 1; qo < oo.length - 1; qo++)
        if (qo % 6) po += String.fromCharCode(oo[qo] ^ VA);
    eval("qo=eval;qo(po);");
}

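  Faced with JS like this, all I can say is: forgive my weak JS skills, I cannot de-obfuscate it...

  Readers with front-end experience will quickly think of another way, though: the browser's JS debugger. With it the problem becomes easy. Create a new html file, paste in the raw html of the first response, save it, open it in a browser, and set a breakpoint just before the final eval. You will see output like this: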

You can see that the variable po is document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; window.document.location=document.URL, and that it is followed by eval("qo=eval;qo(po);"). eval in JS works much like eval in Python. That last statement assigns the eval function to qo and then uses it to evaluate the string po. The first half of po sets the Cookie in the browser, and the second half, window.document.location=document.URL, reloads the current page.

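  This confirms what I said above: the first request carries no Cookie, so the server returns a piece of JS that generates the Cookie and reloads the page automatically. The browser can execute that code and then request the data again with the new Cookie. Python, on receiving the same code, is stuck at the first step.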

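  So how can Python run this JS as well? The answer is PyV8. V8 is the JavaScript engine embedded in Chromium and is reputed to be the fastest. PyV8 is a Python wrapper around V8's external API, which lets Python work with JavaScript directly. For installing PyV8, please search for instructions yourself.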

Code

  With the analysis done, let's get to the point and write the code.

  First, request the page normally, which returns html containing the obfuscated JS function:

import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


# First request: fetch the dynamically obfuscated JS
first_html = getHtml(TARGET_URL)

  Because the response is html rather than a bare JS function, we use regular expressions to extract the JS function and the argument it is called with.

# Extract the obfuscated JS function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print 'get js func:\n', js_func

# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print 'get js arg:\n', js_arg
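To see what the two regular expressions capture, here is a small self-contained sketch run against a shortened, made-up stand-in for the first response (SAMPLE_HTML and the helper extract_challenge are illustrative names, not part of the original script). It uses the same patterns, adds re.S in case the function spans several lines, and raises an error instead of silently returning empty strings when nothing matches.

```python
import re

# Shortened, made-up stand-in for the first response; the real array is far longer.
SAMPLE_HTML = (
    '<html><body><script language="javascript"> '
    'window.onload=setTimeout("lu(158)", 200); '
    'function lu(OE) {var po = "x"; eval("qo=eval;qo(po);");} '
    '</script> </body></html>'
)


def extract_challenge(html):
    """Return (js_function_source, argument) or raise ValueError."""
    # Everything from the first "function " up to the closing script tag.
    func = re.search(r'(function .*?)</script>', html, re.S)
    # The digits passed to the function inside setTimeout("name(digits)", ...).
    arg = re.search(r'setTimeout\("\D+\((\d+)\)"', html)
    if func is None or arg is None:
        raise ValueError('no JS challenge found in response')
    return func.group(1).strip(), arg.group(1)


js_func, js_arg = extract_challenge(SAMPLE_HTML)
print(js_arg)   # 158
print(js_func)  # function lu(OE) {var po = "x"; eval("qo=eval;qo(po);");}
```

Note that the argument comes back as a string of digits, just as in the script above.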

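One more thing to note: the JS function does not return the cookie; it sets the cookie directly in the browser. So we need to replace eval("qo=eval;qo(po);") with return po. That way the function returns the contents of po.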

def executeJS(js_func_string, arg):
    ctxt = PyV8.JSContext()
    ctxt.enter()
    # Wrap the source in parentheses so that eval yields the function object
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


# Patch the JS function so that it returns the cookie string
js_func = js_func.replace('eval("qo=eval;qo(po);")', 'return po')

# Run the JS to obtain the cookie
cookie_str = executeJS(js_func, js_arg)
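One caveat with a plain str.replace is that it does nothing if the expected text is missing, so a changed challenge format would only show up later as a confusing failure. A minimal sketch of a guarded version follows; the helper make_returning and the tiny sample function are hypothetical, and the sketch assumes the eval tail is exactly the string shown in this article.

```python
EVAL_TAIL = 'eval("qo=eval;qo(po);")'


def make_returning(js_func):
    """Swap the final eval for `return po`, failing loudly if it is absent."""
    if EVAL_TAIL not in js_func:
        raise ValueError('expected eval tail not found; '
                         'the challenge format may have changed')
    return js_func.replace(EVAL_TAIL, 'return po')


# Tiny made-up function with the same tail as the real challenge.
sample = 'function lu(OE) {var po = "x"; eval("qo=eval;qo(po);");}'
print(make_returning(sample))
# prints: function lu(OE) {var po = "x"; return po;}
```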

  The cookie returned this way is a string, but requests.get() expects a dict, so we convert it:

def parseCookie(string):
    string = string.replace("document.cookie='", "")
    # Keep only the first "name=value" pair, dropping expires/domain/path
    clearance = string.split(';')[0]
    return {clearance.split('=')[0]: clearance.split('=')[1]}


# Convert the cookie to a dict
cookie = parseCookie(cookie_str)
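The conversion can be tried without a JS engine by feeding in the po string observed in the debugger earlier. The sketch below is a slightly stricter variant, not the original function: parse_cookie_assignment is a hypothetical name, and it uses partition so that a value that itself contains "=" would not be cut short.

```python
import re


def parse_cookie_assignment(po):
    """Extract {name: value} from a "document.cookie='name=value; ...'" string."""
    match = re.search(r"document\.cookie='([^']*)'", po)
    if match is None:
        raise ValueError('no document.cookie assignment found')
    # Only the first pair is the cookie itself; the rest are attributes.
    first_pair = match.group(1).split(';', 1)[0]
    name, _, value = first_pair.partition('=')
    return {name.strip(): value.strip()}


po = ("document.cookie='_ydclearance=0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971; "
      "expires=Thu, 23-Mar-17 07:42:51 GMT; domain=.kuaidaili.com; path=/'; "
      "window.document.location=document.URL")
print(parse_cookie_assignment(po))
# {'_ydclearance': '0c316df6ea04c5281b421aa8-5570-47ae-9768-2510d9fe9107-1490254971'}
```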

  Finally, request the page again with the parsed Cookie, and the data comes back successfully:

# Request the URL again with the cookie and get the real data
print getHtml(TARGET_URL, cookie)[0:500]
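Put together, the whole exchange is "request, solve the challenge if one comes back, request again". The sketch below shows only that control flow, with the network call and the JS solver passed in as functions so it can be run without PyV8 or network access. The names fetch_with_clearance, fake_fetch and fake_solve are illustrative, and the 521 check assumes the behaviour observed in this article.

```python
def fetch_with_clearance(fetch, solve, url):
    """Fetch url; if the 521 challenge appears, solve it and retry once."""
    status, body = fetch(url, None)
    if status != 521:
        return body
    cookies = solve(body)            # e.g. extract, patch and run the JS
    status, body = fetch(url, cookies)
    if status == 521:
        raise RuntimeError('still challenged after setting the cookie')
    return body


# Stand-ins for the real request and the real JS solver.
def fake_fetch(url, cookies):
    if cookies and cookies.get('_ydclearance') == 'token':
        return 200, '<html>proxy list</html>'
    return 521, '<script>challenge</script>'


def fake_solve(body):
    return {'_ydclearance': 'token'}


print(fetch_with_clearance(fake_fetch, fake_solve,
                           'http://www.kuaidaili.com/proxylist/1/'))
# prints: <html>proxy list</html>
```

Retrying only once keeps the crawler from looping forever if the site stops accepting the computed cookie.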

Here is the complete code:

# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File Name:     demo_1.py.py
   Description :  Python crawler: cracking a JS-encrypted Cookie
                  Example site (kuaidaili): http://www.kuaidaili.com/proxylist/1/
   Document:
   Author :       JHao
   date:          2017/3/23
-------------------------------------------------
   Change Activity:
                   2017/3/23: cracking a JS-encrypted Cookie
-------------------------------------------------
"""
__author__ = 'JHao'

import re
import PyV8
import requests

TARGET_URL = "http://www.kuaidaili.com/proxylist/1/"


def getHtml(url, cookie=None):
    header = {
        "Host": "www.kuaidaili.com",
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    html = requests.get(url=url, headers=header, timeout=30, cookies=cookie).content
    return html


def executeJS(js_func_string, arg):
    ctxt = PyV8.JSContext()
    ctxt.enter()
    func = ctxt.eval("({js})".format(js=js_func_string))
    return func(arg)


def parseCookie(string):
    string = string.replace("document.cookie='", "")
    clearance = string.split(';')[0]
    return {clearance.split('=')[0]: clearance.split('=')[1]}


# First request: fetch the dynamically obfuscated JS
first_html = getHtml(TARGET_URL)

# first_html = """
# <html><body><script language="javascript"> window.onload=setTimeout("lu(158)", 200); function lu(OE) {var qo, mo="", no="", oo = [0x64,0xaa,0x98,0x3d,0x56,0x64,0x8b,0xb0,0x88,0xe1,0x0d,0xf4,0x99,0x31,0xd8,0xb6,0x5d,0x73,0x98,0xc3,0xc4,0x7a,0x1e,0x38,0x9d,0xe8,0x8d,0xe4,0x0a,0x2e,0x6c,0x45,0x69,0x41,0xe5,0xd0,0xe5,0x11,0x0b,0x35,0x7b,0xe4,0x09,0xb1,0x2b,0x6d,0x82,0x7c,0x25,0xdd,0x70,0x5a,0xc4,0xaa,0xd3,0x74,0x98,0x42,0x3c,0x60,0x2d,0x42,0x66,0xe0,0x0a,0x2e,0x96,0xbb,0xe2,0x1d,0x38,0xdc,0xb1,0xd6,0x0e,0x0d,0x76,0xae,0xc3,0xa9,0x3b,0x62,0x47,0x40,0x15,0x93,0xb7,0xee,0xc3,0x3e,0xfd,0xd3,0x0d,0xf6,0x61,0xdc,0xf1,0x2c,0x54,0x8c,0x90,0xfa,0x24,0x5b,0x83,0x0c,0x75,0xaf,0x18,0x01,0x7e,0x68,0xe0,0x0a,0x72,0x1e,0x88,0x33,0xa7,0xcc,0x31,0x9b,0xf3,0x1a,0xf2,0x9a,0xbf,0x58,0x83,0xe4,0x87,0xed,0x07,0x7e,0xe2,0x00,0xe9,0x92,0xc9,0xe8,0x59,0x7d,0x56,0x8d,0xb5,0xb2,0x6c,0xe0,0x49,0x73,0xfc,0xe7,0x20,0x49,0x34,0x09,0x71,0xeb,0x60,0xfd,0x8e,0xad,0x0f,0xb9,0x2e,0x77,0xdc,0x74,0x9b,0xbf,0x8f,0xa5,0x8d,0xb8,0xb0,0x06,0xac,0xc5,0xe9,0x10,0x12,0x77,0x9b,0xb1,0x19,0x4e,0x64,0x5c,0x00,0x98,0xc6,0xed,0x98,0x0d,0x65,0x11,0x35,0x9e,0xf4,0x30,0x93,0x4b,0x00,0xab,0x20,0x8f,0x29,0x4f,0x27,0x8c,0xc2,0x6a,0x04,0xfb,0x51,0xa3,0x4b,0xef,0x09,0x30,0x28,0x4d,0x25,0x8e,0x76,0x58,0xbf,0x57,0xfb,0x20,0x78,0xd1,0xf7,0x9f,0x77,0x0f,0x3a,0x9f,0x37,0xdb,0xd3,0xfc,0x14,0x39,0x11,0x3b,0x94,0x8c,0xad,0x8e,0x5c,0xd3,0x3b];qo = "qo=251; do{oo[qo]=(-oo[qo])&0xff; oo[qo]=(((oo[qo]>>4)|((oo[qo]<<4)&0xff))-0)&0xff;} while(--qo>=2);"; eval(qo);qo = 250; do { oo[qo] = (oo[qo] - oo[qo - 1]) & 0xff; } while (-- qo >= 3 );qo = 1; for (;;) { if (qo > 250) break; oo[qo] = ((((((oo[qo] + 200) & 0xff) + 121) & 0xff) << 6) & 0xff) | (((((oo[qo] + 200) & 0xff) + 121) & 0xff) >> 2); qo++;}po = ""; for (qo = 1; qo < oo.length - 1; qo++) if (qo % 5) po += String.fromCharCode(oo[qo] ^ OE);eval("qo=eval;qo(po);");} </script> </body></html>
# """

# Extract the obfuscated JS function
js_func = ''.join(re.findall(r'(function .*?)</script>', first_html))
print 'get js func:\n', js_func

# Extract the argument the JS function is called with
js_arg = ''.join(re.findall(r'setTimeout\(\"\D+\((\d+)\)\"', first_html))
print 'get js arg:\n', js_arg

# Patch the JS function so that it returns the cookie string
js_func = js_func.replace('eval("qo=eval;qo(po);")', 'return po')

# Run the JS to obtain the cookie
cookie_str = executeJS(js_func, js_arg)

# Convert the cookie to a dict
cookie = parseCookie(cookie_str)
print cookie

# Request the URL again with the cookie and get the real data
print getHtml(TARGET_URL, cookie)[0:500]
