Scraping Youdao Translate with Python
Step 1:
Open the Youdao Translate web page: http://fanyi.youdao.com/
Step 2:
Press F12 to open the developer tools and go to Network → XHR, where the translation request appears, as shown below:

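Step 3:
Click the entry under Name. If the Name column is empty, no translation has been requested yet; translate something on the page and the entry will appear. Clicking it shows the request URL, request method, request headers, and request parameters, as shown below: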


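As shown above, this is all the information our Python request needs. With it in hand, we can write the Python code.
Step 4:
Write the corresponding Python code.
First, the request method is POST, so we start with a helper for POST requests: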
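import requests

def post(url, header=None, data=None):
    # Fall back to a browser-like User-Agent when no headers are supplied
    if header is None:
        header = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
        }
    # Pass headers and data by keyword; passed positionally they would
    # land in the wrong parameters of requests.post
    result = requests.post(url, headers=header, data=data)
    return result.text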
Next, using the request URL, parameters, and headers captured above, we can write the function that calls the interface:
def getTranslate(data):
    # First attempt: replay the captured request with the fixed salt, sign and ts
    jsonStr = post(
        url='http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule',
        header={
            'Cookie': 'OUTFOX_SEARCH_USER_ID=1389460813@123.125.1.12',
            'Referer': 'http://fanyi.youdao.com/',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        },
        # Sent as form data, matching the Form Data shown in the browser
        data={
            'i': data,
            'from': 'zh-CHS',
            'to': 'en',
            'smartresult': 'dict',
            'client': 'fanyideskweb',
            'salt': '15901129623685',
            'sign': '7ebd4fe242ddf38339c8cc2ef09875bb',
            'ts': '1590110946423',
            'bv': '8abdac95d8a218e4fd5735f7cd2ab169',
            'doctype': 'json',
            'version': '2.1',
            'keyfrom': 'fanyi.web',
            'action': 'FY_BY_REALTIME'
        })
    print(jsonStr)
Calling this function returns {"errorCode":50}
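A failure like this is easy to miss when the body is only printed, so it can help to check errorCode in code. The following is a minimal sketch, not part of the original program: the names check_response and TranslateError are hypothetical, and it assumes that a successful response carries errorCode 0 (a missing errorCode is also treated as success here).

```python
import json


class TranslateError(RuntimeError):
    """Raised when the response carries a non-zero errorCode (hypothetical helper)."""


def check_response(text):
    """Parse the response body and raise if errorCode is non-zero.

    Assumption: errorCode 0 (or an absent errorCode) means success.
    """
    payload = json.loads(text)
    code = payload.get("errorCode", 0)
    if code != 0:
        raise TranslateError(f"request rejected, errorCode={code}")
    return payload


if __name__ == "__main__":
    try:
        check_response('{"errorCode":50}')
    except TranslateError as exc:
        print(exc)
```

Raising an exception rather than returning the raw text keeps a rejected request from being mistaken for a translation further down the line.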
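The errorCode of 50 shows that the request did not succeed, even though the parameters and headers look correct. So let's consult an anti-scraping handbook: http://www.itdecent.cn/p/ea6f47ad2042
From this we can conclude that the Youdao Translate interface uses signed parameters to block requests like ours. Inspecting the Form Data shows that three parameters, salt, sign, and ts, change on every request, so the problem must lie with them. The page loads fanyi.min.js every time, so these values are most likely computed in that script. We locate the script, copy its contents, and format them with an online JS formatter: https://tool.oschina.net/codeformat/js
This reveals the key JS code: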
function(e, t) {
    var n = e("./jquery-1.7");
    e("./utils");
    e("./md5");
    var r = function(e) {
        var t = n.md5(navigator.appVersion),
            r = "" + (new Date).getTime(),
            i = r + parseInt(10 * Math.random(), 10);
        return {
            ts: r,
            bv: t,
            salt: i,
            sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")
        }
    };
    t.recordUpdate = function(e) {
        var t = e.i,
            i = r(t);
        n.ajax({
            type: "POST",
            contentType: "application/x-www-form-urlencoded; charset=UTF-8",
            url: "/bettertranslation",
            data: {
                i: e.i,
                client: "fanyideskweb",
                salt: i.salt,
                sign: i.sign,
                ts: i.ts,
                bv: i.bv,
                tgt: e.tgt,
                modifiedTgt: e.modifiedTgt,
                from: e.from,
                to: e.to
            },
            success: function(e) {},
            error: function(e) {}
        })
    }
The key lines are:
r = "" + (new Date).getTime(),
i = r + parseInt(10 * Math.random(), 10);
ts: r,
salt: i,
sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")
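Because the secret string inside the sign expression is hard-coded in the script, it may differ between versions of fanyi.min.js. The sketch below pulls the client name and the secret out of the formatted JS text with a regular expression. It is only an illustration: extract_sign_parts is a hypothetical helper, and the pattern assumes the minified variable names (n, e, i) look exactly as in the excerpt above.

```python
import re

# Matches a line of the form:
#   sign: n.md5("fanyideskweb" + e + i + "<secret>")
# Assumption: the minified names n, e and i are unchanged.
SIGN_PATTERN = re.compile(
    r'sign:\s*n\.md5\("(?P<client>[^"]+)"\s*\+\s*e\s*\+\s*i\s*\+\s*"(?P<secret>[^"]+)"\)'
)


def extract_sign_parts(js_source):
    """Return (client, secret) from the JS text, or None if nothing matches."""
    match = SIGN_PATTERN.search(js_source)
    if match is None:
        return None
    return match.group("client"), match.group("secret")


if __name__ == "__main__":
    sample = 'sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")'
    print(extract_sign_parts(sample))
```

If the function returns None, the script layout has probably changed and the pattern needs to be adjusted by hand.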
So all we need are Python functions that produce these three values, as follows:
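import hashlib
import random
import time

def get_salt():
    # Mirror the JS: the millisecond timestamp as a string with one random
    # digit appended (string concatenation, not numeric addition)
    salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
    return salt

def get_md5(v):
    md5 = hashlib.md5()  # MD5 cannot be reversed, but the same input always gives the same digest, so known inputs can be found by lookup or collision
    md5.update(v.encode('utf-8'))  # the string to hash goes here
    value = md5.hexdigest()  # hex digest of the string
    return value

def get_sign(key, salt):
    # The trailing secret is hard-coded in fanyi.min.js and can change between
    # versions (the JS excerpt above shows a different value); use the current one
    sign = 'fanyideskweb' + key + str(salt) + 'n%A-rKaT5fb[Gy?;N5@Tj'
    sign = get_md5(sign)
    return sign

def get_ts():
    # Build the ts parameter from the current timestamp in milliseconds
    s = int(time.time() * 1000)
    return str(s)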
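In the JS, ts and salt are derived from the same timestamp, and salt is that timestamp with one digit appended. The following sketch ports that logic into a single function so the three values stay consistent with each other. The function name build_signed_params and its parameters are hypothetical; the secret is passed in by the caller rather than hard-coded, and now_ms and digit exist only so the result can be made reproducible.

```python
import hashlib
import random
import time


def build_signed_params(text, secret, now_ms=None, digit=None):
    """Return ts, salt and sign derived from one shared timestamp.

    now_ms and digit may be fixed for reproducibility; by default they are
    taken from the clock and the random generator.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if digit is None:
        digit = random.randint(0, 9)  # JS: parseInt(10 * Math.random(), 10)
    ts = str(now_ms)
    salt = ts + str(digit)  # string concatenation, as in the JS
    raw = "fanyideskweb" + text + salt + secret
    sign = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"ts": ts, "salt": salt, "sign": sign}


if __name__ == "__main__":
    # "placeholder-secret" is not a real key; substitute the one from the script
    print(build_signed_params("你好", "placeholder-secret"))
```

Computing all three values in one call avoids the case where salt and ts come from two slightly different clock readings.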
Now we can write the complete program:
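import requests
import json
import hashlib
import random
import time

def get(url, header=None, data=None):
    # Fall back to a browser-like User-Agent when no headers are supplied
    if header is None:
        header = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
        }
    # For a GET request the data is sent as query-string parameters
    result = requests.get(url, headers=header, params=data)
    return result.text

def post(url, header=None, data=None):
    # Fall back to a browser-like User-Agent when no headers are supplied
    if header is None:
        header = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
        }
    # Pass headers and data by keyword so they reach the right parameters
    result = requests.post(url, headers=header, data=data)
    return result.text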
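
def get_salt():
    # Mirror the JS: the millisecond timestamp as a string with one random
    # digit appended (string concatenation, not numeric addition)
    salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
    return salt

def get_md5(v):
    md5 = hashlib.md5()  # MD5 cannot be reversed, but the same input always gives the same digest, so known inputs can be found by lookup or collision
    md5.update(v.encode('utf-8'))  # the string to hash goes here
    value = md5.hexdigest()  # hex digest of the string
    return value

def get_sign(key, salt):
    # The trailing secret is hard-coded in fanyi.min.js and can change between
    # versions (the JS excerpt above shows a different value); use the current one
    sign = 'fanyideskweb' + key + str(salt) + 'n%A-rKaT5fb[Gy?;N5@Tj'
    sign = get_md5(sign)
    return sign

def get_ts():
    # Build the ts parameter from the current timestamp in milliseconds
    s = int(time.time() * 1000)
    return str(s)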

from urllib import request, parse

def youdao(key):
    # Request URL
    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    salt = get_salt()
    # Request body
    data = {
        "i": key,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": str(salt),  # long random string, guards against dictionary-based reverse lookup
        "sign": get_sign(key, salt),  # signature, computed the same way as in the JS
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_REALTIME",
        "typoResult": "false"
    }
    # Encode the body as form data
    data = parse.urlencode(data).encode()
    headers = {
        'Cookie': 'OUTFOX_SEARCH_USER_ID=1389460813@123.125.1.12',
        'Referer': 'http://fanyi.youdao.com/',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    req = request.Request(url=url, data=data, headers=headers)
    rsp = request.urlopen(req)
    html = rsp.read().decode()
    print(html)

if __name__ == '__main__':
    youdao('你好')
The result of running the program is shown below:

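The program prints the raw JSON body. To get only the translated text, the body can be parsed as sketched below. The response shape is an assumption, since the original output is not reproduced here: this sketch assumes a top-level translateResult list of paragraphs, each a list of segments with src and tgt fields, alongside errorCode. The helper name extract_translation is hypothetical.

```python
import json


def extract_translation(text):
    """Join the tgt fields of an assumed translateResult structure.

    Assumed shape: {"errorCode": 0, "translateResult": [[{"src": ..., "tgt": ...}]]}
    """
    payload = json.loads(text)
    if payload.get("errorCode") != 0:
        raise ValueError(f"unexpected errorCode: {payload.get('errorCode')}")
    lines = []
    for paragraph in payload.get("translateResult", []):
        # Each paragraph is a list of segments; concatenate their translations
        lines.append("".join(segment.get("tgt", "") for segment in paragraph))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '{"errorCode": 0, "translateResult": [[{"src": "你好", "tgt": "hello"}]]}'
    print(extract_translation(sample))
```

If the real response turns out to be structured differently, only the loop over translateResult needs to change.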