Crawler problem: pages that only become accessible after a redirect
Some pages make you wait 5 seconds before redirecting to the page you actually asked for, showing the message: Your browser will redirect to your requested content shortly.
This is meant to keep crawlers out. requests offers no way to get past it, and using chromedriver for every request is far too slow.
I noticed that the wait only appears the first time the page is opened; refreshing afterwards does not trigger it, which suggests a cookie is being set.
Copying the browser's cookie into my script returned the JSON data I wanted. That confirms the first visit sets a cookie, and later requests carrying it go straight to the data.
After a while, requests with the copied cookie hit the 5-second wait page again, so the cookie expires.
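The manual check described above can be sketched in a few lines. This is only an illustration: the helper names `parse_cookie_string` and `is_wait_page` are made up here, and matching on the message text is an assumption about how to recognise the wait page, not something the site documents.

```python
# Text shown by the wait page, as quoted above.
WAIT_PAGE_MARKER = "Your browser will redirect to your requested content shortly"


def parse_cookie_string(raw):
    """Turn a copied 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


def is_wait_page(body):
    """Return True if the response body still looks like the wait page."""
    return WAIT_PAGE_MARKER in body
```

With requests, the dict can be passed as `requests.get(url, cookies=parse_cookie_string(copied), headers=headers)`, and `is_wait_page(res.text)` then tells you whether the cookie has expired. Here `copied`, `url` and `headers` stand for your own values.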
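Solution:
Use chromedriver for the first visit to obtain the cookie, then use requests.Session to keep the cookie up to date. At first, requests made with the cookie from the headless browser still failed. The cause turned out to be that their user-agent differed from the one chromedriver had used. After setting the headless browser's user-agent explicitly so that both match, the requests worked. The code is as follows: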
import time

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://example.com/data'  # placeholder: replace with the page you need to fetch
# The user-agent set here must match the one sent with every later request, otherwise the request fails
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
headers = {'User-Agent': user_agent}

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('user-agent=' + user_agent)
chrome_options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=chrome_options)  # newer Selenium uses options=; chrome_options= is deprecated
driver.get(url)
time.sleep(10)  # wait for the 5-second redirect to finish and the cookie to be set
cookies = {}
for cookie in driver.get_cookies():
    cookies[cookie['name']] = cookie['value']
print(cookies)
driver.quit()

s = requests.Session()
s.cookies = requests.utils.cookiejar_from_dict(cookies, cookiejar=None, overwrite=True)
res = s.get(url, headers=headers)
print(res)
With this session you can then request the URL repeatedly without the 5-second wait page appearing, until the cookie expires.
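Since the cookie eventually expires, one possible way to keep a long-running crawler going is to fall back to the browser only when the wait page shows up again. The sketch below is an assumption about how that could be wired together, not part of the original code: `get_with_cookie_refresh` and the `refresh_cookies` callable are hypothetical names, and the wait page is recognised simply by its message text.

```python
# Text shown by the wait page; matching on it is an assumption.
WAIT_PAGE_MARKER = "Your browser will redirect to your requested content shortly"


def get_with_cookie_refresh(session, url, headers, refresh_cookies, max_refreshes=1):
    """GET url; if the wait page comes back, refresh the cookies and retry.

    session: a requests.Session (anything with .get and .cookies.update works)
    refresh_cookies: hypothetical callable returning a {name: value} dict,
        for example a function wrapping the headless Chrome step
    """
    res = session.get(url, headers=headers)
    for _ in range(max_refreshes):
        if WAIT_PAGE_MARKER not in res.text:
            break  # real content received, no refresh needed
        session.cookies.update(refresh_cookies())
        res = session.get(url, headers=headers)
    return res
```

This way the slow chromedriver step runs only when the cookie has actually expired, and every other request stays a plain requests call. The `headers` passed in should still carry the same user-agent that the browser used.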