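HTTP request headers give the server information about the client. Response headers give the client information about the requested document or about the server's status. A server judges who the client is from the request headers, so setting headers to disguise a crawler matters a great deal.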
HTTP Request
A complete HTTP request consists of the following parts:
Request line: method, URL, HTTP version
Request headers
<a blank line, marking the end of the headers>
Request data (the body, optional)
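Common request headers:
Accept: the media types the client can accept, e.g. Accept: text/html
User-Agent: the type of client software
Authorization: credentials, such as a username and password
Referer: the URL of the page from which the request was made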
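The parts and headers listed above can be put together by hand to see what a request looks like on the wire. This is only a minimal sketch: the helper name `build_request` is hypothetical, it assumes HTTP/1.1 with CRLF line endings, and it is not how a real client library assembles requests.

```python
def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request message as text."""
    lines = [f"{method} {path} HTTP/1.1"]  # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # header lines
    # A blank line ends the header section; the body (if any) follows it.
    return "\r\n".join(lines) + "\r\n\r\n" + body


raw = build_request(
    "GET",
    "/",
    {"Host": "movie.douban.com", "Accept": "text/html"},
)
print(raw)
```

A GET request normally has no body, so the message here ends right after the blank line.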
Real requests carry more headers. Below are the actual request headers sent for a Douban short-review page:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7
Cache-Control: max-age=0
Connection: keep-alive
Cookie: douban-fav-remind=1; _vwo_uuid_v2=DA9D8506AF55689A98FC4EC5458A1F005|9e3e2c5da18b4341ef7c7c5b1e6bc17d; __utmv=30149280.19413; douban-profile-remind=1; ll="118281"; __gads=ID=84d32737c7eb0e14:T=1564540928:S=ALNI_MYeYoLNcsUs74D0ASArxlCoDpjBIA; viewed="24872560"; bid=SdT44rmbqnQ; __utmz=223695111.1572102190.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __yadk_uid=FTuDNkSGj6E7mIoNRiLhR0HOeQGlFstY; push_noty_num=0; push_doumail_num=0; __utmz=30149280.1572351033.43.22.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/194130217/; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1576563221%2C%22https%3A%2F%2Fwww.douban.com%2F%22%5D; _pk_id.100001.4cf6=c746db049db29d28.1572102190.8.1576563221.1574420366.; _pk_ses.100001.4cf6=*; __utma=30149280.1363678983.1539603396.1574420366.1576563222.48; __utmb=30149280.0.10.1576563222; __utmc=30149280; __utma=223695111.1512084094.1572102190.1574420367.1576563222.8; __utmb=223695111.0.10.1576563222; __utmc=223695111
Host: movie.douban.com
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
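Headers copied from the browser like this are plain `Name: value` lines, so they can be turned into a Python dict for later use by a crawler. The helper `parse_headers` below is a hypothetical name and only a sketch. It assumes one header per line and, because it builds a dict, keeps only the last value if a name repeats.

```python
def parse_headers(raw_text):
    """Turn 'Name: value' lines into a dict, skipping blank lines."""
    headers = {}
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only: values such as URLs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


copied = """
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7
Host: movie.douban.com
Upgrade-Insecure-Requests: 1
"""
headers = parse_headers(copied)
print(headers["Host"])  # movie.douban.com
```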
HTTP Response
A complete HTTP response consists of the following parts:
Status line
Response headers
Response data
Common status codes:
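The three parts can be separated from a raw response with a few string operations. The sketch below uses a short hand-written response, not real Douban output, and the helper name `split_response` is hypothetical. It assumes CRLF line endings and a text body.

```python
def split_response(raw):
    """Split a raw HTTP response into (status line, headers dict, body)."""
    # The first blank line separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Server: dae\r\n"
    "\r\n"
    "<html>...</html>"
)
status_line, headers, body = split_response(raw)
# The status line itself is: HTTP version, status code, reason phrase.
version, code, reason = status_line.split(" ", 2)
print(code, reason)  # 200 OK
```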
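| Status code | Description |
|---|---|
| 2xx | Success |
| 200 | OK: the request succeeded |
| 201 | Created: a new resource was created (POST) |
| 202 | Accepted: the request was accepted, but processing is not complete |
| 204 | No Content: success, but no content is returned |
| 3xx | Redirection |
| 301 | Moved Permanently: the resource has been assigned a new permanent URL |
| 302 | Found: the resource temporarily resides at a different URL |
| 304 | Not Modified: the document has not changed (GET) |
| 4xx | Client error |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden: the server refuses to fulfil the request |
| 404 | Not Found |
| 5xx | Server error |
| 500 | Internal Server Error |
| 501 | Not Implemented |
| 502 | Bad Gateway: the gateway received an invalid response from the upstream server |
| 503 | Service Unavailable: the server is temporarily unable to handle the request |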
For more status codes, see: HTTP status codes
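The first digit of a status code gives its class, which is usually all a crawler needs in order to decide what to do next. A minimal sketch: `status_class` is a hypothetical helper, and the reason phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to its class using the first digit."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "other")


for code in (200, 301, 403, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Forbidden".
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```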
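Common response headers:
Server: information about the web server software
Date: the server's current date and time
Last-Modified: when the requested document was last modified
Expires: when the requested document expires
Content-Length: length of the data in bytes
Content-Type: MIME type of the data
WWW-Authenticate: tells the client what authentication is required, such as a username and password
Below are the actual response headers for a Douban short-review page: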
Cache-Control: must-revalidate, no-cache, private
Connection: keep-alive
Content-Encoding: br
Content-Type: text/html; charset=utf-8
Date: Tue, 17 Dec 2019 06:13:57 GMT
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Keep-Alive: timeout=30
Pragma: no-cache
Server: dae
Transfer-Encoding: chunked
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
X-DAE-App: movie
X-DAE-Instance: default
X-Douban-Mobileapp: 0
X-Xss-Protection: 1; mode=block
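The `Date` and `Expires` values above are HTTP dates, which the standard library can parse. The sketch below compares the two values from this response. Since `Expires` lies in the past, the page is already stale on arrival, which agrees with the `no-cache` directives in the same response.

```python
from email.utils import parsedate_to_datetime

# Values copied from the response headers above.
date = parsedate_to_datetime("Tue, 17 Dec 2019 06:13:57 GMT")
expires = parsedate_to_datetime("Sun, 1 Jan 2006 01:00:00 GMT")

# An Expires time at or before the response Date means "already expired".
already_expired = expires <= date
print(already_expired)  # True
```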
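Setting Headers with Requests
When you make a request with Python's Requests library without setting headers, only the library's defaults are sent. These include a User-Agent of the form python-requests/<version>, which plainly identifies the client as a script. Set your own headers as follows: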
import requests

# Placeholder: the original does not give the URL; substitute the page you want to fetch.
url = 'https://movie.douban.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/41.0.2227.1 Safari/537.36',
    'Cache-Control': 'max-age=0',
    # Adjacent string literals are joined by Python into one cookie string.
    'Cookie': 'douban-fav-remind=1; _vwo_uuid_v2=DA9D8506AF55689A98FC4EC5458A1F005|9e3e2c5da18b4341ef7c7c5b1e6b'
              'c17d; __'
              'utmv=30149280.19413; douban-profile-remind=1; ll="118281"; __gads=ID=84d32737c7eb0e14:T=15645409'
              '28:S=ALNI_MYeYoLNcsUs74D0ASArxlCoDpjBIA; viewed="24872560"; bid=SdT44rmbqnQ; __utmz=223695111.15'
              '72102190.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __yadk_uid=FTuDNkSGj6'
              'E7mIoNRiLhR0HOeQGlFstY; dbcl2="194130217:uUZw2E9T6DY"; push_noty_num=0; push_doumail_num=0; __'
              'utmz=30149280.1572351033.43.22.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct='
              '/people/194130217/; ck=rIT_; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1574385397%2C%'
              '22https%'
              '3A%2F%2Fwww.douban.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.1363678983.1539603396.15'
              '72478349.1574385398.46; __utmb=30149280.0.10.1574385398; __utmc=30149280; __utma=223695111.15120'
              '84094.1572102190.1572478349.1574385398.6; __utmb=223695111.0.10.1574385398; __utmc=223695111; _'
              'pk_id.100001.4cf6=c746db049db29d28.1572102190.6.1574385406.1572478372.',
}
# Set the request timeout (in seconds) and the headers
r = requests.get(url, timeout=20, headers=headers)
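To check which headers Requests would actually send, a request can be prepared without being sent. The sketch below compares a request with no custom headers against one that overrides `User-Agent`. The names `custom_headers`, `default_req`, and `custom_req` and the URL are illustrative assumptions, and no network traffic occurs.

```python
import requests

custom_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/78.0.3904.108 Safari/537.36",
}

session = requests.Session()
url = "https://movie.douban.com/"  # placeholder URL; nothing is sent below

# prepare_request merges the session's default headers with the request's own.
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(
    requests.Request("GET", url, headers=custom_headers)
)

print(default_req.headers["User-Agent"])      # Requests' own default
print(custom_req.headers["User-Agent"])       # overridden by custom_headers
print(custom_req.headers["Accept-Encoding"])  # default kept where not overridden
```

Headers passed per request take precedence over the session defaults, so only the names you supply are replaced and the remaining defaults still go out.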