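Seven main methods
| Method | Description |
| --- | --- |
| requests.request() | Constructs a request; the base method underlying all of the methods below |
| requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
| requests.head() | Fetches the header information of an HTML page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
| requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |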
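Attributes of the Response object
| Attribute | Description |
| --- | --- |
| r.status_code | HTTP status code of the response; 200 means the request succeeded, 404 means the resource was not found |
| r.text | The response body as a string, i.e. the content of the page at the URL |
| r.encoding | Encoding of the response body, guessed from the HTTP headers |
| r.apparent_encoding | Encoding of the response body, inferred from the content itself (a fallback encoding) |
| r.content | The response body as bytes |
r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1.
r.text decodes the page content according to r.encoding.
r.apparent_encoding: the encoding inferred by analysing the page content; it can be treated as a fallback for r.encoding.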
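The effect of the ISO-8859-1 fallback can be reproduced without any network access. The sketch below uses only the standard library; the sample text and variable names are illustrative assumptions, and the bytes stand in for what r.content would hold.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text (made-up sample).
original = '中文网页'
raw = original.encode('utf-8')

# Decoding with the ISO-8859-1 fallback never fails, but it yields mojibake:
# every byte becomes one unrelated character.
wrong = raw.decode('iso-8859-1')

# Decoding with the encoding the content is actually in recovers the text.
right = raw.decode('utf-8')

print(wrong == original)
print(right == original)
```

This is roughly why the notes set r.encoding = r.apparent_encoding before reading r.text when the header gives no charset.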
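Exceptions raised by the Requests library and by Response
| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection error, e.g. a failed DNS lookup or a refused connection |
| requests.HTTPError | HTTP error |
| requests.URLRequired | A URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request to the URL timed out |
| r.raise_for_status() | Raises requests.HTTPError if the status code indicates an error (4xx or 5xx) |
r.raise_for_status() checks r.status_code internally, so no extra if statement is needed, and it fits naturally into try-except error handling.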
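Which status codes actually make r.raise_for_status() raise can be checked offline. The sketch below builds an empty requests.Response by hand and sets its status code; the helper name status_raises is an assumption made for illustration, and no request is sent.

```python
import requests

def status_raises(code):
    """Return True if raise_for_status() raises HTTPError for this status code."""
    r = requests.Response()   # an empty Response built by hand, no network involved
    r.status_code = code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

results = {code: status_raises(code) for code in (200, 204, 301, 404, 500)}
print(results)
```

Only the 4xx and 5xx codes raise; other codes such as 204 or 301 pass through silently, so a strict "must be 200" check still needs its own comparison.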
Generic code framework
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status code
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'An exception occurred'

if __name__ == '__main__':
    url = ''
    ...
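HTTP operations on resources
| Method | Description |
| --- | --- |
| GET | Requests the resource at the URL |
| HEAD | Requests the response headers for the resource at the URL, i.e. only the resource's header information |
| POST | Requests that new data be appended to the resource at the URL |
| PUT | Requests that a resource be stored at the URL, overwriting the resource already there |
| PATCH | Requests a partial update of the resource at the URL, i.e. changes part of its content |
| DELETE | Requests deletion of the resource stored at the URL |
Difference between PATCH and PUT
Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user changes UserName and nothing else.
- With PATCH, only the partial update for UserName is submitted to the URL.
- With PUT, all 20 fields must be submitted to the URL; any field that is not submitted is deleted.
The main benefit of PATCH: it saves network bandwidth.
The HTTP methods map one-to-one onto the functions of the Requests library.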
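The PUT versus PATCH difference in the UserInfo scenario can be modelled with plain dictionaries. This is only a sketch of the semantics on the server side, not of any real server; the function names and the three sample fields are assumptions.

```python
def apply_put(stored, submitted):
    """PUT semantics: the submitted representation replaces the stored one."""
    return dict(submitted)

def apply_patch(stored, submitted):
    """PATCH semantics: only the submitted fields change."""
    updated = dict(stored)
    updated.update(submitted)
    return updated

# Three fields stand in for the 20 fields of UserInfo.
user_info = {'UserID': 1, 'UserName': 'old', 'Email': 'a@example.com'}
change = {'UserName': 'new'}

after_patch = apply_patch(user_info, change)
after_put = apply_put(user_info, change)
print(after_patch)
print(after_put)
```

After the PATCH the other fields survive; after the PUT only UserName remains, which is why a PUT has to carry the full record.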
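The seven methods in detail
requests.request(method, url, **kwargs)
| Parameter | Description |
| --- | --- |
| method | Request method; one of seven, such as get/put/post |
| url | URL of the page to fetch |
| **kwargs | Parameters that control the request, 13 in total |
Parameters in detail
| **kwargs | Description |
| --- | --- |
| params | Dict or byte sequence, appended to the URL as query parameters |
| data | Dict, byte sequence, or file object, sent as the body of the Request |
| json | Data in JSON format, sent as the body of the Request |
| headers | Dict of custom HTTP headers |
| cookies | Dict or CookieJar; the cookies sent with the Request |
| auth | Tuple; enables HTTP authentication |
| files | Dict; files to upload |
| timeout | Timeout, in seconds |
| proxies | Dict; proxy servers to route the request through, optionally with login credentials |
| allow_redirects | True/False, default True; switch for following redirects |
| stream | True/False, default False; when False the body is downloaded immediately, when True the download is deferred until the content is read |
| verify | True/False, default True; switch for verifying the SSL certificate |
| cert | Path to a local SSL certificate |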
import requests
# params: appended to the URL as a query string
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)  # http://python123.io/ws?key1=value1&key2=value2
# data: a dict is sent as a form-encoded body
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('POST', 'http://python123.io/ws', data=kv)
# data: a string is sent as the raw body
body = 'body content'
r = requests.request('POST', 'http://python123.io/ws', data=body)
# json: the object is serialized to JSON and sent as the body
kv = {'key1': 'value1'}
r = requests.request('POST', 'http://python123.io/ws', json=kv)
# headers: custom HTTP headers
hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)
# files: submit a file to a URL
fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)
# proxies: route the request through proxy servers
pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
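What params, data and json do to a request can be inspected without sending anything, by building a requests.Request and calling prepare() on it. The sketch below reuses the URL and dictionaries from the examples above; the variable names get_req, form_req and json_req are assumptions.

```python
import requests

url = 'http://python123.io/ws'
kv = {'key1': 'value1', 'key2': 'value2'}

# params end up in the query string of the URL
get_req = requests.Request('GET', url, params=kv).prepare()
print(get_req.url)

# a dict passed as data becomes a form-encoded body
form_req = requests.Request('POST', url, data=kv).prepare()
print(form_req.headers['Content-Type'], form_req.body)

# json= serializes the object and sets a JSON content type
json_req = requests.Request('POST', url, json={'key1': 'value1'}).prepare()
print(json_req.headers['Content-Type'], json_req.body)
```

Nothing leaves the machine here: prepare() only computes the final URL, headers and body that a later send would use.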
requests.get(url, params=None, **kwargs)
| Parameter | Description |
| --- | --- |
| url | URL of the page to fetch |
| params | Extra parameters for the URL, as a dict or byte stream; optional |
| **kwargs | The 12 parameters that control the request |
import requests
# Send GET requests
r = requests.get('http://www.baidu.com/s', params={'kw': 'python'})
r = requests.get('http://www.baidu.com/s', params=dict(wd='python'))
# Download an image
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1531368811695&di=6c119a485d39d482d54fcf31856b6de1&imgtype=0&src=http%3A%2F%2Fimg.shouyoutan.com%2FUploads-s%2Fnews%2F2016-04-11%2F570b63dbc401b.jpg'
r = requests.get(url)
with open('demo.jpg', 'wb') as f:
    f.write(r.content)
# Download a large file
url = 'http://sqdownb.onlinedown.net/down/FastStoneCapture44264.zip'
filename = url.split('/')[-1]
r = requests.get(url, stream=True)
with open(filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)
# Request and response information
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWe'
                  'bKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
resp = requests.get('http://www.baidu.com/', headers=headers)
# Headers of the response
print(resp.headers)
# Headers of the request
print(resp.request.headers)
print(resp.text)     # as a string
print(resp.content)  # as bytes
print(resp.raw)      # the raw response object
print(resp.json())   # parsed JSON; raises an exception if the body is not valid JSON
# Redirects
r = requests.get('http://www.baidu.com/', allow_redirects=False)  # get the response before any redirect
r = requests.get('http://10.31.161.59:8888/admin', allow_redirects=True)  # get the response after following redirects
# Send cookies
url = 'http://10.31.161.59:8888'
cookies = dict(
    test='testing'
)
resp = requests.get(url, cookies=cookies)
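requests.head(url, **kwargs)
| Parameter | Description |
| --- | --- |
| url | URL of the page to fetch |
| **kwargs | The 12 parameters that control the request |
requests.post(url, data=None, json=None, **kwargs)
| Parameter | Description |
| --- | --- |
| url | URL of the page to update |
| data | Dict, byte sequence, or file; the body of the Request |
| json | Data in JSON format; the body of the Request |
| **kwargs | The 12 parameters that control the request |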
# Send a POST request
url = 'http://10.31.161.59:8888/user/register/'
r = requests.post(url, data=dict(username='python', password='123456'))
print(r.content)
# File upload
url = ''
files = {
    'file': open('somefile.txt', 'rb')
}
r = requests.post(url, files=files)
# Send JSON data
import json
url = ''
user = {
    'username': 'gavin'
}
requests.post(url, data=json.dumps(user))  # serialized by hand; requests sets no JSON Content-Type
requests.post(url, json=user)              # requests serializes the dict and sets Content-Type: application/json
# Use proxies
proxies = {
    'http': 'http://101.236.35.98:8866',
    'https': 'https://180.121.132.184:808'
}
r = requests.get('http://www.baidu.com', proxies=proxies)
print(r.headers)
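requests.put(url, data=None, **kwargs) (requests.patch takes the same parameters)
| Parameter | Description |
| --- | --- |
| url | URL of the page to update |
| data | Dict, byte sequence, or file; the body of the Request |
| **kwargs | The 12 parameters that control the request |
requests.delete(url, **kwargs)
| Parameter | Description |
| --- | --- |
| url | URL of the page to delete |
| **kwargs | The 12 parameters that control the request |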
Request and response information
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'
                  ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
resp = requests.get("http://10.31.161.59:8888", headers=headers)
# Headers of the response
print(resp.headers)
# Headers of the request
print(resp.request.headers)
# Content of the response
print(resp.text)     # as a string
print(resp.content)  # as bytes
print(resp.raw)      # the raw response object
print(resp.json())   # parsed JSON; raises an exception if the body is not valid JSON
Request timeouts
import requests
url = 'http://www.baidu.com'
try:
    r = requests.get(url=url, timeout=1)
except requests.Timeout:
    print("Request to {url} timed out.".format(url=url))
else:
    # r only exists if the request completed
    print(type(r.request))
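The timeout can also be given as a (connect, read) tuple, and the timeout exceptions can be told apart from other connection failures. The sketch below is one possible arrangement; the function name fetch_status and the default timeout values are assumptions, and the function is only defined here, not called, so nothing is sent.

```python
import requests

def fetch_status(url, connect_timeout=3.05, read_timeout=10):
    """Return the status code, or a short label describing the failure."""
    try:
        # timeout may be a single number or a (connect, read) tuple
        r = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.Timeout:
        return 'timeout'
    except requests.ConnectionError:
        return 'connection error'
    return r.status_code

# ConnectTimeout inherits from both Timeout and ConnectionError,
# so the order of the except clauses above decides how it is reported.
print(issubclass(requests.ConnectTimeout, requests.Timeout))
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))
print(issubclass(requests.ReadTimeout, requests.Timeout))
```

Catching requests.Timeout first means both connect and read timeouts are labelled as timeouts rather than as generic connection errors.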
Downloading a page and handling encoding problems
         decode                    encode
bytes ---------> str (Unicode) ---------> bytes
import requests
import codecs
# pip install requests

def fetch(url):
    """
    Fetch the content of the page at url.
    Parameters:
        url: address of the web page
    Returns:
        If the page is fetched successfully, its binary content (r.content) is returned. There are two ways to save it to a file: write the bytes directly in 'wb' mode, or first decode the bytes to str and then write them with an explicit encoding such as encoding='GBK'.
        If r.text were returned instead, set r.encoding = 'gbk' or r.encoding = r.apparent_encoding as in the generic code framework, then write the text straight to the file.
        None is returned if the status code is not 200.
    """
    r = requests.get(url)
    if r.status_code == 200:
        return r.content
    return None

def main():
    result = fetch("https://www.poxiao.com/")
    if result is not None:
        with open("poxiao.html", "wb") as f:
            f.write(result)
    return None
    # if result is not None:
    #     with codecs.open("poxiao.html", "w", encoding='GBK') as f:
    #         f.write(result.decode('GBK'))
    # return None

if __name__ == '__main__':
    main()
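The decode and encode steps in the diagram above can be tried on a small sample without downloading anything. This sketch uses only the standard library; the sample text is made up and stands in for the bytes of a GBK-encoded page.

```python
# Bytes as a GBK-encoded page might deliver them (made-up sample text).
gbk_bytes = '电影下载'.encode('gbk')

# decode: bytes -> str, using the encoding the bytes are actually in
text = gbk_bytes.decode('gbk')

# encode: str -> bytes, using whatever encoding the output file should have
utf8_bytes = text.encode('utf-8')

print(len(gbk_bytes), len(utf8_bytes))
```

The same text occupies a different number of bytes in each encoding, which is why bytes must always be decoded with the encoding they were written in.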
Saving data to a database
Official version
import pymysql
import pymysql.cursors
from bs4 import BeautifulSoup

with open('poxiao.html', 'rb') as f:
    # pip install lxml
    poxiao = BeautifulSoup(f, 'lxml')
# Find the div with class "content clear"
content_div = poxiao.find(name='div', class_='content clear')
# Find every li inside the div
li_list = content_div.find_all(name='li')
# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='spider',
                             password='123456',
                             database='spider',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
# Get the text and the hyperlink of each li
try:
    for li in li_list:
        href = li.a.get("href")
        title = li.get_text()
        print(href)
        print(title)
        with connection.cursor() as cursor:
            # Create a new record
            sql = "INSERT INTO `movie` (`title`, `link`) VALUES (%s, %s)"
            cursor.execute(sql, (title, href))
    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()
    with connection.cursor() as cursor:
        # Read a single record
        sql = "SELECT `title`, `link` FROM `movie` WHERE `movie_id`=%s"
        cursor.execute(sql, (1,))
        result = cursor.fetchone()
        print(result)
finally:
    connection.close()
Short version
import pymysql
from bs4 import BeautifulSoup

with open('poxiao.html', 'rb') as f:
    poxiao = BeautifulSoup(f, 'lxml')
content_div = poxiao.find(name='div', class_='content clear')
li_list = content_div.find_all(name='li')
# Keyword arguments are used because recent PyMySQL releases no longer accept positional ones
db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                     database='spider', charset='utf8')
cursor = db.cursor()
try:
    for li in li_list:
        href = li.a.get('href')
        title = li.get_text()
        # Building the SQL with string formatting invites SQL injection; pass parameters instead:
        # sql = 'insert into movie values(null, "%s", "%s")' % (title, href)
        sql = "INSERT INTO `movie` (`title`, `link`) VALUES (%s, %s)"
        cursor.execute(sql, (title, href))
    db.commit()
except Exception:
    db.rollback()
finally:
    db.close()
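The insert, commit and rollback pattern can be tried without a MySQL server. The sketch below deliberately swaps in the standard-library sqlite3 module with an in-memory database, so it is not the PyMySQL code above; the table layout mirrors the `movie` table only by assumption, and the rows are made-up sample data.

```python
import sqlite3

rows = [('Movie A', '/a.html'), ('Movie B', '/b.html')]  # made-up sample data

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT, link TEXT)')
try:
    # sqlite3 uses ? placeholders where PyMySQL uses %s;
    # in both cases the values are passed separately from the SQL text.
    conn.executemany('INSERT INTO movie (title, link) VALUES (?, ?)', rows)
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise

saved = conn.execute('SELECT title, link FROM movie ORDER BY movie_id').fetchall()
print(saved)
conn.close()
```

Passing the values as parameters, rather than formatting them into the SQL string, lets the driver handle quoting.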
requests sessions
import requests
s = requests.Session()
# Cookies stored on the session persist across requests
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
# Parameters passed to a single call are not persisted across requests
resp = s.get('http://httpbin.org/cookies', cookies={'name': 'xiaoming'})
print(resp.text)
resp = s.get('http://httpbin.org/cookies')
print(resp.text)
# A session can supply default data
session = requests.Session()
session.auth = ('dai', '123456')
# A session can also supply default data to the request methods, by setting attributes on the session object:
s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})
# both 'x-test' and 'x-test2' are sent
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
print(r.request.headers)
# Use the session in a with context manager so that it is closed automatically
with requests.Session() as session:
    session.get('http://www.baidu.com')
Prepared requests
from requests import Request, Session
session = Session()
url = 'http://www.baidu.com'
# Build a request object
req = Request('GET', url, headers=None)
# Get a prepared request that carries the session's state
prepped = session.prepare_request(req)
print(type(prepped))
# <class 'requests.models.PreparedRequest'>
# Send the request
with session.send(prepped, timeout=0.5) as response:
    print(response.status_code)
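A prepared request also makes it possible to see how session defaults are merged with per-request values before anything is sent. The sketch below reuses the header names and credentials from the session example; it only prepares the request and never calls send.

```python
from requests import Request, Session

s = Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# Per-request headers are merged with the session defaults at prepare time.
req = Request('GET', 'http://httpbin.org/headers', headers={'x-test2': 'true'})
prepped = s.prepare_request(req)

print(prepped.headers['x-test'], prepped.headers['x-test2'])
print(prepped.headers['Authorization'].startswith('Basic '))
s.close()
```

Both custom headers are present, and the session's auth tuple has been turned into a Basic Authorization header.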
Downloading an image with requests
import requests
url = "http://pic171.nipic.com/file/20180705/5053868_230805401034_2.jpg"
r = requests.get(url)
with open('demo.jpg', 'wb') as f:
    f.write(r.content)
Downloading a large file with requests
import requests
url = 'http://sqdownb.onlinedown.net/down/FastStoneCapture44264.zip'
filename = url.split('/')[-1]
# stream=True defers the download so the body can be read chunk by chunk
r = requests.get(url, stream=True)
with open(filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)
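The chunked write loop can be factored into a small helper that accepts any iterable of byte chunks, such as the one returned by r.iter_content(chunk_size=512). In the sketch below the helper name save_chunks is an assumption, and an in-memory buffer stands in for the network stream so the example runs offline.

```python
import io
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Stand-in for r.iter_content(chunk_size=512): read a buffer 512 bytes at a time.
source = io.BytesIO(b'x' * 1300)
fake_chunks = iter(lambda: source.read(512), b'')

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
total = save_chunks(fake_chunks, path)
print(total)
```

Because only one chunk is held in memory at a time, the file size does not affect memory use.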
Sending cookies with requests
import requests
url = 'http://10.31.161.59:8888/'
headers = {
    'User-Agent': 'afasdfasfasdf',
}
cookies = dict(
    sessionid='swkbut2xrt2c3rhhwubamp80oa2p7tzw',
    csrftoken='e6JLs6n2u4gPDrd8l3XVkUhhvCldxUvXhJ0SHJGiV1oUbnTwv12CR9GTvzX6l9EC'
)
resp = requests.get(url, headers=headers, cookies=cookies)
print(resp.text)
Uploading a file with requests
import requests
url = ''
files = {
    'file': open('somefile.txt', 'rb')
}
r = requests.post(url, files=files)
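What a files= upload looks like on the wire can be inspected by preparing the request instead of sending it. In the sketch below the URL http://example.com/upload is a placeholder assumption (the notes leave the real URL empty), and the file content is passed in memory as a (filename, bytes) tuple so that no file on disk is needed.

```python
import requests

# Placeholder URL; nothing is sent because the request is only prepared.
files = {'file': ('somefile.txt', b'hello upload')}
prepped = requests.Request('POST', 'http://example.com/upload', files=files).prepare()

# requests encodes the upload as multipart/form-data with a generated boundary.
print(prepped.headers['Content-Type'].startswith('multipart/form-data; boundary='))
print(b'filename="somefile.txt"' in prepped.body)
```

The field name ('file') and the filename both travel inside the multipart body, which is what the server reads to store the upload.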