Python Web Scraping 4: POST Submission and Cookie Handling

POST Submission

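So far every page has been fetched with GET. Some pages can only be reached after logging in, which means sending parameters along with the request. GET can carry parameters too, for example https://www.some-web-site.com?param1=username&param2=age, but those parameters end up in the URL itself. For something security-sensitive like a login, submitting a form is the better choice, and that calls for POST. The requests library makes this very simple: just pass the data argument.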
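To make the GET/POST difference concrete, here is a small offline sketch. It uses the standard library's urllib instead of requests (an assumption, made so the snippet runs without a network connection) and only builds the request objects; nothing is sent. It shows where the same parameters end up: in the URL for GET, in the request body for POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'param1': 'username', 'param2': 'age'}
encoded = urlencode(params)

# GET: the parameters are appended to the URL as a query string
get_request = Request('https://www.some-web-site.com?' + encoded)

# POST: the URL stays clean and the same string travels in the request body
post_request = Request('https://www.some-web-site.com',
                       data=encoded.encode('ascii'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

With requests, the equivalent split is params= for the query string and data= for the body.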

Form submission example: the page used here contains the following form.

<form action="processing.php" method="post">
First name: <input name="firstname" type="text"><br>
Last name: <input name="lastname" type="text"><br>
<input id="submit" type="submit" value="Submit">
</form>

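The method attribute shows that the form is submitted with POST. The action attribute shows that, once submitted, the form is actually handed to processing.php for processing. So that is the page we should request, passing the form parameters to it.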

When passing parameters through the data argument of requests, just make sure the keys match the name attributes of the input tags, here firstname and lastname.

import requests

url = 'https://pythonscraping.com/pages/files/processing.php'
params = {'firstname': 'Sun', 'lastname': 'Haiyu'}

r = requests.post(url, data=params, allow_redirects=False)
print(r.text)

Hello there, Sun Haiyu!
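Reading the name attributes by eye works for a small form. As a rough sketch of doing the same thing in code, the snippet below uses html.parser from the standard library to pull the action, method, and field names out of the form shown earlier. FormFieldParser is a hypothetical helper written for this illustration, and the form is pasted in as a string so the example runs offline.

```python
from html.parser import HTMLParser

FORM_HTML = '''
<form action="processing.php" method="post">
First name: <input name="firstname" type="text"><br>
Last name: <input name="lastname" type="text"><br>
<input id="submit" type="submit" value="Submit">
</form>
'''


class FormFieldParser(HTMLParser):
    """Collects the form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.action = attrs.get('action')
            self.method = attrs.get('method')
        elif tag == 'input' and 'name' in attrs:
            # Inputs without a name (like the submit button here) are not submitted
            self.field_names.append(attrs['name'])


parser = FormFieldParser()
parser.feed(FORM_HTML)
print(parser.action, parser.method, parser.field_names)
```

The collected field names are exactly the keys that the data dictionary needs.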

Uploading Files

File uploads are rarely needed in a scraper, but the basic usage is worth knowing. The files argument of requests makes it easy.

This page accepts image uploads. Again, it is a form.

<form action="processing2.php" enctype="multipart/form-data" method="post">
  Submit a jpg, png, or gif: <input name="uploadFile" type="file"><br>
  <input type="submit" value="Upload File">
</form>

As in the previous example, the page we actually need to request is processing2.php, and the method is still POST. The name of the input is uploadFile.

import requests

url = 'https://pythonscraping.com/pages/files/processing2.php'
# Open the image in binary mode; the with block closes it after the upload
with open('abc.PNG', 'rb') as f:
    r = requests.post(url, files={'uploadFile': f})
print(r.text)

Sorry, there was an error uploading your file.

The code itself is fine: uploading through a browser gives the same result. The URL provided in the book is probably at fault...
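For a sense of what the files argument puts on the wire, the sketch below hand-builds a multipart/form-data body using only the standard library. encode_multipart is a hypothetical helper, not part of requests, and the sample bytes stand in for a real image. The body that requests actually produces may differ in details such as the boundary string.

```python
import uuid


def encode_multipart(field_name, filename, content,
                     content_type='application/octet-stream'):
    """Encode a single file as a multipart/form-data body (simplified sketch)."""
    boundary = uuid.uuid4().hex
    head = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n'
        '\r\n'
    ).encode('utf-8')
    # The closing delimiter is the boundary followed by two extra dashes
    tail = f'\r\n--{boundary}--\r\n'.encode('utf-8')
    headers = {'Content-Type': f'multipart/form-data; boundary={boundary}'}
    return head + content + tail, headers


body, headers = encode_multipart('uploadFile', 'abc.PNG',
                                 b'fake image bytes', 'image/png')
print(headers['Content-Type'])
print(body.decode('latin-1'))
```

The field name uploadFile appears inside the body, which is why it has to match the name of the file input in the form.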

Handling Logins and Cookies

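Cookies carry the state that tracks whether a user is logged in. Once a site has authenticated our login, it stores a cookie in the browser containing a server-generated token, the lifetime of the login, and state-tracking information. When that lifetime runs out, the login state is cleared and pages that require a login can no longer be accessed. So the approach is still to log in first and then obtain the cookie.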

Here is a login page:

<form action="welcome.php" method="post">
Username (use anything!): <input name="username" type="text"><br>
Password (try "password"): <input name="password" type="password"><br>
<input type="submit" value="Login">
</form>

As the form shows, logging in leads to welcome.php. Enter a username and a password (any username works; the password must be password).

After a successful login, the profile page can be fetched with GET.

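Note that calling requests.get('https://pythonscraping.com/pages/cookies/profile.php') directly does not work: the server has no way of knowing that we are "already logged in", so it refuses to return the content. If we pass along the cookie obtained from the successful login, it tells the server that we are logged in and would like to see profile.php, and the server accepts the token and returns the page.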

import requests
url = 'https://pythonscraping.com/pages/cookies/welcome.php'

params = {'username': 'Ryan', 'password': 'password'}

r = requests.post(url, params)

print(r.cookies.get_dict())
res = requests.get('https://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
print(res.text)

{'loggedin': '1', 'username': 'Ryan'}
Hey Ryan! Looks like you're still logged into the site!
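What travels between the two requests is just a pair of HTTP headers. The sketch below parses two Set-Cookie values with http.cookies from the standard library and builds the Cookie header a client would send back. The header strings are hypothetical, chosen to match the dictionary printed above; the site's real headers may carry different attributes.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values from the login response
set_cookie_headers = ['loggedin=1; Path=/', 'username=Ryan; Path=/']

jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)

# Attributes such as Path stay with the client; only name=value pairs are sent back
cookie_header = '; '.join(f'{name}={morsel.value}' for name, morsel in jar.items())
cookies = {name: morsel.value for name, morsel in jar.items()}

print(cookies)
print('Cookie:', cookie_header)
```

Passing cookies=r.cookies to requests.get takes care of this bookkeeping automatically.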

Session

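For simple access this approach works fine. A more complex site, though, often modifies cookies quietly behind the scenes, and that is where the requests Session object comes in. It keeps track of session information across requests, such as cookies, headers, and even information about the HTTP protocol in use.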

import requests

session = requests.Session()

params = {'username':'admin', 'password': 'password'}
s = session.post('https://pythonscraping.com/pages/cookies/welcome.php', params)
print(s.cookies.get_dict())
print('Go to profile page')
# Unlike the earlier example, no cookie is passed explicitly here
s = session.get('https://pythonscraping.com/pages/cookies/profile.php')
print(s.text)

{'loggedin': '1', 'username': 'admin'}
Go to profile page
Hey admin! Looks like you're still logged into the site!
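The bookkeeping a session does can be imitated offline with http.cookiejar from the standard library, the module the requests cookie jar builds on. In the sketch below, FakeResponse is a hypothetical stand-in for a real HTTP response and the Set-Cookie values are assumed; no request is actually sent. The jar stores the cookies from the "login" response and then attaches them to the next request on its own.

```python
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response carrying Set-Cookie headers."""

    def __init__(self, set_cookie_values):
        self._headers = Message()
        for value in set_cookie_values:
            self._headers.add_header('Set-Cookie', value)

    def info(self):
        # CookieJar.extract_cookies reads the response headers through info()
        return self._headers


jar = CookieJar()

# "Log in": the jar records the cookies that came back with the response
login = Request('https://pythonscraping.com/pages/cookies/welcome.php',
                data=b'username=admin&password=password')
jar.extract_cookies(FakeResponse(['loggedin=1', 'username=admin']), login)

# Next request: the jar adds the matching cookies without being told to
profile = Request('https://pythonscraping.com/pages/cookies/profile.php')
jar.add_cookie_header(profile)
cookie_header = profile.get_header('Cookie')
print(cookie_header)
```

A Session performs both steps around every request, which is why the second call in the example above needs no cookies argument.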

Other Login Authentication Methods

There are other authentication methods as well, such as HTTP basic access authentication, which uses the auth argument of requests.

This page requires a username and password to log in:

import requests

url = 'https://pythonscraping.com/pages/auth/login.php'

res = requests.get(url, auth=('sun', '123456'))
print(res.text)

<p>Hello sun.</p><p>You entered 123456 as your password.</p>

Pass auth a two-element tuple holding the username and the password, and the login succeeds.
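Under basic access authentication, the tuple ends up as a single Authorization header: the word Basic followed by the Base64 encoding of username:password. The sketch below builds that header by hand. basic_auth_header is a hypothetical helper for illustration, and the choice of UTF-8 for the credentials is an assumption.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP basic authentication."""
    credentials = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


header = basic_auth_header('sun', '123456')
print(header)
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password. Basic authentication should therefore only be used over HTTPS.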

by @sunhaiyu

2017.7.17
