[Week 1] Rules for Web Crawlers

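Unit 1: Getting Started with the Requests Library

Installing Requests

Open a command prompt and install with pip: pip install requests

The 7 main methods of the Requests library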

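Method	Description
`requests.request()`	Constructs a request; the foundation for all the methods below
`requests.get()`	Main method for fetching an HTML page; corresponds to HTTP GET
`requests.head()`	Fetches the header information of an HTML page; corresponds to HTTP HEAD
`requests.post()`	Submits a POST request to an HTML page; corresponds to HTTP POST
`requests.put()`	Submits a PUT request to an HTML page; corresponds to HTTP PUT
`requests.patch()`	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
`requests.delete()`	Submits a delete request to an HTML page; corresponds to HTTP DELETE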

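The get() method of the Requests library

r = requests.get(url)

This constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resource.
The full signature is

requests.get(url, params = None, **kwargs)

url: URL of the page to fetch
params: extra parameters appended to the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the access

Internally, get() is implemented by calling requests.request().

Two important objects: Response and Request
The Response object holds the content returned to the crawler.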

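Attributes of the Response object

Attribute	Description
`r.status_code`	Status code returned for the HTTP request; 200 means success, while codes such as 404 mean failure
`r.text`	The HTTP response body as a string, i.e. the content of the page at the URL
`r.encoding`	Encoding of the response body as guessed from the HTTP header
`r.apparent_encoding`	Encoding of the response body as inferred from the content itself (a fallback encoding)
`r.content`	The HTTP response body in binary (bytes) form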
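
The relationship between `r.content`, `r.encoding`, and `r.text` can be illustrated without any network access. The sketch below is only an analogy using plain bytes and strings: `content` stands in for `r.content`, and the two decode calls stand in for reading `r.text` under a wrongly guessed encoding and under the encoding inferred from the content. The sample text and variable names are assumptions made for illustration.

```python
# UTF-8 bytes, playing the role of r.content for a Chinese page.
content = "网络爬虫".encode("utf-8")

# Decoding with a wrong guess (ISO-8859-1 is the fallback Requests uses when
# a text response's header declares no charset) yields unreadable characters.
wrong_text = content.decode("ISO-8859-1")

# Decoding with the encoding inferred from the content itself
# (the role of r.apparent_encoding) recovers the original text.
right_text = content.decode("utf-8")

print(repr(wrong_text))
print(right_text)
```

This is why the code in these notes sets `r.encoding = r.apparent_encoding` before reading `r.text`.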

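A general code framework for crawling web pages

Network connections are risky, so exception handling matters.
Understanding Requests exceptions

Exception	Description
`requests.ConnectionError`	Network connection error, such as a failed DNS lookup or a refused connection
`requests.HTTPError`	HTTP error
`requests.URLRequired`	A URL is missing
`requests.TooManyRedirects`	The maximum number of redirects was exceeded
`requests.ConnectTimeout`	Timed out while connecting to the remote server
`requests.Timeout`	The request to the URL timed out

Understanding the Response exception method

Method	Description
`r.raise_for_status()`	Raises requests.HTTPError if the status is not 200 (specifically, for 4xx and 5xx status codes)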
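
To see `raise_for_status()` in action without contacting a server, the sketch below builds an empty `requests.Response` by hand and sets its `status_code`. Setting the status manually is purely an illustration device, and the helper name `check` is hypothetical; in real use the Response comes from `requests.get()` or a similar call.

```python
import requests

def check(status_code):
    """Return 'ok' if raise_for_status() passes, else the exception class name."""
    r = requests.Response()        # empty Response object, no network involved
    r.status_code = status_code    # set by hand, for illustration only
    try:
        r.raise_for_status()
        return "ok"
    except requests.HTTPError as err:
        return type(err).__name__

for code in (200, 404, 500):
    print(code, check(code))
```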

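The general code framework

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()                # raise HTTPError on a 4xx/5xx status
        r.encoding = r.apparent_encoding    # decode with the encoding inferred from the content
        return r.text
    except requests.RequestException:       # base class of the Requests exceptions
        return "An exception occurred"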

Using the framework, for example:

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
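
A possible variation on the framework distinguishes between the exception classes listed earlier instead of returning one generic message. This is a sketch, not the course's own code: the function name `fetch_text` and the returned `(text, error)` pair are assumptions chosen for illustration.

```python
import requests

def fetch_text(url, timeout=30):
    """Return (text, error); exactly one of the two is None."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text, None
    except requests.Timeout:
        # Listed first so that ConnectTimeout is reported as a timeout.
        return None, "timeout"
    except requests.ConnectionError:
        return None, "connection error"
    except requests.HTTPError as err:
        return None, f"HTTP error {err.response.status_code}"
    except requests.RequestException as err:
        # Any other Requests failure, e.g. a malformed URL.
        return None, f"request failed: {type(err).__name__}"

# A malformed URL fails before any network traffic is sent.
print(fetch_text("not-a-url"))
```

Catching the specific classes before `requests.RequestException` matters because Python uses the first matching `except` clause.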

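The HTTP protocol and the Requests methods

Requests has 7 main methods.
Understanding the HTTP protocol:

HTTP stands for Hypertext Transfer Protocol.
It is a stateless application-layer protocol based on a request/response model.
It uses URLs to locate network resources, in the form http://host[:port][path], where host is a valid Internet host name or IP address; port is the port number, 80 by default; and path is the path of the requested resource.
URL: the Internet path for accessing a resource over HTTP; one URL corresponds to one data resource.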
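
The host, port, and path parts of a URL can be pulled apart with the standard library. The sketch below uses `urllib.parse.urlsplit`; the helper name `describe` and the example.com URLs are illustrative assumptions, and the fallback to port 80 applies to plain `http` URLs only.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return (host, port, path) for an http URL."""
    parts = urlsplit(url)
    # urlsplit reports port as None when the URL does not state one.
    port = parts.port if parts.port is not None else 80
    return parts.hostname, port, parts.path

print(describe("http://www.example.com:8080/docs/index.html"))
print(describe("http://www.example.com/docs/index.html"))
```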

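HTTP operations on resources

Method	Description
GET	Requests the resource at the URL
HEAD	Requests the response headers for the resource at the URL, i.e. only the resource's header information
POST	Requests that new data be appended to the resource at the URL
PUT	Requests that a resource be stored at the URL, overwriting the resource already there
PATCH	Requests a partial update of the resource at the URL, i.e. a change to part of its content
DELETE	Requests deletion of the resource stored at the URL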

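Each operation is independent and stateless.

Understanding the difference between PATCH and PUT:
Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user has changed only UserName.

With PATCH, only a partial update containing UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.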
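
The difference can be mimicked with plain dictionaries. The sketch below is a toy model of how a server might apply the two requests, not real HTTP handling; the function names and the three-field record (standing in for the 20-field UserInfo) are assumptions made for illustration.

```python
def apply_put(stored, submitted):
    """PUT: the submitted representation replaces the stored one entirely."""
    return dict(submitted)

def apply_patch(stored, submitted):
    """PATCH: only the submitted fields change; all others are kept."""
    updated = dict(stored)
    updated.update(submitted)
    return updated

user_info = {"UserID": 1, "UserName": "alice", "Email": "alice@example.com"}
change = {"UserName": "bob"}

print(apply_patch(user_info, change))   # UserID and Email survive
print(apply_put(user_info, change))     # only UserName remains
```

The practical benefit of PATCH is bandwidth: only the changed field travels over the network.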

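These operations correspond one-to-one with the Requests methods.

The head() method

Retrieves summary information about a network resource using very little traffic.

The post() method

POSTing a dictionary to a URL automatically encodes it as form data.
POSTing a string automatically encodes it as data (the raw request body).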
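
One way to observe this encoding without sending anything is to build a `requests.Request` and call `prepare()`, which produces the request exactly as it would be sent. The URL below is a placeholder that is never contacted, and the keys and values are illustrative assumptions.

```python
import requests

url = "http://www.example.com/post"   # placeholder, never contacted

# A dictionary is encoded as form fields.
form_req = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()

# A string is sent as the raw body, with no form Content-Type added.
raw_req = requests.Request("POST", url, data="ABC").prepare()

print(form_req.headers.get("Content-Type"), form_req.body)
print(raw_req.headers.get("Content-Type"), raw_req.body)
```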

The put() method

Similar to post(), except that the original data is overwritten.

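The main Requests methods in detail

The 7 main methods

requests.request(method, url, **kwargs)

method: the request method, one of 7 such as get, put, and post
**kwargs: parameters that control the access, 13 in all:
params: dictionary or byte sequence, added to the URL as parameters
data: dictionary, byte sequence, or file object, sent as the content of the Request
json: data in JSON format, sent as the content of the Request
headers: dictionary of custom HTTP headers
cookies: dictionary or CookieJar, the cookies in the Request
auth: tuple, enables HTTP authentication
files: dictionary, for transferring files
timeout: timeout, in seconds
proxies: dictionary, sets proxy servers for the access; login credentials can be included
allow_redirects: True/False, default True; redirect switch
stream: True/False, default False; when False, the response content is downloaded immediately
verify: True/False, default True; SSL certificate verification switch
cert: path to a local SSL certificate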
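
A few of these parameters (`params`, `headers`, `json`) can be inspected offline by preparing a request instead of sending it. This is a sketch under assumptions: the URL is a placeholder that is never contacted, and the parameter values are made up for illustration.

```python
import json
import requests

req = requests.Request(
    "POST",
    "http://www.example.com/s",               # placeholder, never contacted
    params={"wd": "Python"},                  # appended to the URL
    headers={"User-Agent": "Mozilla/5.0"},    # custom HTTP header
    json={"key1": "value1"},                  # serialized as the request body
).prepare()

print(req.url)
print(req.headers["User-Agent"])
print(req.headers["Content-Type"])
print(json.loads(req.body))
```

The remaining parameters, such as `timeout`, `proxies`, and `verify`, only take effect when the request is actually sent.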

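requests.get(url, params = None, **kwargs)

params: extra parameters appended to the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the access

requests.head(url, **kwargs)

**kwargs: 13 parameters that control the access

requests.post(url, data = None, json = None, **kwargs)

data: dictionary, byte sequence, or file; the content of the Request
json: data in JSON format; the content of the Request
**kwargs: 11 parameters that control the access

requests.put(url, data = None, **kwargs)

data: dictionary, byte sequence, or file; the content of the Request
**kwargs: 12 parameters that control the access

requests.patch(url, data = None, **kwargs)

data: dictionary, byte sequence, or file; the content of the Request
**kwargs: 12 parameters that control the access

requests.delete(url, **kwargs)

**kwargs: 13 parameters that control the access

Because of network security restrictions, get() is the method used most often.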

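Unit 2: Web Crawlers and Their "Code of Honor"

Problems caused by web crawlers

Scale of web crawlers:

Crawling pages, playing with pages: small scale, small data volume, speed not critical; Requests library
Crawling a site or a series of sites: medium scale, larger data volume, speed matters; Scrapy library
Crawling the whole web: large scale, search engines, speed is crucial; custom development

"Harassment" by web crawlers

Depending on how well and for what purpose they are written, web crawlers can impose a huge resource overhead on web servers.

Legal risks of web crawlers

Data on a server has an owner.
Profiting from data obtained by a web crawler carries legal risk.

Privacy leaks through web crawlers

Web crawlers may be able to get past simple access controls and obtain protected data, leaking personal privacy.

Restricting web crawlers

Source review: restrict by checking the User-Agent

Respond only to requests from browsers or friendly crawlers.

Public notice: the Robots protocol

Tell all crawlers the site's crawling policy and ask crawlers to comply.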

The Robots protocol

Robots Exclusion Standard
Purpose: tells web crawlers which pages may be crawled and which may not.
Form: a robots.txt file in the site's root directory
Example: JD's Robots protocol: https://www.jd.com/robots.txt

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /

Basic syntax of the Robots protocol

# starts a comment, * stands for all, / stands for the root directory
User-agent: *
Disallow: /
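
Python's standard library can evaluate such rules. The sketch below feeds a small, made-up rule set to `urllib.robotparser.RobotFileParser` from memory, so nothing is downloaded; the crawler names `BadSpider` and `GoodBot`, the `/private/` path, and the example.com URLs are assumptions for illustration. The rules deliberately avoid `*` wildcards inside paths, since the standard-library parser matches paths by prefix.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /private/

User-agent: BadSpider
Disallow: /
"""

def allowed(user_agent, url):
    """Return True if the rules above let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())   # parse rules from memory, no download
    return parser.can_fetch(user_agent, url)

print(allowed("GoodBot", "http://www.example.com/index.html"))
print(allowed("GoodBot", "http://www.example.com/private/data.html"))
print(allowed("BadSpider", "http://www.example.com/index.html"))
```

For a live site, the same parser can load the file with `set_url()` followed by `read()` before `can_fetch()` is called.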

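How to comply with the Robots protocol

Using the Robots protocol

Web crawlers: identify the robots.txt file automatically or manually, then crawl the content.
Binding force: the Robots protocol is advisory rather than binding, but ignoring it may carry legal risk.
Human-like access behavior need not consult the Robots protocol.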

Unit 3: Requests Web Crawling in Practice

Example 1: Crawling a JD product page

Product to crawl: https://item.jd.com/100002716279.html

>>> import requests
>>> r = requests.get("https://item.jd.com/100002716279.html")
>>> r.status_code
200
>>> r.encoding
'gbk'
>>> r.text[:1000]
'<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n    <!-- shouji -->\n    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />\n    <title>【AppleiPad mini】Apple iPad mini 5 2019年新款平板電腦 7.9英寸（64G WLAN版/A12芯片/MUQW2CH/A）深空灰色【行情 報價 價格 評測】-京東</title>\n    <meta name="keywords" content="AppleiPad mini,AppleiPad mini,AppleiPad mini報價,AppleiPad mini報價"/>\n    <meta name="description" content="【AppleiPad mini】京東JD.COM提供AppleiPad mini正品行貨，并包括AppleiPad mini網購指南，以及AppleiPad mini圖片、iPad mini參數、iPad mini評論、iPad mini心得、iPad mini技巧等信息，網購AppleiPad mini上京東,放心又輕松" />\n    <meta name="format-detection" content="telephone=no">\n    <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/100002716279.html">\n    <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/100002716279.html">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n    <link rel="canonical" />\n        <link rel="dns-prefetch" href="http://mi'

This shows the information returned for the page. The complete code is:

import requests
url = "https://item.jd.com/100002716279.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

Example 2: Crawling an Amazon product page

Product to crawl: https://www.amazon.cn/dp/B005T63BEM/ref=lp_1559274071_1_1?s=electronics&ie=UTF8&qid=1569760137&sr=1-1

>>> import requests
>>> r = requests.get("https://www.amazon.cn/dp/B005T63BEM/ref=lp_1559274071_1_1?s=electronics&ie=UTF8&qid=1569760137&sr=1-1")
>>> r.status_code
200
>>> r.encoding
'UTF-8'
>>> r.text[:1000]
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n  \n\n\n\n\n\n\n\n    \n\n    \n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        \n            \n            \n        \n\n\n\n\n\n \n \n\n\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n\n\n\n\n    <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">\n    <head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8">\n\n    \n\n\n\n    <link rel="dns-prefetch" >\n\n\n\n\n  \n\n\n\n\n    \n\n\n\n\n\n\n    \n    \n\n\n\n\n\n\n\n  \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n    \n\n\n    \n    \n\n\n\n\n\n\n\n\n\n\n\n\n<script type="text/javascript">\nvar iUrl = "https://images-cn.ssl-images-amazon.com/images/I/415tpDfFbTL._SX300_QL70_.jpg";\n(function(){var i=new Image; i.src = iUrl;})();\n</script>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n\n   \n    \n\n\n\n\n\n\n\n<!--  -->\n<link rel="stylesheet" 
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

The User-Agent is 'python-requests/2.22.0'. If the request is rejected, the User-Agent header can be changed to that of a browser.

>>> kv = {'User-Agent':'Mozilla/5.0'}
>>> url = 'https://www.amazon.cn/dp/B005T63BEM/ref=lp_1559274071_1_1?s=electronics&ie=UTF8&qid=1569760137&sr=1-1'
>>> r = requests.get(url, headers = kv)
>>> r.status_code
200
>>> r.request.headers
{'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n  \n\n\n\n\n\n\n\n    \n\n    \n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n\n\n\n\n    <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">\n    <head>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>\n<script type="text/javascript">\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\nvar ue_hob=+new Date();\nvar ue_id=\'E5W4931DMCTPC168HBBY\',\nue_csm = window,\nue_err_chan = \'jserr-rw\',\nue = {};\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);\n\nue.stub(ue,"log");ue.stub(ue,"onunload");ue.stub(ue,"onflu'

Complete code

import requests
url = 'https://www.amazon.cn/dp/B005T63BEM/ref=lp_1559274071_1_1?s=electronics&ie=UTF8&qid=1569760137&sr=1-1'
try:
    kv = {'User-Agent':'Mozilla/5.0'}    # present the crawler as a browser
    r = requests.get(url, headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print('Crawl failed')

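Example 3: Submitting search keywords to Baidu/360

Submit a keyword automatically and get the results.
Search engine keyword submission interfaces
Baidu keyword interface: http://www.baidu.com/s?wd=keyword
360 keyword interface: http://www.so.com/s?q=keyword
Use params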

>>> import requests
>>> kv = {'wd':'Python'}
>>> r = requests.get('http://www.baidu.com/s', params = kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> len(r.text)
358258

Complete code

import requests
keyword = 'Python'
try:
    kv = {'wd':keyword}
    r = requests.get('http://www.baidu.com/s', params = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print('Crawl failed')

The approach for 360 is similar; the code is

import requests
keyword = 'Python'
try:
    kv = {'q':keyword}
    r = requests.get('http://www.so.com/s', params = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print('Crawl failed')

Example 4: Crawling and saving a web image

Format of a web image link:
http://www.example.com/picture.jpg
Image URL: https://www.nationalgeographic.com/content/dam/expeditions/landing-pages/North-America/hero-national-parks2.adapt.1900.1.jpg
Save it as abc.jpg on drive C

>>> import requests
>>> path = 'C:/abc.jpg'
>>> url = 'https://www.nationalgeographic.com/content/dam/expeditions/landing-pages/North-America/hero-national-parks2.adapt.1900.1.jpg'
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, 'wb') as f:
    f.write(r.content)

    
300340

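The complete code is

import requests
import os
url = 'https://www.nationalgeographic.com/content/dam/expeditions/landing-pages/North-America/hero-national-parks2.adapt.1900.1.jpg'
root = "C://"
path = root + url.split('/')[-1]    # keep the image's original file name
try:
    if not os.path.exists(root):    # create the root directory if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):    # fetch the file only if it does not exist yet
        r = requests.get(url)
        r.raise_for_status()        # do not save an error page as an image
        with open(path, 'wb') as f: # the with block closes the file automatically
            f.write(r.content)
        print('File saved')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print('Crawl failed')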

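Think about the problems that may arise and handle the corresponding exceptions.
With small changes, the code can fetch other kinds of resources.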
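
One problem worth anticipating is a URL that carries a query string or ends in a slash, in which case `url.split('/')[-1]` does not yield a clean file name. The sketch below shows one possible way to derive the name with the standard library; the helper name `filename_from_url` and the fallback name `download.bin` are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, or default if there is none."""
    path = urlsplit(url).path                  # drops ?query and #fragment
    name = posixpath.basename(unquote(path))   # last segment, percent-decoded
    return name or default

print(filename_from_url("http://www.example.com/picture.jpg"))
print(filename_from_url("http://www.example.com/img/picture.jpg?size=large"))
print(filename_from_url("http://www.example.com/"))
```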

Example 5: Automatic lookup of an IP address's location

After an IP address is submitted on the ip138 site, the link becomes:
http://m.ip138.com/ip.asp?ip=ipaddress

>>> import requests
>>> url = 'http://m.ip138.com/ip.asp?ip='
>>> r = requests.get(url + '202.204.80.112')
>>> r.status_code
200
>>> r.text[-500:]
'value="查詢" class="form-btn" />\r\n\t\t\t\t\t</form>\r\n\t\t\t\t</div>\r\n\t\t\t\t<div class="query-hd">ip138.com IP查詢(搜索IP地址的地理位置)</div>\r\n\t\t\t\t<h1 class="query">您查詢的IP：202.204.80.112</h1><p class="result">本站主數據：北京市海淀區 北京理工大學 教育網</p><p class="result">參考數據一：北京市 北京理工大學</p>\r\n\r\n\t\t\t</div>\r\n\t\t</div>\r\n\r\n\t\t<div class="footer">\r\n\t\t\t<a  rel="nofollow" target="_blank">滬ICP備10013467號-1</a>\r\n\t\t</div>\r\n\t</div>\r\n\r\n\t<script type="text/javascript" src="/script/common.js"></script></body>\r\n</html>\r\n'

Complete code

import requests
url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except requests.RequestException:
    print('Crawl failed')

Many sites handle user interaction by submitting links; once the form of the submitted link is known, Python can simulate the submission.
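
Building the submitted link can be separated from sending it. The sketch below validates the address with the standard `ipaddress` module and then assembles the query URL in the form quoted in this example; the function name `build_lookup_url` is hypothetical, and nothing is sent over the network.

```python
import ipaddress
from urllib.parse import urlencode

BASE_URL = "http://m.ip138.com/ip.asp"   # interface quoted in this example

def build_lookup_url(ip):
    """Return the lookup link for ip, or raise ValueError if ip is malformed."""
    ipaddress.ip_address(ip)             # validation only; raises ValueError
    return BASE_URL + "?" + urlencode({"ip": ip})

print(build_lookup_url("202.204.80.112"))
```

The resulting string can then be passed to `requests.get()` inside the same try/except framework used throughout these notes.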
