大師兄's Python Study Notes (20): Web Scraping (Part 1)


I. About Crawlers (Spiders)

1. Prerequisites beyond Python

1) Web front-end knowledge

  • html
  • css
  • javascript
  • ajax

2) Networking knowledge

  • url
  • HTTP protocol

3) Content processing

  • re
  • xpath
  • xml

2. What is a crawler?

  • A crawler is a program or script that simulates human behavior to automatically collect data and information from websites on the Internet according to a set of rules.

3. Types of crawlers

  • General-purpose crawlers: mainly download pages from the Internet to local storage, building a mirror backup of Internet content.
  • Focused crawlers: filter the content while crawling, trying to fetch only pages relevant to a specific need.
4. Basic steps of a crawler

download the page >> extract information >> save the data >> follow links to other pages and repeat
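
These steps can be sketched as a minimal crawl loop; `download`, `extract`, and `save` below are hypothetical placeholders, to be filled in with the techniques covered in the rest of this note:

```python
from collections import deque

def crawl(seed_url, download, extract, save, max_pages=10):
    """Minimal crawl loop: download >> extract >> save >> follow links.

    download(url) -> page text, extract(page) -> (records, links),
    save(record) -> stores one record. All three are caller-supplied
    placeholders.
    """
    queue = deque([seed_url])  # URLs waiting to be fetched
    seen = {seed_url}          # visited set, avoids re-downloading pages
    while queue and max_pages > 0:
        url = queue.popleft()
        page = download(url)            # step 1: download the page
        records, links = extract(page)  # step 2: extract information
        for record in records:
            save(record)                # step 3: save the data
        for link in links:              # step 4: follow links and repeat
            if link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1
```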

II. Downloading Pages

1. The urllib library
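
urllib ships with the standard library; as a quick reminder of what it looks like, a small sketch that builds a request with a query string and custom headers (the URL and parameters are placeholders, and `urlopen` would perform the actual download):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build a GET request with a query string and a custom User-Agent
# (example.com and the parameters are placeholders).
params = urlencode({'wd': 'python', 'page': 1})
url = 'https://www.example.com/s?' + params
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.full_url)
# urllib.request.urlopen(req) would send it and return a response object.
```
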
2. The Requests library
2.1 About Requests
  • requests is a simple, easy-to-use HTTP library.
  • It provides everything urllib does and is implemented on top of urllib3.
  • It is more convenient than urllib and fully meets HTTP testing needs.
2.2 Common methods

1) requests.get(url,params=None, **kwargs)

  • Sends a GET request to url and returns a response object.
  • params, headers, and similar options can be passed as dictionaries.
>>>import requests

>>>url = "https://www.baidu.com/"
>>>res = requests.get(url)

>>>print(f'status code: {res.status_code}')
>>>print(f'content: {res.text}')
status code: 200
content: <!DOCTYPE html><!--STATUS OK--><html>
<head><meta http-equiv=content-type content=text/html;charset=utf-8>... (the rest of the body comes back garbled because requests guessed the wrong text encoding)
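
The garbled body above is an encoding problem, not a download problem: when a server does not declare a charset for a text response, requests falls back to ISO-8859-1, while Baidu's page is actually UTF-8. Setting `res.encoding = 'utf-8'` (or `res.encoding = res.apparent_encoding`) before reading `res.text` should restore it. The effect can be reproduced offline:

```python
# UTF-8 bytes decoded with the wrong codec produce the kind of
# mojibake shown above; re-decoding with UTF-8 restores the text.
raw = '百度一下'.encode('utf-8')      # bytes as the server sends them
garbled = raw.decode('iso-8859-1')    # requests' fallback guess
fixed = raw.decode('utf-8')           # the correct codec
print(garbled)  # unreadable
print(fixed)    # 百度一下
```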

2) requests.post(url, data=None, json=None, **kwargs)

  • Sends a POST request to url and returns a response object.
>>>import requests

>>>url = "https://www.httpbin.org/post"
>>>data = {'name':'test'}
>>>res = requests.post(url,data=data)

>>>print(f'status code: {res.status_code}')
>>>print(f'content: {res.text}')
status code: 200
content: {
 "args": {}, 
 "data": "", 
 "files": {}, 
 "form": {
   "name": "test"
 }, 
 "headers": {
   "Accept": "*/*", 
   "Accept-Encoding": "gzip, deflate", 
   "Content-Length": "9", 
   "Content-Type": "application/x-www-form-urlencoded", 
   "Host": "www.httpbin.org", 
   "User-Agent": "python-requests/2.22.0", 
   "X-Amzn-Trace-Id": "Root=1-5f056eb5-b711fdfcc85ef0a2d6e6b8fc"
 }, 
 "json": null, 
 "origin": "122.115.236.202", 
 "url": "https://www.httpbin.org/post"
}

3) Cookies

  • Cookies can be read directly from the response object.
>>>import requests

>>>url = "https://www.baidu.com"
>>>res = requests.get(url)
>>>cookies = res.cookies
>>>for k,v in cookies.items():
>>>    print(f'{k}={v}')
BDORZ=27315
  • Cookies can be added to the request headers to stay logged in.
>>>import requests

>>>url = "https://www.baidu.com"
>>>headers = {
>>>    'Cookie':'BIDUPSID=8DA5516860A041C8C2682A9F6CE8310A; PSTM=1488431109; HMACCOUNT=F82546873707B865; BAIDUID=2E7ED6E71EC32E4E738D8FBA51115B51:FG=1; H_WISE_SIDS=147417_146789_143879_148320_147087_141744_147887_148194_148209_147279_146536_148001_148823_147848_147762_147828_147639_148754_147897_148524_149194_127969_149061_147239_147350_142420_146653_147024_146732_138425_131423_144659_142209_147527_145597_126063_107311_147304_146339_148029_147212_143507_144966_145607_148071_139882_146786_148345_147547_146056_145395_148869_110085; MCITY=-%3A; H_PS_PSSID=; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=hW-OJeC62R6WKOTr3f6CbPEHwe5B58TTH6aoDIGUqq8sj7AJuNMnEG0PoM8g0Ku-S2-BogKK0mOTH6KF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJAq_Dt-tC83jb7G2tu_KPk8hx6054CX2C8sVUP2BhcqEIL40ljIDUkVKGb-blcaLgT7Lh5VylRPqxbSj4Qo-RFkjU6z0b5C22on5MK-Qh5nhMJSb67JDMPF-GoKhlby523ion6vQpP-OpQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0Djb-Datttjn2aIOt0Tr25RrjeJrmq4bohjP3jaO9BtQO-DOxoM7xynrKhpcOy45mQfkWbtRi-qKeQgnk2p523-Tao-Ooj4-WQlKNWGo30x-jLTny3l3ebxAVDPP9QtnJyUnQbPnnBT5i3H8HL4nv2JcJbM5m3x6qLTKkQN3T-PKO5bRu_CFhJKI2MIKCePbShnLOqlO-2tJ-ajPX3b7EfMnnsl7_bJ7KhUbyBn3v2JDe5jbt3fJNWP3qOpvC36bxQhFTQqOxhRov-20q3KnIQqrAMJnHQT3m5-4_QUOtyKryMnb4Wb3cWKJJ8UbSjxRPBTD02-nBat-OQ6npaJ5nJq5nhMJmb67JDMr0eGLeqT_JtJ-s06rtKRTffjrnhPF32J8PXP6-3bbu2GnIoP3K-DOtMRQP0Mo15PLU-J5eLp37JD6y--Ox-hcBEn0lhjOk5fCIh4oxJpOdMnbMopvaHx8KKqovbURvD-ug3-AqBM5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-j5JIEoC8ytC-KbKvNq45HMt00qxby26n3Q-j9aJ5nJDoCMx3oXpOKDP5BLU5T0xvf5RTG_CnmQpP-HJ7eyfJlKJobKUCf3xTtBejXKl0MLn7Ybb0xyn_V0TjDLxnMBMPjamOnaIQc3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjbXDHQP; delPer=0; PSINO=1; HMVT=6bcd52f51e9b3dce32bec4a3997715ac|1594190119|; BDRCVFR[tFA6N9pQGI3]=mk3SLVN4HKm',
>>>    'Host':'#',
>>>    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
>>>}
>>>res = requests.get(url,headers=headers)
>>>print(res.cookies)
<RequestsCookieJar[<Cookie H_PS_PSSID=32190_1431_31326_32140_31253_32046_32231_32260 for .baidu.com/>, <Cookie BDSVRTM=14 for #/>, <Cookie BD_HOME=1 for #/>]>
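
Rather than pasting the whole Cookie header into `headers`, the same string can be parsed into a dict and passed through the `cookies=` parameter of `requests.get`. A standard-library sketch of the parsing step (the cookie values here are shortened placeholders):

```python
from http.cookies import SimpleCookie

# Parse a browser "Cookie:" header value into a name -> value dict.
header_value = 'BIDUPSID=abc123; PSTM=1488431109; BDORZ=27315'
jar = SimpleCookie()
jar.load(header_value)
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
# requests.get(url, cookies=cookies) would then send them automatically.
```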

4) Session

  • Used to maintain a session across requests; not the same concept as a server-side session.
>>>import requests

>>>url = "https://httpbin.org"
>>>s = requests.Session()
>>>s.get(url+'/cookies/set/name/test')
>>>r = s.get(url+'/cookies') 
>>>print(r.text) # cookies set by the previous request are kept by the session
{
 "cookies": {
   "name": "test"
 }
}

5) SSL certificate verification

  • Controls whether the server's TLS certificate is verified.
>>>import requests

>>>url = "https://www.baidu.com"
>>>res = requests.get(url,verify=False) # skip certificate verification
D:\Anaconda3\lib\site-packages\urllib3\connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.baidu.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
 InsecureRequestWarning,
  • Specify a local client certificate.
>>>import requests

>>>url = "https://www.baidu.com"
>>>res = requests.get(url,cert=('server.crt','key')) # supply a local certificate and key

6) Setting a proxy

  • Set a proxy through the proxies parameter.
>>>import requests

>>>url = "https://www.baidu.com"
>>>proxies = {
>>>    "http":"http://host:port", # your proxy server address
>>>    "https":"https://host:port"}
>>>res = requests.get(url,proxies=proxies)
  • SOCKS proxies are also supported (this requires installing the requests[socks] extra).
>>>import requests

>>>url = "https://www.baidu.com"
>>>proxies = {
>>>    "http":"socks5://0.0.0.0:10005", # your proxy server address
>>>    "https":"socks5://0.0.0.0:10006"
>>>}
>>>res = requests.get(url,proxies=proxies)

7) Setting timeouts

  • Use the timeout parameter to set a timeout.
  • To set the connect and read timeouts separately, pass a tuple.
>>>import requests

>>>url = "https://www.bbaidu.com"
>>>try:
>>>    res = requests.get(url,timeout = (3,5))
>>>except Exception as e:
>>>    print(e)
HTTPSConnectionPool(host='www.bbaidu.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x000001D9C0F72248>, 'Connection to www.bbaidu.com timed out. (connect timeout=3)'))

8) Authentication

  • Use the auth parameter for authentication.
  • auth is a tuple containing the username and password.
  • HTTPBasicAuth is used under the hood.
>>>import requests

>>>url = "https://www.baidu.com"
>>>username = 'youruser'
>>>password = 'yourpassword'

>>>res = requests.get(url,auth=(username,password))
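
For reference, `auth=(username, password)` simply turns into an `Authorization: Basic <token>` header, where the token is the Base64 encoding of `username:password`; the header value can be computed by hand:

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic auth: Base64-encode "username:password".
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
    return 'Basic ' + token.decode('ascii')

print(basic_auth_header('youruser', 'yourpassword'))
```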

9) Using Prepared Requests

  • A Prepared Request packages the request parameters into a standalone Request object.
>>>from requests import Request,Session

>>>url1 = "https://www.baidu.com"
>>>url2 = "https://httpbin.org"
>>>data = {
>>>    'name':'test'
>>>}
>>>headers = {
>>>    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
>>>}

>>>session = Session()
>>>req1 = Request('GET',url1,headers=headers)
>>>req2 = Request('GET',url2,data=data,headers=headers)
>>>prepared1 = session.prepare_request(req1)
>>>prepared2 = session.prepare_request(req2)
>>>res1 = session.send(prepared1)
>>>res2 = session.send(prepared2)
>>>print(f"res1:{res1.url}")
>>>print(f"res2:{res2.url}")
res1:https://www.baidu.com/
res2:https://httpbin.org/

III. Extracting Information

1. Using regular expressions

1) Fetch the front page

>>>import requests

>>>def get_page(url):
>>>    # fetch the page content
>>>    headers = {
>>>        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
>>>    }
>>>    res = requests.get(url=url,headers=headers)
>>>    if res.status_code == 200:
>>>        return res.text
>>>    else:
>>>        return None

>>>def main():
>>>    # entry point
>>>    url = 'https://movie.douban.com/top250'
>>>    page_data = get_page(url)
>>>    if page_data:
>>>        print(len(page_data)) # the content is long; print only its length

>>>if __name__ == '__main__':
>>>    main()
63460

2) Study the HTML source and write the regex

  • HTML for the ranking section:
<ol class="grid_view">
       <li>
           <div class="item">
               <div class="pic">
                   <em class="">1</em>
                   <a >
                        <img width="100" alt="肖申克的救贖" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
                   </a>
               </div>
               <div class="info">
                   <div class="hd">
                       <a  class="">
                           <span class="title">肖申克的救贖</span>
                                   <span class="title">&nbsp;/&nbsp;The Shawshank >Redemption</span>
                               <span class="other">&nbsp;/&nbsp;月黑高飛(港)  /  刺激1995(臺(tái))</span>
                       </a>


                           <span class="playable">[可播放]</span>
                   </div>
                   <div class="bd">
                       <p class="">
                           導(dǎo)演: 弗蘭克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·羅賓斯 Tim Robbins /...<br>
                           1994&nbsp;/&nbsp;美國(guó)&nbsp;/&nbsp;犯罪 劇情
                       </p>

                       
                       <div class="star">
                               <span class="rating5-t"></span>
                               <span class="rating_num" property="v:average">9.7</span>
                               <span property="v:best" content="10.0"></span>
                               <span>2072827人評(píng)價(jià)</span>
                       </div>

                           <p class="quote">
                               <span class="inq">希望讓人自由。</span>
                           </p>
                   </div>
               </div>
           </div>
       </li>
... ... 
  • Write the regular expression based on the HTML:
>>> pattern = re.compile(
>>>            r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>[\s\S]*?導(dǎo)演: (.*?)&nbsp.*?主演: (.*?)<br>[\s\S]*?<span>(.*?人評(píng)價(jià))</span>'
>>>        )

3) Scrape the page content with the regex

>>>import requests,re

>>>def sort_data(func):
>>>    def deco(*args,**kargs):
>>>        # process the content
>>>        data = func(*args,**kargs)
>>>        pattern = re.compile(
>>>            r'<em class="">(.*?)</em>\n.*?<a href="(.*?)">[\s\S]*?class="title">(.*?)</span>[\s\S]*?導(dǎo)演: (.*?)&nbsp.*?主演: (.*?)<br>[\s\S]*?<span>(.*?人評(píng)價(jià))</span>'
>>>        )
>>>        items = re.findall(pattern,data)
>>>        for item in items:
>>>            yield {
>>>                'index':item[0],
>>>                'link':item[1],
>>>                'name':item[2],
>>>                'director':item[3],
>>>                'actors':item[4],
>>>                'post':item[5]
>>>            }
>>>    return deco

>>>@sort_data
>>>def get_page(url):
>>>    # fetch the page content
>>>    headers = {
>>>        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
>>>    }
>>>    res = requests.get(url=url,headers=headers)
>>>    if res.status_code == 200:
>>>        return res.text
>>>    else:
>>>        return None

>>>def show_result(data):
>>>    # print the results
>>>    for i in range(10):
>>>        print(next(data))

>>>def main():
>>>    # entry point
>>>    url = 'https://movie.douban.com/top250'
>>>    page_data = get_page(url)
>>>    show_result(page_data)

>>>if __name__ == '__main__':
>>>    main()
{'index': '1', 'link': 'https://movie.douban.com/subject/1292052/', 'name': '肖申克的救贖', 'director': '弗蘭克·德拉邦特 Frank Darabont', 'actors': '蒂姆·羅賓斯 Tim Robbins /...', 'post': '2072827人評(píng)價(jià)'}
{'index': '2', 'link': 'https://movie.douban.com/subject/1291546/', 'name': '霸王別姬', 'director': '陳凱歌 Kaige Chen', 'actors': '張國(guó)榮 Leslie Cheung / 張豐毅 Fengyi Zha...', 'post': '1536626人評(píng)價(jià)'}
{'index': '3', 'link': 'https://movie.douban.com/subject/1292720/', 'name': '阿甘正傳', 'director': '羅伯特·澤米吉斯 Robert Zemeckis', 'actors': '湯姆·漢克斯 Tom Hanks / ...', 'post': '1566647人評(píng)價(jià)'}
{'index': '4', 'link': 'https://movie.douban.com/subject/1295644/', 'name': '這個(gè)殺手不太冷', 'director': '呂克·貝松 Luc Besson', 'actors': '讓·雷諾 Jean Reno / 娜塔莉·波特曼 ...', 'post': '1757433人評(píng)價(jià)'}
{'index': '5', 'link': 'https://movie.douban.com/subject/1292063/', 'name': '美麗人生', 'director': '羅伯托·貝尼尼 Roberto Benigni', 'actors': '羅伯托·貝尼尼 Roberto Beni...', 'post': '982086人評(píng)價(jià)'}
{'index': '6', 'link': 'https://movie.douban.com/subject/1292722/', 'name': '泰坦尼克號(hào)', 'director': '詹姆斯·卡梅隆 James Cameron', 'actors': '萊昂納多·迪卡普里奧 Leonardo...', 'post': '1519400人評(píng)價(jià)'}
{'index': '7', 'link': 'https://movie.douban.com/subject/1291561/', 'name': '千與千尋', 'director': '宮崎駿 Hayao Miyazaki', 'actors': '柊瑠美 Rumi H?ragi / 入野自由 Miy...', 'post': '1627730人評(píng)價(jià)'}
{'index': '8', 'link': 'https://movie.douban.com/subject/1295124/', 'name': '辛德勒的名單', 'director': '史蒂文·斯皮爾伯格 Steven Spielberg', 'actors': '連姆·尼森 Liam Neeson...', 'post': '798211人評(píng)價(jià)'}
{'index': '9', 'link': 'https://movie.douban.com/subject/3541415/', 'name': '盜夢(mèng)空間', 'director': '克里斯托弗·諾蘭 Christopher Nolan', 'actors': '萊昂納多·迪卡普里奧 Le...', 'post': '1496572人評(píng)價(jià)'}
{'index': '10', 'link': 'https://movie.douban.com/subject/3011091/', 'name': '忠犬八公的故事', 'director': '萊塞·霍爾斯道姆 Lasse Hallstr?m', 'actors': '理查·基爾 Richard Ger...', 'post': '1041023人評(píng)價(jià)'}
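
Step three from section I, saving the data, is not shown above; one simple option is to dump each yielded dict to a JSON Lines file (a sketch; the file name is arbitrary, and `items` stands in for the generator returned by `get_page`):

```python
import json

def save_items(items, path='douban_top250.jsonl'):
    # One JSON object per line; ensure_ascii=False keeps Chinese readable.
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
```

In `main` above, `save_items(get_page(url))` would persist every record the generator yields.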


Author: 大師兄 (superkmi)
