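Crawler: a web crawler (bot) is a program that automatically fetches data from the internet.

In theory, any data we can see through a browser is data we can generally obtain.

What crawlers are used for:

1. Search engines
2. Product price comparison (Huihui Shopping Assistant)
3. Zhihu data analysis platforms (Zhihu columns, Data Iceberg)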

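Three key characteristics of web pages:

1. Every web page has a unique URL (Uniform Resource Locator) that locates it.
2. Web pages are presented as HTML (HyperText Markup Language) text.
3. All web pages are transferred over HTTP (HyperText Transfer Protocol) or HTTPS.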

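The crawling workflow

1. Analyze the site and determine the target URLs.
2. Send a request to each URL and get the page's HTML source.
3. Extract data from the page source:
   a. Extract the target data, then filter it and persist it.
   b. Extract new URLs from the page and repeat step 2 for them.
4. The crawl ends when every target URL has been processed, the data has been collected, and no request tasks remain.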

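How a general-purpose crawler fetches pages:

1. Pick a set of URLs as seed URLs and put them in the queue of URLs waiting to be crawled.
2. Take a URL from the pending queue, send a request, save the returned page source locally, and move the URL to the already-crawled queue.
3. Parse the responses of crawled URLs for further URLs and add them to the pending queue. This loop repeats until all URLs have been extracted.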
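The queue-based loop described above can be sketched as follows. This is a minimal illustration rather than a real crawler: `FAKE_WEB`, `fetch_links`, and the example.com URLs are made up so that the loop runs without any network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for 'request the page and extract its URLs'."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls):
    pending = deque(seed_urls)   # queue of URLs waiting to be crawled
    queued = set(seed_urls)      # every URL ever queued, to avoid duplicates
    crawled = []                 # URLs already crawled, in visit order
    while pending:
        url = pending.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in queued:
                queued.add(link)
                pending.append(link)
    return crawled


order = crawl(["http://example.com/"])
print(order)
```

Tracking every queued URL in a set is what stops the loop from revisiting pages that link back to each other.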

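Drawbacks of general-purpose crawlers:

1) They must obey the robots protocol (robots.txt), a convention that tells search engines which directories may be crawled and which may not:
'User-agent': identifies which search engine's crawler the rules apply to
'Allow': URLs that may be crawled
'Disallow': URLs that must not be crawled
2) Search engines return whole web pages, and about 90% of what they return is useless data.
3) They cannot return different results for different user needs or search intents.
4) They cannot retrieve multimedia files.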
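As a rough illustration of the robots protocol mentioned in point 1, the standard library's `urllib.robotparser` can evaluate such rules. The robots.txt content, the crawler name `MyCrawler`, and the example.com URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("MyCrawler", "http://example.com/public/page.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/page.html")
print(allowed, blocked)
```

Against a real site, `set_url()` followed by `read()` would download the site's own robots.txt instead of parsing a local string.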

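The seven OSI layers

From top to bottom:
Application layer: provides network services to user applications (HTTP, HTTPS, FTP, ...)
Presentation layer: ensures that data sent by one host can be understood and recognized by another, and passes it to the session layer in an agreed format
Session layer: manages sessions between hosts; establishes, maintains, and terminates them
Transport layer: transfers data (TCP, UDP)
Network layer: routers
Data link layer: bridges, switches
Physical layer: cables, network cards, hubs, repeaters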

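Common HTTP status codes

200: request succeeded

301: permanent redirect
302: temporary redirect

401: unauthorized
403: server refused access (forbidden)
404: page not found
405: request method not allowed
408: request timed out

500: internal server error
503: service unavailable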
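The codes above can be looked up programmatically. The sketch below uses the standard library's `http.HTTPStatus`; the helper `describe` and its category names are illustrative choices, not part of any library.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard reason phrase and a coarse category for a status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return phrase, category


for code in (200, 301, 302, 401, 403, 404, 405, 408, 500, 503):
    print(code, *describe(code))
```

A crawler would typically retry on the server-error group and give up, or fix the request, on the client-error group.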

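Sending a request:

A request carries request headers:
'User-Agent': identifies the client; crawlers set it to imitate a browser
'Cookie': cookies are stored in the browser and sent to identify the user
'Referer': indicates which page the current request was sent from

Sending requests with urllib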

import ssl
from urllib import request

# Target URL
url = 'http://www.baidu.com/'

# request.urlopen(): send a request to the URL, much as a browser would
"""
url: the target URL
data=None: None (the default) sends a GET request; anything else sends a POST request
timeout=: request timeout in seconds
cafile=None: CA certificate file (deprecated since Python 3.6; prefer context)
capath=None: directory of CA certificates (deprecated since Python 3.6)
cadefault=False: whether to use the default certificates (ignored)
context=None: an ssl.SSLContext describing the SSL options
"""

# If a request fails with an SSL certificate verification error,
# ssl._create_unverified_context() gives a context that skips verification.
context = ssl._create_unverified_context()
response = request.urlopen(url, timeout=10, context=context)
# Read values from the response
# Status code
code = response.status
print(code)
# Page source as bytes
b_html = response.read()
print(type(b_html), len(b_html))
# Response headers
res_headers = response.getheaders()
print(res_headers)
# Value of one specific response header
cookie_data = response.getheader('Set-Cookie')
print(cookie_data)
# reason: the reason phrase that goes with the status code
reason = response.reason
print(reason)

# Decode the bytes into a string
str_html = b_html.decode('utf-8')
print(type(str_html))

with open('b_baidu.page.html', 'w', encoding='utf-8') as file:
    # file.write(b_html)
    file.write(str_html)

# To send request headers with a request

# First build a Request object
"""
url: the URL to request
data=None: None (the default) sends a GET request; anything else sends a POST request
headers={}: request headers (a dict)
origin_req_host=None: the host of the origin transaction
unverifiable=False: whether the request is unverifiable, i.e. not explicitly approved by the user (RFC 2965); unrelated to SSL
method=None: the HTTP method to use
"""
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
req = request.Request(url,headers=req_header)

# Send the request using the Request object
response = request.urlopen(req)
print(response.status)
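The effect of the `data` and `headers` arguments can be checked on the `Request` object itself, before anything is sent. A small sketch follows; the shortened User-Agent string and the `kw` form field are placeholders chosen for the example.

```python
from urllib import parse, request

url = 'http://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA string, for illustration only

# No data: the request will be sent as GET
get_req = request.Request(url, headers=headers)

# With data (bytes): the request will be sent as POST
form = parse.urlencode({'kw': 'python'}).encode('utf-8')  # hypothetical form field
post_req = request.Request(url, data=form, headers=headers)

print(get_req.get_method())
print(post_req.get_method())
# Request stores header names capitalized, so look them up as 'User-agent'
print(get_req.get_header('User-agent'))
```

Nothing here touches the network; the objects only describe the request that `urlopen` would send.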

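Regular expressions

. : matches any character except a newline
\ : escape character
[a-z] : matches any single character in the range a-z

\d: matches a digit -> [0-9]
\D: matches a non-digit -> [^\d]
\s: matches a whitespace character (space, \n, \t, ...)
\S: matches a non-whitespace character
\w: matches a word character [A-Za-z0-9_]
\W: matches a non-word character

^: matches at the start of the string
$: matches at the end of the string

(): grouping
|: alternation (or)

Quantifiers
*: matches the preceding item any number of times (including zero)
+: matches the preceding item at least once
?: matches the preceding item zero or one time
{m}: matches the preceding item exactly m times
{m,n}: matches the preceding item between m and n times

Non-greedy (lazy) matching
*?
+?
??
{m,n}?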
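The difference between greedy and non-greedy quantifiers is easiest to see side by side. A small sketch, using a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two links
html = '<a href="/p/1">first</a><a href="/p/2">second</a>'

# Greedy: .* grabs as much as possible, up to the last '>'
greedy = re.findall(r'<a.*>', html)
# Non-greedy: .*? stops at the first '>' that completes a match
lazy = re.findall(r'<a.*?>', html)

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string as a single match, while the non-greedy one returns each opening tag separately, which is usually what page extraction needs.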

Examples

import re

# Compile the regular expression into a pattern object
sub_str = 'abcdefabcd'
pattern = re.compile('b')
# re.match matches from the start of the string, so the start itself must fit the pattern.
# It returns a match object on success and None otherwise; it matches only once.
result = re.match(pattern,sub_str)
print(type(result))
if result:
    print(result.group())

# re.search scans the whole string; it also matches once and returns the first match,
# or None if nothing matches
result = re.search(pattern,sub_str)
if result:
    print(result.group())

# re.findall scans the whole string and returns every match
# as a list
result = re.findall(pattern,sub_str)
print(result)

# re.finditer scans the whole string and finds every match,
# but returns an iterator
result = re.finditer(pattern,sub_str)
# <class 'callable_iterator'>
print(type(result))
for note in result:
    # <class 're.Match'> (shown as '_sre.SRE_Match' before Python 3.7)
    print(type(note))
    print(note.group())
# Replacing with re.sub()
url = 'http://www.baidu.com/s?kw=aaa&pn=20'
# pattern: the regular expression
# repl: the replacement string
# string: the original string
pattern = re.compile(r'pn=\d+')
result = re.sub(pattern,'pn=30',url)
print(result)

# Splitting with re.split()
pattern = re.compile('[=:&]')
# pattern, string
result = re.split(pattern,url)
print(result)

sub_html = """
<div class="threadlist_title pull_left j_th_tit ">
    
    
    <a rel="noreferrer" href="/p/5982749825" title="來聊" target="_blank" class="j_th_tit ">來聊</a>
</div>
"""
# re.S lets the dot match any character, including newlines
pattern = re.compile(
    '<div.*?class="threadlist_title pull_left j_th_tit ">'+
    '.*?<a.*?href="(.*?)".*?</div>',re.S
)

result = re.findall(pattern,sub_html)
print(result)

Using cookies

from urllib import request
# Target URL:
# https://www.douban.com/people/175417123/

url = 'https://www.douban.com/people/175417123/'

# Set the request headers
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Cookie': 'bid=a5HEJxBlOLY; douban-fav-remind=1; ll="108288"; _vwo_uuid_v2=D4EC41D965B5FF84E814BF48AE34386A5|f6047e628a6acc98722e3100dfbc399c; __yadk_uid=KpfeK1cgib5IWMKWvG66MnJMbonYvDZa; _ga=GA1.2.36315712.1531837787; douban-profile-remind=1; push_doumail_num=0; push_noty_num=0; __utmv=30149280.17541; _gid=GA1.2.2070226630.1545227516; ps=y; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1545292749%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DqpgHwc2FYGfOrGzt4yK3ZwwbraVm_oED80whnivpFaC3kA5IvGnUfQ9FSRZBOEVh%26wd%3D%26eqid%3D85ed620e000397aa000000065c1b4bc8%22%5D; _pk_ses.100001.8cb4=*; __utma=30149280.36315712.1531837787.1545227511.1545292752.48; __utmc=30149280; __utmz=30149280.1545292752.48.37.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmt=1; dbcl2="175417123:yY3DyEelGJE"; ck=FxtM; ap_v=0,6.0; _pk_id.100001.8cb4=7521abce5497fc2c.1531837784.39.1545292778.1545228853.; __utmb=30149280.7.10.1545292752',
}

req = request.Request(url=url,headers=req_header)

response = request.urlopen(req)

if response.status == 200:
    with open('douban.html','w') as file:
        file.write(response.read().decode('utf-8'))

Using CookieJar

# Purpose of CookieJar: it manages cookies and stores their values.
# Once cookies are stored, they are sent automatically with later requests.
# The cookies live in memory and are garbage-collected with the jar.

from urllib import request,parse
from http.cookiejar import CookieJar

# Create the CookieJar object (purpose described above)
cookie_jar = CookieJar()

# HTTPCookieProcessor creates a handler that manages the CookieJar
handler = request.HTTPCookieProcessor(cookie_jar)

# Build a custom opener
opener = request.build_opener(handler)

# From inspecting the login request:
# https://www.douban.com/accounts/login
# Without a captcha
# source: index_nav
# form_email: 18518753265
# form_password: ljh12345678

# With a captcha
# source: index_nav
# form_email: 18518753265
# form_password: ljh12345678
# captcha-solution: blade
# captcha-id: 5IBtw5wm2riyrIrnV3utwUPt:en

url = 'https://www.douban.com/accounts/login'

form_data = {
    'source': 'index_nav',
    'form_email': '18518753265',
    'form_password': 'ljh12345678',
    'captcha-solution': 'noise',
    'captcha-id': 'waNQIJD6TkMaF4M51PFg5kYh:en'
}

form_data = parse.urlencode(form_data).encode('utf-8')

# Set the request headers
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

# Build a Request object
req = request.Request(url,headers=req_header,data=form_data)

# Send the request
response = opener.open(req)

# After logging in, request the profile page. If the profile data comes back,
# the cookies were stored and sent with the next request.
url = 'https://www.douban.com/people/175417123/'

req = request.Request(url,headers=req_header)

response = opener.open(req)

if response.status == 200:
    with open('douban.html','w') as file:
        file.write(response.read().decode('utf-8'))

Using urllib.error

# urllib.error: a request can fail for many reasons and raise an exception
# that crashes the program, so we need to handle these failures

from urllib import error,request

# error.URLError

def check_urlerror():
    """
    1. No network connection
    2. Connection to the server failed
    3. The specified server cannot be found
    :return:
    """
    url = 'http://www.baidu.com/'
    try:
        response = request.urlopen(url,timeout=0.01)
        print(response.status)
    except error.URLError as err:
        # [Errno -2] Name or service not known (unknown server)
        # timed out: the request timed out
        # [Errno -3] Temporary failure in name resolution (no network)
        print(err.reason)

# check_urlerror()

# error.HTTPError is a subclass of URLError

def check_httperror():
    url = 'https://www.qidian.com/all/nsacnscn.htm'
    try:
        response = request.urlopen(url)
        print(response.status)
    except error.HTTPError as err:
        # Three attributes of HTTPError
        # Status code
        print(err.code)
        # Reason for the error
        print(err.reason)
        # Response headers
        print(err.headers)
        print(err.headers)
    except error.URLError as err:
        print(err.reason)

check_httperror()
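Because HTTPError is a subclass of URLError, the order of the `except` clauses matters: the more specific HTTPError has to be caught first. The sketch below shows this without touching the network. `fetch_status`, `raise_404`, and `raise_dns` are hypothetical helpers written for this example, and the stand-in openers only imitate the ways `urlopen` can fail.

```python
from urllib import error, request


def fetch_status(url, opener=request.urlopen):
    """Return a short description of what happened when requesting url."""
    try:
        with opener(url, timeout=10) as response:
            return 'ok %d' % response.status
    except error.HTTPError as err:       # must come before URLError
        return 'http error %d' % err.code
    except error.URLError as err:
        return 'url error: %s' % err.reason


# Stand-in openers (hypothetical) that fail the way urlopen can.
def raise_404(url, timeout=None):
    raise error.HTTPError(url, 404, 'Not Found', None, None)


def raise_dns(url, timeout=None):
    raise error.URLError('Name or service not known')


is_subclass = issubclass(error.HTTPError, error.URLError)
http_result = fetch_status('http://example.com/missing', opener=raise_404)
url_result = fetch_status('http://example.com/', opener=raise_dns)
print(is_subclass, http_result, url_result)
```

If the two `except` clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.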

Using urllib's parse module

# urllib's parse module handles URL parsing, joining, encoding, and decoding

from urllib import  parse

# parse.urlparse identifies and splits the parts of a URL
url = 'https://www.1712B.com/daxuesheng?name=zhangsan#123'
"""
url: the URL to parse and split
scheme='': default scheme, used only when the URL itself has none
allow_fragments=True: whether to split out the fragment (anchor); True (the default) parses it as a separate field
"""
result = parse.urlparse(url)
"""
(scheme='https' (protocol), netloc='www.1712B.com' (domain),
path='/daxuesheng' (path), params='' (optional parameters),
query='name=zhangsan' (query string), fragment='123' (anchor))
"""
print(result)
# Read one field of the split result
print(result.scheme)

# parse.urlunparse reassembles a URL from its parts
data = [sub_str for sub_str in result]
print('-----',data)
full_url = parse.urlunparse(data)
print(full_url)

# parse.urljoin takes a base URL and uses it to complete a partial URL
sub_url = '/p/123456'
base_url = 'https://www.1712B.com/daxuesheng?name=zhangsan#123'
full_url = parse.urljoin(base_url,sub_url)
print('urljoin',full_url)

# parse.urlencode serializes a dict of parameters into a URL-encoded string
parmars = {
    'name':'張三',
    'class':'1712B',
}
result = parse.urlencode(parmars)
print('urlencode',result)

# parse.parse_qs deserializes a URL-encoded string back into a dict (each value is a list)
result = parse.parse_qs(result)
print('parse_qs',result)

# parse.quote converts Chinese (non-ASCII) characters to URL encoding
kw = '摸摸摸'
result = parse.quote(kw)
print('quote',result)

# Decode a URL-encoded string
result = parse.unquote(result)
print('unquote',result)


# urljoin and urlencode are the two most commonly used functions
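Since urljoin and urlencode come up most often, here is a short sketch of how urljoin resolves different kinds of partial URLs and how urlencode pairs with parse_qs. The example.com URLs and the `kw`/`pn` parameters are made up for the illustration.

```python
from urllib import parse

base = 'https://example.com/a/b/c.html?x=1#top'  # hypothetical page URL

root_relative = parse.urljoin(base, '/p/123456')               # path from the site root
same_dir = parse.urljoin(base, 'd.html')                       # relative to the current directory
parent_dir = parse.urljoin(base, '../e.html')                  # one directory up
scheme_relative = parse.urljoin(base, '//cdn.example.com/x.js')  # inherits only the scheme
absolute = parse.urljoin(base, 'http://other.example/')        # already complete, returned as-is

query = parse.urlencode({'kw': 'python', 'pn': 20})
params = parse.parse_qs(query)

for value in (root_relative, same_dir, parent_dir, scheme_relative, absolute, query, params):
    print(value)
```

Note that parse_qs returns every value as a list of strings, because a query string may repeat the same key.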

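Using a proxy with urllib

# Using a proxy with urllib
# HTTP/HTTPS proxies
# Use a high-anonymity proxy
# so that the real IP stays hidden

from urllib import request

# A custom ProxyHandler configures the proxy used to send requests
# proxies: a dict mapping scheme to proxy address
# There are free proxies (Xici, Kuaidaili, ...)
# and paid proxies (Xici, Kuaidaili, ..., Abuyun, ...)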
# proxies = {
#     'http':'118.187.58.34:53281',
#     'https':'124.235.180.121:80',
# }

# Dedicated (private) proxy that requires a username and password
proxies = {
    'http':'http://2295808193:6can7hyh@106.12.23.200:16818',
    'https':'https://2295808193:6can7hyh@106.12.23.200:16818'
}
handler = request.ProxyHandler(proxies=proxies)

# Build a custom opener
opener = request.build_opener(handler)

# Target URL
#https://httpbin.org/get
url = 'http://httpbin.org/get'

response = opener.open(url)

print(response.status)
print(response.read().decode('utf-8'))

The requests module

# pip3 install requests
# requests: a higher-level HTTP library (built on urllib3, not a wrapper around urllib)
# that covers what urllib does with a simpler, more convenient API

import requests

# url = 'http://www.baidu.com/'
url = 'http://www.sina.com'
# url: the target URL
# params: query parameters appended to the URL of a GET request
"""
:param method: the type of request to send
:param url: the target URL
:param params: query parameters appended to the URL of a GET request
:param data: Dictionary, form data for a POST request
:param json: data sent as a JSON request body; used like data above
:param headers: (optional) Dictionary of request headers
:param cookies: (optional) Dict or CookieJar object (cookies sent to imitate a logged-in user)
:param files: files to upload
:param auth: credentials the site requires (username and password)
:param timeout: request timeout
:param allow_redirects: bool, whether to follow redirects
:param proxies: (optional) Dictionary (proxy settings)
:param verify: Defaults to ``True``. Whether to verify the certificate; False skips verification
"""
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}
parmars = {
    'wd':'豆瓣'
}
# response = requests.get(url,params=parmars,headers=req_header)
response = requests.get(url,headers=req_header)
response.encoding='utf-8'

# Information available from the response
# (text is the decoded string)
html = response.text

"""
If response.text comes out garbled:
Option 1
response.content.decode('') with the page's actual encoding
Option 2
set response.encoding = '' to the page's actual encoding
"""

# Body as bytes
b_html = response.content
# Status code
code = response.status_code
# Response headers
response_headers = response.headers
# Request headers
req_headers = response.request.headers
# URL of the current request
current_url = response.url
# response.json(): converts a JSON string into Python data types
print(code)
print(html)

POST requests with requests

import requests

# url: the target URL
# data=None: form data to send with the POST request

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

form_data = {
    'first': 'true',
    'pn': 1,
    'kd': 'python',
}

# Set the request headers
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=&suginput=',
}

response = requests.post(url,data=form_data,headers=req_header)

print(response.status_code)

print(response.text)

# Convert the returned JSON string into Python data types
data = response.json()
print(type(data))

Authentication with requests

# HTTP authentication for a web client
import requests

# Set the credentials
auth = ('username','password')

url = 'http://192.168.1.110'

response = requests.get(url,auth=auth)

print(response.status_code)

Using cookies with requests

import requests
# From inspecting the login request:
# https://www.douban.com/accounts/login
# Without a captcha
# source: index_nav
# form_email: 18518753265
# form_password: ljh12345678

# With a captcha
# source: index_nav
# form_email: 18518753265
# form_password: ljh12345678
# captcha-solution: blade
# captcha-id: 5IBtw5wm2riyrIrnV3utwUPt:en

url = 'https://www.douban.com/accounts/login'

form_data = {
    'source': 'index_nav',
    'form_email': '18518753265',
    'form_password': 'ljh12345678',
    'captcha-solution': 'violent',
    'captcha-id': 'AuKNJ1FIktyrmpljJ6WAzXo3:en'
}

# Set the request headers
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

# Send the request
response = requests.post(url,headers=req_header,data=form_data)

# response.cookies holds the cookies set by the response
print('Cookies after the simulated login:',response.cookies)
print(type(response.cookies))
print(response.headers)

with open('douban.html','w') as file:
    file.write(response.text)

# requests.utils.cookiejar_from_dict(): converts a dict to a CookieJar
# requests.utils.dict_from_cookiejar(): converts a CookieJar to a dict
cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies_dict)
# After logging in, request the profile page. If the profile data comes back,
# the login cookies were accepted when passed along with the request.
url = 'https://www.douban.com/people/175417123/'
# Pass cookies= to send the request as the logged-in user
response = requests.get(url,headers=req_header,cookies=cookies_dict)

if response.status_code == 200:

    with open('douban1.html','w') as file:

        file.write(response.text)

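Setting a proxy with requests

import requests

proxies = {
    # Plain proxies
    'http':'219.238.186.188:8118',
    'https':'222.76.204.110:808',
    # Proxies that need authentication use this form instead
    # (a dict keeps only one value per key, so use one pair or the other):
    # 'https':'https://username:password@ip:port',
    # 'http':'http://username:password@ip:port'
}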

url = 'https://httpbin.org/get'

response = requests.get(url,proxies=proxies,timeout=10)

print(response.text)

Using requests.session

# requests.session(): keeps a session alive so that certain parameters (such as cookies) persist across requests


import requests

# Create the session
session = requests.session()

# Target URL
url = 'https://www.douban.com/accounts/login'

form_data = {
    'source': 'index_nav',
    'form_email': '18518753265',
    'form_password': 'ljh12345678',
    'captcha-solution': 'stamp',
    'captcha-id': 'b3dssX515MsmNaklBX8uh5Ab:en'
}

# Set the request headers
req_header = {
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

# Send the request through the session
response = session.post(url,headers=req_header,data=form_data)

if response.status_code == 200:

    # Request the profile page:
    url = 'https://www.douban.com/people/175417123/'

    response = session.get(url,headers = req_header)

    if response.status_code == 200:

        with open('douban3.html','w') as file:

            file.write(response.text)

Using XPath

XPath: a language for finding information in XML; it traverses the elements of an XML document and extracts their attributes.
XML: a markup language designed for transporting data; its structure is very similar to HTML.

Common XPath syntax:

nodename selects the child nodes of the current node that are named nodename
/        selects from the root node
//       matches nodes anywhere in the document, regardless of position
.        selects the current node
..       selects the parent of the current node
a/@href        gets an attribute value of the tag
a/text()       gets the text of the tag
a[@class="123"] finds tags by their class attribute
a[@id="123"]    finds tags by their id attribute

a[@id="123"][last()]  the last a tag whose id is 123
a[@id="123"][position() < 3]  the first two a tags whose id is 123
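A few of these expressions can be tried with the standard library alone. Note the assumption: `xml.etree.ElementTree` implements only a subset of XPath (no `@href` or `text()` steps, for instance), so attributes and text are read with `.get()` and `.text` here. Crawlers more commonly use a full XPath engine such as lxml. The snippet below is a made-up, well-formed fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real HTML usually needs a forgiving parser.
doc = ET.fromstring("""
<div>
  <a class="nav" href="/p/1">first</a>
  <a class="nav" href="/p/2">second</a>
  <a class="ext" href="/p/3">third</a>
</div>
""")

# //a[@class="nav"]: find a tags anywhere below the root by attribute value
nav_links = doc.findall(".//a[@class='nav']")
hrefs = [a.get('href') for a in nav_links]   # stands in for a/@href
texts = [a.text for a in nav_links]          # stands in for a/text()

# a[last()] and a[1]: positional predicates among the root's children
last_link = doc.find("./a[last()]")
first_link = doc.find("./a[1]")

print(hrefs, texts, last_link.get('href'), first_link.text)
```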
