
<blockquote>
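<p>Background: I am not a computer science major and have only basic Python experience. I taught myself from Liao Xuefeng's "Learn Python the Hard Way (4th edition)" and the NetEase Open Course "MIT OpenCourseWare: Introduction to Computer Science and Programming" (which uses Python as its language), and I bought a paper copy of "Core Python Programming" to look things up when I get stuck.
On scraping specifically, I have read O'Reilly's "Web Scraping with Python".</p>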
<p>Goal: scrape book information from Douban, including title, author, translator, publisher and publication date, number of ratings, and star rating, and store it in a MySQL database.</p>
<p>Tools: Python 3.5, Jupyter Notebook, Sublime Text, MySQL.</p>
</blockquote>
<h3>As I understand it, scraping breaks down into three steps:</h3>
<ul>
<li>Fetch the page content</li>
<li>Parse the page content</li>
<li>Store the scraped data</li>
</ul>
<h3>(1) Fetching the page content</h3>
<blockquote>
<p>The key is to make the scraper look like a browser when it sends requests.
Relevant knowledge: HTTP basics, cookies, sessions, and a few Python packages such as requests and beautifulsoup.</p>
</blockquote>
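<p>1. Import the requests and beautifulsoup libraries. Copy the browser's headers (visible in the request details of the browser's developer tools); this is the step that disguises the scraper as a browser. Whether you also need a cookie or a session depends on the site.</p>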
<pre><code>import requests
from bs4 import BeautifulSoup

# Browser-like headers copied from the browser's developer tools
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}
url = "https://book.douban.com/"
session = requests.session()  # a session keeps cookies across requests
</code></pre>

<p>2. Fetch the page content. Relevant knowledge: HTTP basics, the beautifulsoup and requests packages, and the functions BeautifulSoup(), findAll(), and find(). The 'lxml' parser has to be installed separately.</p>
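<pre><code>def get_Content(url):
    # Request the page with the browser-like headers
    res = session.get(url, headers=headers)
    req = BeautifulSoup(res.content, 'lxml')
    # Each book on a listing page sits in a div with class "info"
    book_content = req.findAll("div", {"class": "info"})
    return book_content
</code></pre>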

<h3>(2) Parsing the page content</h3>
<blockquote>
<p>Relevant knowledge: regular expressions, type conversion, for loops, find() and findAll(), get_text(), sub(), replace(), join(), split(), and so on.</p>
</blockquote>
<p>1. Import re (the regular expression library) and iterate over the page content returned by get_Content. Note that findAll returns a list, which is how it differs from find. Use find and get_text to pull out the book details as str values. The raw text is not in the format I want, so it needs cleaning: strip \n, extra whitespace, and parentheses, and from a field such as "4524人評價" ("4524 ratings") extract 4524 and store it as a number. Why check <code>if not rate</code>? Some books have too few ratings and therefore no rating element at all; in that case set rating=0, otherwise the scraper raises an error and stops when it reaches that book. When the rating exists, float() converts it to a floating-point number. To handle both "234評價" and "少于10人評價" ("fewer than 10 ratings"), split() with the regular expression \D+ (non-digits) separates the digits from the Chinese characters, and join keeps only the count, which is then converted to int.</p>
<pre><code>import re

def get_items(book_content):
    global Tag
    for i in book_content:
        title = get_cleandata(i.find("a").get_text())
        tran = get_cleandata(i.find("div", {"class": "pub"}).get_text())
        rate = i.find("span", {"class": "rating_nums"})
        if not rate:
            # Books with too few ratings have no rating element
            rating = 0
        else:
            rating = float(get_cleandata(rate.get_text()))
        # Keep only the digits of the rating count
        pl = int(''.join(re.split(r'\D+', get_cleandata(i.find("span", {"class": "pl"}).get_text()))))
        store(Tag, title, tran, rating, pl)

def get_cleandata(data):
    # Strip newlines, parentheses, and spaces
    cleandata = re.sub(r"\n+", "", data)
    cleandata = cleandata.replace("(", "")
    cleandata = cleandata.replace(")", "")
    cleandata = re.sub(" +", "", cleandata)
    return cleandata
</code></pre>
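<p>The cleaning rules described above can also be sketched with the standard library alone. The helper names <code>clean_text</code>, <code>parse_count</code>, and <code>parse_rating</code> and the sample strings are hypothetical and not part of the original script; the sketch assumes a count field contains at most one run of digits.</p>

```python
import re

def clean_text(data):
    """Remove newlines, spaces, and ASCII parentheses from scraped text."""
    return re.sub(r"[\n ()]+", "", data)

def parse_count(text):
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0

def parse_rating(text):
    """Return the rating as a float, or 0.0 when the rating text is missing."""
    return float(text) if text else 0.0

# Illustrative inputs only; real page text may differ.
print(clean_text("\n  (4524人評價)\n "))
print(parse_count("少于10人評價"))
print(parse_rating(None))
```

<p>One difference from the split/join approach: when a field contains no digits at all, <code>parse_count</code> returns 0, whereas <code>int('')</code> would raise a ValueError.</p>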

<p>2. Collect all the URLs to scrape. There are generally a few ways to do this:</p>
<blockquote>
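<p>1. Read the "next page" URL from each page and keep looping, scraping one page after another, until the next-page URL comes back empty.</p>
<p>2. Compare the URLs of different pages and find the pattern. On Douban Books, for example, the start parameter grows by 20 for each page (https://book.douban.com/tag/小說?start=20&type=T), so you can generate every URL up front and keep scraping until a page comes back with no content.</p>
<p>I used the second approach.</p>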
</blockquote>
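<p>For comparison, the first approach (following "next page" links) might look roughly like the sketch below. It is not what this post uses. <code>crawl_by_next_link</code>, <code>fetch_page</code>, and <code>FAKE_SITE</code> are hypothetical names, and a dictionary stands in for real HTTP requests so that the loop can run on its own.</p>

```python
def crawl_by_next_link(start_url, fetch_page):
    """Follow 'next page' links until a page reports no next URL.

    fetch_page is a caller-supplied function (hypothetical here) that
    returns a tuple (items, next_url) for a given URL.
    """
    items = []
    seen = set()
    url = start_url
    while url and url not in seen:
        seen.add(url)  # guard against a page that links back to itself
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items

# Stand-in for real HTTP requests: three fake pages chained by next links.
FAKE_SITE = {
    "page1": (["book A", "book B"], "page2"),
    "page2": (["book C"], "page3"),
    "page3": (["book D"], None),
}

all_items = crawl_by_next_link("page1", FAKE_SITE.get)
print(all_items)
```

<p>In a real scraper, <code>fetch_page</code> would request the URL, parse the items, and read the next-page link from the HTML.</p>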
<p>1. First, check how many pages there are so you can pick a sensible upper bound. The largest page number I saw on Douban was 99.</p>
<pre><code>a=[a*20 for a in range(0,100)]
</code></pre>

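<p>The query parameters are appended to the URL with urllib.parse.urlencode. The loop stops on <code>if not content</code>. Finally, time.sleep() paces the loop so that requests arrive at roughly the speed of a person clicking through pages, which helps avoid triggering anti-scraping measures.</p>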
<pre><code>import urllib.parse
import time

def get_Start(url):
    global Tag
    for i in Tag:
        offsets = [n * 20 for n in range(0, 100)]
        for r in offsets:
            time.sleep(3)  # throttle requests
            values = {"type": "T", 'start': r}
            data = urllib.parse.urlencode(values)
            page_url = url + "tag/" + i + '?' + data
            content = get_Content(page_url)
            if not content:
                break  # an empty page means this tag has no more results
            else:
                get_items(content)
</code></pre>
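<p>The URL construction can also be separated from the fetching, which makes it easy to inspect the generated URLs before sending any requests. This is only a sketch: <code>build_tag_urls</code> and its <code>pages</code> and <code>page_size</code> parameters are hypothetical names, and percent-encoding the tag with <code>urllib.parse.quote</code> is an addition the original code does not make.</p>

```python
import urllib.parse

def build_tag_urls(base_url, tag, pages=100, page_size=20):
    """Yield one listing URL per page for a tag, with start = 0, 20, 40, ..."""
    # Percent-encode the tag so non-ASCII names are safe in the URL path.
    tag_path = urllib.parse.quote(tag)
    for page in range(pages):
        query = urllib.parse.urlencode({"start": page * page_size, "type": "T"})
        yield base_url + "tag/" + tag_path + "?" + query

urls = list(build_tag_urls("https://book.douban.com/", "哲學", pages=3))
for u in urls:
    print(u)
```

<p>Because the function is a generator, a caller can stop iterating as soon as a page comes back empty, just as the loop above does with <code>break</code>.</p>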

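<p>There is also code for collecting the Tag values. Since I did not want to scrape every book, I simply edit Tag by hand to scrape the categories I am interested in:</p>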
<pre><code># def get_Tab(url):
#     r = session.get(url + "tag/", headers=headers)
#     bsObj = BeautifulSoup(r.content, 'lxml')
#     Tag_contents = bsObj.findAll("a", href=re.compile("^/tag/.*"))
#     Tag = [tag.get_text() for tag in Tag_contents]
#     return Tag
Tag = ['哲學']
</code></pre>

<h3>(3) Storing the scraped data</h3>
<blockquote>
<p>You need MySQL installed, plus some basic knowledge of databases and SQL statements.</p>
</blockquote>
<p>1. First, import the pymysql library and set up the connection. The database and its tables must already exist in MySQL.</p>
<pre><code>import pymysql
conn = pymysql.connect(host='localhost',user='root',passwd='lym*',db='mysql',charset='utf8')
cur = conn.cursor()
cur.execute("USE douban")
</code></pre>

<p>2. Call the store() function to save each record.</p>
<pre><code>def store(Tag, title, tran, rating, pl):
    # Parameterized query: pymysql escapes the values
    cur.execute("insert into philosophy(Tag,book,content,point,comment_num) values(%s,%s,%s,%s,%s)", (Tag, title, tran, rating, pl))
    cur.connection.commit()
</code></pre>
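<p>To try the storage step without a MySQL server, the same parameterized-insert pattern can be sketched with Python's built-in sqlite3 module. This swaps in SQLite for MySQL purely for illustration: the column types, the <code>store_many</code> helper, and the sample rows are assumptions, and sqlite3 uses <code>?</code> placeholders where pymysql uses <code>%s</code>.</p>

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE philosophy ("
    "Tag TEXT, book TEXT, content TEXT, point REAL, comment_num INTEGER)"
)

def store_many(rows):
    """Insert several (Tag, book, content, point, comment_num) rows, commit once."""
    cur.executemany(
        "INSERT INTO philosophy (Tag, book, content, point, comment_num) "
        "VALUES (?, ?, ?, ?, ?)",  # sqlite3 uses ? where pymysql uses %s
        rows,
    )
    conn.commit()

# Made-up sample rows, not real scraped data.
store_many([
    ("哲學", "Sample Book A", "Author A / Publisher A / 2016", 8.9, 4524),
    ("哲學", "Sample Book B", "Author B / Publisher B / 2015", 0.0, 10),
])
cur.execute("SELECT COUNT(*) FROM philosophy")
row_count = cur.fetchone()[0]
print(row_count)
```

<p>Committing once per batch rather than once per row reduces the number of commits, at the cost of losing the uncommitted batch if the scraper crashes partway through a page.</p>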

<h3>Summary: the whole flow is to import the libraries, set up the initial headers and the starting URL, then call get_Start(url) to build the URLs to scrape, which calls get_items() to extract the page content, cleans the data with get_cleandata(), and finally calls store() to save what was scraped.</h3>
<h3>Next time: multithreading</h3>


Scraping Douban book lists with Python
