Chapter 1
Main content: send a GET request to a specific web page, receive the HTML, and do some simple data extraction.
- A small experiment
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
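Python 2 has both urllib and urllib2. In Python 3, urllib2 was merged into urllib, which is split into several submodules, including urllib.request, urllib.parse, urllib.error, and urllib.robotparser. See the official urllib documentation.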
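To get a feel for how the Python 3 submodules divide the work, here is a minimal sketch that uses only urllib.parse and urllib.error, so it runs without network access. The relative link "page2.html" is just a sample value, not something taken from the page.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse

# urllib.parse splits a URL into its components.
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# urljoin resolves a relative link against the page it was found on.
absolute = urljoin("http://pythonscraping.com/pages/page1.html", "page2.html")
print(absolute)       # http://pythonscraping.com/pages/page2.html

# urllib.error holds the exceptions raised by urllib.request;
# HTTPError is a subclass of URLError.
print(issubclass(HTTPError, URLError))  # True
```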
- Introduction to BeautifulSoup
BeautifulSoup converts HTML into an easily traversed Python object that represents the document's XML-like structure.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly to avoid a warning
print(bsObj.h1)
The structure of the page is shown in the figure below:

Final output:
<h1>An Interesting Title</h1>
Structure as parsed by BeautifulSoup:
html → <html><head>...</head><body>...</body></html>
— head → <head><title>A Useful Page</title></head>
— title → <title>A Useful Page</title>
— body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
— h1 → <h1>An Interesting Title</h1>
— div → <div>Lorem Ipsum dolor...</div>
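Although h1 is nested two levels deep, the syntax above still finds it: bsObj.h1 returns the first h1 anywhere below the root, searching the descendants recursively. Any of the following also finds the same h1: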
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
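- Making the connection robust
The statement below can fail in two ways:
1. The server cannot be found.
2. The page cannot be found on the server, or the server fails while handling the request ("404 Page Not Found", "500 Internal Server Error").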
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
BeautifulSoup can also be asked for a tag that does not exist, so the code needs to be modified as follows:
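from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server answered with an error status such as 404 or 500
        return None
    except URLError as e:
        # The server could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so .h1 could not be looked up
        return None
    return title

def main():
    title = getTitle("http://www.pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)

main()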
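The two failure cases can be told apart by exception type. The sketch below is one way to do that, not the book's code: fetch and its opener parameter are hypothetical names, and opener exists only so the function can be exercised with stand-ins instead of real network requests. HTTPError has to be caught before URLError because it is a subclass of it.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `fetch` and `opener` are illustrative names; `opener` defaults to
    urlopen and can be swapped for a stub when testing offline.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # The server could not be reached at all (bad host, no connection).
        print("Server not reachable:", e.reason)
        return None
    return response.read()


def fake_missing_page(url):
    # Stand-in for urlopen that simulates a 404 response.
    raise HTTPError(url, 404, "Not Found", None, io.BytesIO())


def fake_unknown_host(url):
    # Stand-in for urlopen that simulates an unreachable server.
    raise URLError("name resolution failed")


def fake_ok(url):
    # Stand-in for urlopen that returns a readable body.
    return io.BytesIO(b"<h1>ok</h1>")


print(fetch("http://example.invalid/missing", opener=fake_missing_page))  # None
print(fetch("http://example.invalid/", opener=fake_unknown_host))         # None
print(fetch("http://example.invalid/", opener=fake_ok))                   # b'<h1>ok</h1>'
```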
Chapter 2 - Advanced HTML parsing
- Main content: parse HTML to get the information we want
The code below extracts every tag on the page whose class attribute is green. The findAll usage here is bsObj.findAll(tagName, tagAttributes).
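from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        # Every <span class="green"> on the page
        nameList = bsObj.findAll("span", {"class": "green"})
        for name in nameList:
            print(name.get_text())
    except AttributeError as e:
        return

# Call the function only after it has been defined
getTitle("http://www.pythonscraping.com/pages/warandpeace.html")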
- find() and findAll()
findAll(tag, attributes, recursive=True, text, limit, keywords)
find(tag, attributes, recursive=True, text, keywords)
You can pass several tag names at once:
.findAll({"h1","h2","h3","h4","h5","h6"})
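You can also accept several values for one attribute. A dict cannot hold the key "class" twice, so the values go in a list:
.findAll("span", {"class": ["green", "red"]})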
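The recursive parameter is a boolean: with recursive=True (the default) findAll searches all descendants; with recursive=False it looks only at direct children.
text matches on the text content of tags rather than on their names or attributes. For example, to count how many times the exact string 'the prince' appears as the content of a tag, use the following code: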
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
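limit caps the number of results. find() is equivalent to findAll() with limit=1, except that find() returns the single match itself rather than a list.
keyword arguments select tags that have a specific attribute value, for example: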
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
The following two calls are equivalent:
bsObj.findAll(id="text")
bsObj.findAll("", {"id":"text"})
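The arguments described above can be tried on a small in-memory page. This is only a sketch: the HTML is hand-written sample data, and it assumes bs4 is installed. It keeps the findAll and text= spellings used in these notes; as far as I know, newer BeautifulSoup releases prefer find_all and string= for the same things.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (sample data, not from the book).
html = """
<html><body>
  <h1>Title</h1>
  <div id="text">
    <span class="green">Anna</span>
    <span class="red">the prince</span>
    <p><span class="green">Pierre</span></p>
  </div>
  <h2>Subtitle</h2>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Several tag names at once.
headings = [tag.name for tag in soup.findAll(["h1", "h2"])]
print(headings)  # ['h1', 'h2']

# One attribute, several accepted values.
colored = [s.get_text() for s in soup.findAll("span", {"class": ["green", "red"]})]
print(colored)  # ['Anna', 'the prince', 'Pierre']

# recursive=False looks only at direct children of the tag being searched.
div = soup.find("div", {"id": "text"})
all_spans = div.findAll("span")
direct_spans = div.findAll("span", recursive=False)
print(len(all_spans), len(direct_spans))  # 3 2

# text= matches strings whose whole content equals the given value.
prince = soup.findAll(text="the prince")
print(len(prince))  # 1

# limit caps the number of results; find() returns the first match itself.
first_only = soup.findAll("span", limit=1)
print(len(first_only), soup.find("span").get_text())  # 1 Anna

# A keyword argument filters on the attribute of the same name.
by_id = soup.findAll(id="text")
print(by_id[0].name)  # div
```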
Personal practice exercise: get all the titles in Liao Xuefeng's tutorial
- Navigating the tree
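Sometimes we need to reach other tags based on their position relative to a known tag.
A site like this one has roughly the following tag structure:
Children and descendants, siblings, parents, ...
A child is exactly one level below its parent; a child's children are descendants too, but not children.
- Regular expressions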
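|: alternation, matches either the expression on its left or the one on its right
?: zero or one occurrence
*: zero or more occurrences
+: one or more occurrences
\: escape character, makes the next character literal
[]: character class, such as a range of characters
(): grouping, which also controls precedence
See p. 26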
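The symbols above can be checked with Python's re module. The sketch below uses made-up test strings; the last two checks combine the same pieces into a simple email address pattern.

```python
import re

# * : zero or more repetitions
star_empty = bool(re.fullmatch(r"a*b", "b"))
star_many = bool(re.fullmatch(r"a*b", "aaab"))
print(star_empty, star_many)  # True True

# + : one or more repetitions, so a bare "b" no longer matches
plus_empty = bool(re.fullmatch(r"a+b", "b"))
print(plus_empty)  # False

# ? : zero or one occurrence
optional_u = bool(re.fullmatch(r"colou?r", "color"))
print(optional_u)  # True

# | : alternation, () : grouping
either = bool(re.fullmatch(r"(com|org)", "org"))
print(either)  # True

# [] : character class, \ : escape (here a literal dot)
literal_dot = bool(re.fullmatch(r"[a-c]+\.txt", "abc.txt"))
not_a_dot = bool(re.fullmatch(r"[a-c]+\.txt", "abcxtxt"))
print(literal_dot, not_a_dot)  # True False

# The same pieces combined into a simple address pattern.
email = re.compile(r"[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)")
good = bool(email.fullmatch("jane.doe+news@example.org"))
bad = bool(email.fullmatch("jane@example.io"))
print(good, bad)  # True False
```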
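A regular expression for simple email addresses ending in one of four top-level domains: [A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)
- Regular expressions and BeautifulSoup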
How do we find the URLs of these product images? The obvious answer is findAll('img'), but that also returns the logo and other unrelated images.
image["src"] returns the value of the src attribute of the tag.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def findPictureURI():
    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only <img> tags whose src matches the product-image path pattern
    images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
    for image in images:
        print(image["src"])

findPictureURI()
Output:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
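To see why the pattern keeps these paths and drops the rest, it can be applied to plain strings without fetching the page. In this sketch the first five values are the paths listed above; the last two are hypothetical src values added for contrast. It uses pattern.search, which is how I understand BeautifulSoup to apply a compiled pattern to an attribute value.

```python
import re

srcs = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img2.jpg",
    "../img/gifts/img3.jpg",
    "../img/gifts/img4.jpg",
    "../img/gifts/img6.jpg",
    "../img/gifts/logo.jpg",   # hypothetical: right folder, wrong file name
    "../img/gifts/img3.png",   # hypothetical: right name, wrong extension
]

# Dots are escaped so they match literally; "img.*" allows any suffix
# between the "img" prefix and the ".jpg" extension.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

matches = [s for s in srcs if pattern.search(s)]
rejected = [s for s in srcs if not pattern.search(s)]
print(matches)
print(rejected)  # ['../img/gifts/logo.jpg', '../img/gifts/img3.png']
```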
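- Lambda expressions
In the example below, BeautifulSoup lets us pass a lambda expression as the filter. It must take a tag as its only argument and return a boolean. A lambda can be used in place of a regular expression whenever that reads better.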
soup.findAll(lambda tag: len(tag.attrs) == 2)
Results it might return:
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
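A minimal sketch of the same idea on an in-memory snippet, assuming bs4 is installed. The HTML is sample data: the first two tags are the ones shown above, and the others are added so that the filter has something to reject.

```python
from bs4 import BeautifulSoup

html = """
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
<p class="note"></p>
<a href="/wiki/Chaozhou">Chaozhou</a>
<a name="top">Top</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The filter function is called once per tag and must return True or False.
two_attrs = [tag.name for tag in soup.findAll(lambda tag: len(tag.attrs) == 2)]
print(two_attrs)  # ['div', 'span']

# Any boolean test on the tag works, for example "is an <a> with an href".
links = soup.findAll(lambda tag: tag.name == "a" and "href" in tag.attrs)
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/wiki/Chaozhou']
```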
Chapter 4 - Starting to "crawl"
Main content: start solving a practical problem. What is the smallest number of clicks needed to get from one celebrity's wiki page to another's? I picked Li Ka-shing and Jay Chou.
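Once the links between pages are known, the "fewest clicks" question is a shortest-path search. The sketch below shows the idea with breadth-first search on a small invented link graph; the page names, the GRAPH dict, and min_clicks are all hypothetical, and a real run would build the graph by crawling.

```python
from collections import deque

# Invented link graph: page -> pages it links to.
GRAPH = {
    "Start": ["A", "B"],
    "A": ["C", "Start"],
    "B": [],
    "C": ["Target"],
    "Target": [],
}


def min_clicks(graph, start, target):
    """Return the fewest clicks from start to target, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, clicks = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return clicks + 1
            if nxt not in seen:
                # Mark on enqueue so each page is expanded at most once.
                seen.add(nxt)
                queue.append((nxt, clicks + 1))
    return None


print(min_clicks(GRAPH, "Start", "Target"))  # 3
print(min_clicks(GRAPH, "Target", "Start"))  # None
```

Breadth-first order matters here: pages are expanded in order of distance from the start, so the first time the target shows up, the click count is already minimal.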
- First, collect every link on a wiki page
The book's implementation:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# retrieves an arbitrary Wikipedia page and produces a list of links on that page
def getlinks(url):
    html = urlopen(url)
    bsobj = BeautifulSoup(html, "html.parser")
    for link in bsobj.findAll('a'):
        if 'href' in link.attrs:
            print(link['href'])
I wrote another version using a lambda expression; the result is the same:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# retrieves an arbitrary Wikipedia page and produces a list of links on that page
def getlinks(url):
    html = urlopen(url)
    bsobj = BeautifulSoup(html, 'lxml')
    # Keep only <a> tags that actually carry an href attribute
    for link in bsobj.findAll(lambda tag: tag.name == 'a' and 'href' in tag.attrs):
        print(link['href'])
The output shows a problem:
Some links are ones we do not want:
//wikimediafoundation.org/wiki/Privacy_policy
//en.wikipedia.org/wiki/Wikipedia:Contact_us
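The author noticed that all the links we want sit inside a tag with the attribute id=bodyContent:

We add this restriction first. The book's code below also applies the href pattern that is explained further down: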
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Restricting the search to bodyContent alone still leaves links we do not want. They either do not start with /wiki/ or contain a colon:
#cite_ref-1
#cite_ref-2
#cite_ref-KBE_3-0
#cite_ref-KBE_3-1
http://www.nndb.com/people/977/000161494/
https://www.forbes.com/profile/li-ka-shing/
http://www.london-gazette.co.uk/issues/55879/supplements/24
/wiki/Category:Businesspeople_from_Guangdong
/wiki/Category:Canadian_Imperial_Bank_of_Commerce_people
/wiki/Category:Cheung_Kong_Holdings
/wiki/Category:CK_Hutchison_Holdings
/wiki/Category:Commandeurs_of_the_L%C3%A9gion_d%27honneur
/wiki/Category:Hong_Kong_Affairs_Advisors
/wiki/Category:Hong_Kong_billionaires
/wiki/Category:Hong_Kong_BLDC_members
/wiki/Category:Hong_Kong_Buddhists
/wiki/Category:Hong_Kong_chairmen_of_corporations
/wiki/Category:Hong_Kong_emigrants_to_Canada
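Further improvement: keep only links that 1. contain no colon and 2. start with /wiki/.
^(/wiki/)((?!:).)*$ means: the string starts with /wiki/, and every character after that, up to the end of the string, is not ':'.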
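The pattern can be checked against plain strings before it is used in a crawler. In this sketch the sample hrefs are taken from the link lists in these notes; article_link is just an illustrative name.

```python
import re

# ^(/wiki/)   the href must begin with /wiki/
# ((?!:).)*   then any characters, each checked by a negative lookahead
#             so that none of them is a colon
# $           up to the end of the string
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

hrefs = [
    "/wiki/Chaozhou",
    "/wiki/Category:Hong_Kong_billionaires",
    "#cite_ref-1",
    "http://www.nndb.com/people/977/000161494/",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Guangdong_Romanization#Hakka",
]

kept = [h for h in hrefs if article_link.search(h)]
print(kept)  # ['/wiki/Chaozhou', '/wiki/Guangdong_Romanization#Hakka']
```

Note that a fragment such as #Hakka still passes, because the pattern only excludes colons.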
# retrieves an arbitrary Wikipedia page and produces a list of links on that page
def getlinks(url):
    html = urlopen(url)
    bsobj = BeautifulSoup(html, 'lxml')
    content_in_body = bsobj.find('div', {'id': 'bodyContent'})
    for link in content_in_body.findAll('a', href=re.compile('^(/wiki/)((?!:).)*$')):
        if 'href' in link.attrs:
            print(link['href'])
This is much cleaner:
/wiki/Martin_Lee_Ka_Shing
/wiki/Chinese_name
/wiki/Chinese_surname
/wiki/Li_(surname_%E6%9D%8E)
/wiki/The_Honourable
/wiki/Grand_Bauhinia_Medal
/wiki/Order_of_the_British_Empire
/wiki/Justice_of_the_Peace
/wiki/Chaozhou
/wiki/Guangdong
/wiki/Republic_of_China_(1912%E2%80%931949)
/wiki/High_school_dropout
/wiki/CK_Hutchison_Holdings
/wiki/Cheung_Kong_Property_Holdings
/wiki/Li_Ka_Shing_Foundation
/wiki/US$
/wiki/Victor_Li_Tzar-kuoi
/wiki/Richard_Li
/wiki/Justice_of_the_Peace
/wiki/Doctor_of_law
/wiki/Doctor_of_Social_Science
/wiki/Traditional_Chinese_characters
/wiki/Simplified_Chinese_characters
/wiki/Standard_Chinese
/wiki/Hanyu_Pinyin
/wiki/Wu_Chinese
/wiki/Hakka_Chinese
/wiki/Guangdong_Romanization#Hakka
/wiki/Cantonese
/wiki/Jyutping
/wiki/Southern_Min
/wiki/Teochew_dialect
/wiki/Guangdong_Romanization#Teochew
/wiki/Chinese_language
/wiki/Specials_(Unicode_block)#Replacement_character
/wiki/Chinese_characters
/wiki/Grand_Bauhinia_Medal
/wiki/Order_of_the_British_Empire
/wiki/Justice_of_the_Peace
/wiki/Chaozhou
/wiki/Hong_Kong
/wiki/Chairman
/wiki/CK_Hutchison_Holdings
/wiki/Cheung_Kong_Holdings
/wiki/Hong_Kong_Stock_Exchange
/wiki/Forbes_family
/wiki/No_frills
/wiki/Seiko
/wiki/Deep_Water_Bay
/wiki/Hong_Kong_Island
/wiki/Superman
/wiki/Chaozhou
/wiki/Teochew_people
/wiki/Hong_Kong_Stock_Exchange
/wiki/Hongkong_Electric_Company
/wiki/Harvard_Business_School
/wiki/Artificial_flower
/wiki/Hong_Kong_1967_Leftist_Riots
/wiki/Cheung_Kong
/wiki/Yangtze_River
/wiki/Cheung_Kong_Holdings
/wiki/Hong_Kong_Stock_Exchange
/wiki/Hongkong_Land
/wiki/Cheung_Kong
/wiki/British_overseas_territory
/wiki/Bermuda
/wiki/HSBC
/wiki/Hutchison_Whampoa
/wiki/Cheung_Kong_Holdings
/wiki/Port_of_Hong_Kong
/wiki/Deltaport
/wiki/People%27s_Republic_of_China
/wiki/United_Kingdom
/wiki/Rotterdam
/wiki/Panama
/wiki/Bahamas
/wiki/Developing_countries
/wiki/A.S._Watson_Group
/wiki/Superdrug
/wiki/Kruidvat
/wiki/Watson%27s
/wiki/CK_Hutchison_Holdings
/wiki/Orange_SA
/wiki/Hutchison_Telecommunications
/wiki/Hutchison_Essar
/wiki/Vodafone
/wiki/Horizons_Ventures
/wiki/DoubleTwist
/wiki/Li_Ka_Shing_Foundation
/wiki/Facebook
/wiki/Spotify
/wiki/Siri_Inc.
/wiki/Nick_D%27Aloisio
/wiki/Bitcoin
/wiki/Australian_Tax_Office
/wiki/Cheung_Kong_Holdings
/wiki/SA_Power_Networks
/wiki/Cheung_Kong_Property_Holdings
/wiki/CK_Hutchison_Holdings
/wiki/Canadian_Imperial_Bank_of_Commerce
/wiki/Husky_Energy
/wiki/Alberta
/wiki/Canadian_dollar
/wiki/Canadian_Imperial_Bank_of_Commerce
/wiki/Li_Ka_Shing_Foundation
/wiki/Toronto
/wiki/The_Hongkong_and_Shanghai_Banking_Corporation
/wiki/HSBC_Holdings
/wiki/Victor_Li_Tzar-kuoi
/wiki/Richard_Li
/wiki/CK_Hutchison_Holdings
/wiki/PCCW
/wiki/Citizen_Watch_Co.
/wiki/Seiko
/wiki/Raymond_Chow
/wiki/Grand_Bauhinia_Medal
/wiki/Order_of_the_British_Empire
/wiki/L%C3%A9gion_d%27honneur
/wiki/Hutchison_Whampoa
/wiki/Henry_Tang
/wiki/Hong_Kong_Chief_Executive_election,_2012
/wiki/2014_Hong_Kong_protests
/wiki/2014%E2%80%9315_Hong_Kong_electoral_reform
/wiki/Shantou_University
/wiki/Chaozhou
/wiki/Guangdong_Technion-Israel_Institute_of_Technology
/wiki/Technion_%E2%80%93_Israel_Institute_of_Technology
/wiki/Shantou_University
/wiki/Hong_Kong_Polytechnic_University
/wiki/Cambridge
/wiki/Cancer_Research_UK
/wiki/Oncology
/wiki/Singapore_Management_University
/wiki/2004_Indian_Ocean_earthquake_and_tsunami
/wiki/University_of_Hong_Kong
/wiki/Li_Ka_Shing_Faculty_of_Medicine
/wiki/Kwok_Ka_Ki
/wiki/University_of_California,_Berkeley
/wiki/University_of_California,_Berkeley
/wiki/University_of_California,_San_Francisco
/wiki/University_of_California,_Berkeley
/wiki/Jennifer_Doudna
/wiki/Stanford_University
/wiki/Stanford_University_School_of_Medicine
/wiki/National_University_of_Singapore
/wiki/St._Michael%27s_Hospital,_Toronto
/wiki/University_of_Alberta
/wiki/2008_Sichuan_earthquake
/wiki/McGill_University
/wiki/Shantou_University
/wiki/University_of_California,_San_Francisco
/wiki/Tsz_Shan_Monastery
/wiki/The_Hongs
/wiki/List_of_Hong_Kong_people_by_net_worth
/wiki/Li%27s_field
/wiki/Wayback_Machine
/wiki/Techcrunch
/wiki/Horizons_Ventures
/wiki/Techcrunch
/wiki/Wayback_Machine
/wiki/Wayback_Machine
/wiki/UTC
/wiki/Wayback_Machine
/wiki/Cheung_Kong_Holdings
/wiki/Jao_Tsung-I
/wiki/Hong_Kong_order_of_precedence
/wiki/Yeung_Kwong
/wiki/Virtual_International_Authority_File
/wiki/Library_of_Congress_Control_Number
/wiki/Integrated_Authority_File
/wiki/Syst%C3%A8me_universitaire_de_documentation
Final code: still rough. It just walks at random from page to page; this will be developed further later.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import random
import re

def getlinks(url):
    html = urlopen("http://en.wikipedia.org" + url)
    bsobj = BeautifulSoup(html, 'lxml')
    content_in_body = bsobj.find('div', {'id': 'bodyContent'})
    return content_in_body.findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))

def main():
    # Seed from system entropy; passing a datetime object to seed() fails on Python 3.11+
    random.seed()
    links = getlinks('/wiki/Li_Ka-shing')
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(newArticle)
        links = getlinks(newArticle)

main()
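The random walk itself can be tried without any network access. This is a sketch under stated assumptions: LINKS is an invented link graph standing in for getlinks, and random_walk, max_steps, and rng are hypothetical names. Unlike the loop above, it stops after a fixed number of steps, since a walk over real pages may never reach a page without links.

```python
import random

# Invented link graph standing in for getlinks(): page -> linked pages.
LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],
}


def random_walk(start, links, max_steps=10, rng=random):
    """Follow randomly chosen links from start; return the pages visited."""
    path = [start]
    current = start
    for _ in range(max_steps):
        candidates = links.get(current, [])
        if not candidates:
            # Dead end: the current page has no outgoing links.
            break
        current = rng.choice(candidates)
        path.append(current)
    return path


# A private Random instance keeps the walk reproducible for a given seed.
walk = random_walk("/wiki/A", LINKS, rng=random.Random(0))
print(walk)
```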
Benefits of crawling an entire site:
1. You can generate a site map
2. You can collect data
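- Avoiding duplicate pages
The code below crawls every page of the wiki. It starts at the home page, follows a link to a new page, and then calls itself recursively on that page, much like DFS. The pages set records links that have already been seen, so no page is crawled twice.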
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
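Because each new page adds a level of recursion, a crawl like this can run into Python's recursion limit on a large site. One alternative, sketched below under stated assumptions, keeps the same seen-set for de-duplication but replaces recursion with an explicit queue. SITE is an invented in-memory site, and crawl and get_links are hypothetical names; a real get_links would fetch and parse the page.

```python
from collections import deque

# Invented site used instead of live requests: page -> links on that page.
SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [""],
}


def crawl(start, get_links):
    """Visit every page reachable from start exactly once; return visit order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()  # popleft gives breadth-first order
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                # Record the link before queueing it so it is never queued twice.
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("", lambda page: SITE.get(page, []))
print(visited)  # ['', '/wiki/A', '/wiki/B', '/wiki/C']
```

Using queue.pop() instead of queue.popleft() would turn the same loop into a depth-first crawl.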
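- Observing patterns in wiki pages
1. The page title is the content of the h1 tag
2. The first paragraph is inside the div with id=mw-content-text
3. The edit link is under <li id="ca-edit"><span>
Based on these observations, the following code collects a few key items from each wiki page: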
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Open issues:
We still need to store this data in a database; Chapter 5 covers that.
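The code above stays on en.wikipedia.org; a crawler can also be made to follow links to sites outside the wiki.
Some questions to settle first:
1. What data should the crawler collect? Data from a few specific sites, or from every site it reaches?
2. DFS or BFS?
3. Under what conditions should it skip a site? Non-English sites, for example?
4. Are there rules it needs to follow?
There are also several programs that crawl external links; see here.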
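One building block for a crawler that may leave its starting site is telling internal links from external ones. Here is a minimal sketch using urllib.parse; classify is a hypothetical helper, and the sample hrefs are taken from the link lists earlier in these notes.

```python
from urllib.parse import urljoin, urlparse


def classify(page_url, href):
    """Resolve href against page_url and label it internal or external."""
    absolute = urljoin(page_url, href)
    # Same host as the page the link was found on counts as internal.
    same_site = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute, ("internal" if same_site else "external")


page = "http://en.wikipedia.org/wiki/Li_Ka-shing"
samples = [
    "/wiki/Chaozhou",
    "#cite_ref-1",
    "//wikimediafoundation.org/wiki/Privacy_policy",
    "http://www.nndb.com/people/977/000161494/",
]
results = [classify(page, href) for href in samples]
for absolute, kind in results:
    print(kind, absolute)
```

Comparing host names exactly is a deliberate simplification; whether subdomains or other language editions count as the same site is one of the decisions listed above.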
The chapter ends with a brief, rather thin introduction to Scrapy; the author recommends reading the official documentation.
