
Python Web Scraping: Using Beautiful Soup

These notes are based on Cui Qingcai's personal blog, Jingmi (靜覓).
Original article: http://cuiqingcai.com/1319.html

0. Introduction to Beautiful Soup and environment setup

Beautiful Soup is a Python library whose main job is extracting data from web pages, so it can be used to build a crawler.

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

After downloading, extract the archive to disk, open a command line, change into the extracted folder, and run python setup.py install to install it.
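To confirm that the installation worked, a quick check along these lines can be run. It is only a sketch: it assumes the package imports under the name bs4 (as in the rest of this post) and uses Python's built-in "html.parser" so that no extra parser has to be installed.

```python
# Sanity check: import the package and report which version is installed.
import bs4
from bs4 import BeautifulSoup

version = bs4.__version__
print("Beautiful Soup version:", version)

# Parse a trivial document with the built-in parser to confirm it works.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails, the install step above did not complete for the interpreter being used.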

Beautiful Soup documentation (Chinese):

http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

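1. Creating a Beautiful Soup object

Import the class from the bs4 library: from bs4 import BeautifulSoup

Then create a BeautifulSoup object with soup = BeautifulSoup(html). The argument is either the text of a web page or a file-like object, such as the result of open() or urlopen(). Current versions also expect the parser to be named as a second argument (without it, Beautiful Soup emits a warning and picks a parser itself); this post uses 'lxml', which requires the lxml package to be installed. The full form is therefore: soup = BeautifulSoup(urllib2.urlopen('http://www.baidu.com').read(), 'lxml'). Note that urllib2 is the Python 2 module; in Python 3 the equivalent call is urllib.request.urlopen.

Next, print the contents of the soup object in indented form: print(soup.prettify())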
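The two kinds of input described above (markup text and a file-like object) can be sketched without any network access. The markup string below is made up for illustration, io.StringIO stands in for a real file or HTTP response, and "html.parser" is used instead of 'lxml' only to avoid the extra dependency.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1) Build the soup from a string of markup.
soup_from_text = BeautifulSoup(markup, "html.parser")

# 2) Build the soup from a file-like object (anything with a .read() method).
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

title_from_text = soup_from_text.title.string
title_from_file = soup_from_file.title.string
print(title_from_text, title_from_file)

# prettify() returns the parsed tree as an indented string.
pretty = soup_from_text.prettify()
print(pretty)
```

Both objects end up holding the same tree, so the choice between them is mostly a matter of where the HTML comes from.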

2. The four kinds of objects

Beautiful Soup converts an HTML document into a tree structure in which every node is a Python object. There are four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

An example makes this concrete:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a  class="sister" id="link1"><!-- Elsie --></a>,
<a  class="sister" id="link2">Lacie</a> and
<a  class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# The title tag and its contents
print(soup.title)
# The head tag and its contents
print(soup.head)
# soup.<tag name> returns the first tag with that name
# A Tag object has two important attributes: name and attrs
print(soup.head.name)
# Output: head
print(soup.p.attrs)
# Output is a dict: {'class': ['title'], 'name': 'dromouse'}
# Read a single attribute
print(soup.p['class'])
print(soup.p.get('class'))
# Modify an attribute
soup.p['class'] = 'newClass'
# Delete an attribute
del soup.p['class']
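The attribute operations above behave much like dictionary operations. The following sketch works on a one-tag document invented for illustration and also shows has_attr() and the default argument of get(), neither of which appears in the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, parsed with the built-in parser.
soup = BeautifulSoup('<p class="title" name="dromouse"><b>Hi</b></p>', "html.parser")
tag = soup.p

# class can hold several values in HTML, so it comes back as a list.
classes = tag["class"]
name_attr = tag["name"]

# tag["id"] would raise KeyError here; get() returns None or a chosen default.
missing = tag.get("id")
fallback = tag.get("id", "no-id")

# Assignment and deletion work as on a dict.
tag["class"] = "newClass"
changed = tag["class"]
del tag["class"]
has_class = tag.has_attr("class")

print(classes, name_attr, missing, fallback, changed, has_class)
```

Using get() is the safer choice when an attribute may be absent, since indexing raises an exception in that case.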

NavigableString

Use soup.p.string to get the text inside a tag; the result is a NavigableString.

print(soup.p.string)
# Output: The Dormouse's story
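One detail worth knowing is that .string only returns text when a tag leads to exactly one string; otherwise it is None. The sketch below illustrates this on a small made-up document and uses find_all() and get_text(), which are not covered in this post, to collect the tags and their combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first p wraps a single b, the second mixes text and tags.
soup = BeautifulSoup(
    "<p><b>The Dormouse's story</b></p><p>one <i>two</i> three</p>",
    "html.parser",
)
first, second = soup.find_all("p")

# p -> b -> one string, so .string reaches through the single child.
single = first.string

# Several children: .string cannot pick one and returns None.
mixed = second.string

# get_text() concatenates every string below the tag.
all_text = second.get_text()

print(single, mixed, all_text)
```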

BeautifulSoup

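This object represents the document as a whole. In most situations it can be treated as a Tag object; it is a special kind of Tag. For example:

print(type(soup.name))
# Output: <type 'unicode'> on Python 2 (<class 'str'> on Python 3)
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: an empty dict, {}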
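That the BeautifulSoup object is a special Tag can be checked directly with isinstance. This is a minimal sketch on an invented document, again using the built-in parser as an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document for illustration.
soup = BeautifulSoup("<html><body><p>x</p></body></html>", "html.parser")

# The BeautifulSoup class is a subclass of Tag.
is_tag = isinstance(soup, Tag)

# Its name is a fixed placeholder and it carries no attributes.
doc_name = soup.name
doc_attrs = soup.attrs

print(is_tag, doc_name, doc_attrs)
```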

Comment

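A Comment object is a special type of NavigableString. Printing soup.a.string outputs the comment's text without the comment markers, so it can be mistaken for ordinary text. Before printing, check whether the value is a bs4.element.Comment and only then carry on with other processing.

import bs4
# isinstance is the idiomatic way to test for a Comment
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)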
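A common use of this check is to skip commented-out text when collecting link labels. The sketch below assumes a cut-down version of the example markup and a list named visible, both introduced here for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Cut-down version of the example: one link holds a comment, one holds text.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

# Comment is a subclass of NavigableString.
comment_is_string = issubclass(Comment, NavigableString)

visible = []
for a in soup.find_all("a"):
    text = a.string
    # Skip strings that are really HTML comments.
    if isinstance(text, Comment):
        continue
    visible.append(str(text))

print(comment_is_string, visible)
```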

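3. Traversing the document tree

The .contents, .children, and .descendants attributes

Again, using the example above:

# .contents returns a tag's direct children as a list
print(soup.head.contents)
# Each list item is one direct child, i.e. an outermost tag together with everything inside it
# For soup.html.contents the list contains head and body (plus any whitespace strings between them)
# Items can be fetched by list index
print(soup.head.contents[0])

# .children returns an iterator over the same direct children
for child in soup.body.children:
    print(child)

# .descendants walks every node below a tag: children, grandchildren, and so on
for child in soup.descendants:
    print(child)
# Note how the tags are peeled back layer by layer: a tag is yielded first, then its own children in turn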
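The difference between the three attributes is easier to see on a tiny tree. The following sketch uses a made-up div and separates tags from strings with isinstance; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a div with two direct children, one of them nested.
soup = BeautifulSoup("<div><p>a<b>b</b></p><span>c</span></div>", "html.parser")
div = soup.div

# .contents: direct children only, as a list.
direct = [child.name for child in div.contents]

# .children: the same nodes, delivered lazily by an iterator.
same_as_children = list(div.children) == div.contents

# .descendants: every node below div, parents before their children.
nested = [node.name for node in div.descendants if isinstance(node, Tag)]
strings = [str(node) for node in div.descendants if not isinstance(node, Tag)]

print(direct, same_as_children, nested, strings)
```

Here b shows up only in the .descendants walk, because it is a grandchild of div rather than a direct child.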

To be continued.
