Python BeautifulSoup4

Introduction to bs4

Installation: pip install bs4, pip install lxml
Beautiful Soup is a Python library for extracting data from HTML or XML files.
Parsers

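Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents	Slow (pure Python, no external C extension required)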
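
The table above lists parsers that may or may not be installed on a given machine. A minimal sketch of picking one with a fallback, assuming `lxml` is preferred and that the built-in `html.parser` is acceptable otherwise (the helper name `make_soup` is hypothetical, not part of bs4):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello<b>world</b></p>")
print(soup.b.string)
```

Naming the parser explicitly, rather than leaving the second argument out, keeps the result the same across machines with different parsers installed.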

Using bs4

Import the package, parse the data with BeautifulSoup, then extract data according to the structure of the source.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a  class="sister" id="link1">Elsie</a>,
<a  class="sister" id="link2">Lacie</a> and
<a  class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Parse with BeautifulSoup: pass the HTML document and the name of the parser
soup = BeautifulSoup(html_doc, 'lxml')
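
Once the document is parsed, data can be pulled out by following its structure. A small sketch of that last step, using a shortened copy of the document above and assuming the built-in `html.parser` so that no extra install is needed:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title.string                         # text of the <title> tag
first_class = soup.p["class"]                     # class attribute of the first <p>
link_ids = [a["id"] for a in soup.find_all("a")]  # id of every <a> tag

print(title, first_class, link_ids)
```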

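The four kinds of bs4 objects

Tag object
- Two important properties: name and attributes
- tag.name gives the name of the node
- tag['class'] gives the value of the node's class attribute (returned as a list, because class can hold several values)
BeautifulSoup object
- The BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object
- The object you get after parsing HTML with bs4 is a BeautifulSoup object
NavigableString: a navigable string object
- Strings are usually contained inside tags. Beautiful Soup wraps the strings in a tag with the NavigableString class
- tag.string gives the NavigableString object contained in the node
Comment: a comment object, a special kind of NavigableString
- When a comment is a tag's only child, tag.string returns it as a Comment object; tag.prettify() prints it in its <!-- --> form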
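
The four object types can be seen side by side in a short sketch. The markup here is made up for illustration, and `html.parser` is assumed:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = '<p class="title" id="intro"><b>Hello</b><i><!-- a note --></i></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

kinds = {
    "soup": type(soup).__name__,
    "tag": type(p).__name__,
    "string": type(p.b.string).__name__,
    "comment": type(p.i.string).__name__,
}
print(kinds)

# name and attributes of a Tag; class comes back as a list
print(p.name, p["class"], p["id"], p.attrs)

# A Comment is a NavigableString, and the soup itself behaves like a Tag
print(isinstance(p.i.string, NavigableString), isinstance(soup, Tag))
```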

Tag properties

.string returns the content of the current tag when it has a single string child (otherwise None)
.strings yields each string in turn when the tag contains several strings
.stripped_strings does the same, but strips surrounding whitespace and skips whitespace-only strings
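
A brief sketch of how the three properties differ, on made-up markup and assuming `html.parser`:

```python
from bs4 import BeautifulSoup

markup = "<div><p> first </p>\n<p>second</p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

single = soup.p.string                 # the first <p> has exactly one string child
multiple = div.string                  # None, because <div> has several children
all_strings = list(div.strings)        # every string, whitespace included
stripped = list(div.stripped_strings)  # trimmed, whitespace-only strings skipped

print(repr(single), multiple, all_strings, stripped)
```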

Nodes

Navigating nodes
Child nodes: contents, children; all descendants: descendants
- Nodes contained within a node
Parent nodes: nodes that contain this node
- parent: the node that directly contains this node
- parents: all ancestor nodes
Sibling nodes: nodes at the same level
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
- next_siblings: all following sibling nodes
- previous_siblings: all preceding sibling nodes
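
The navigation properties above can be tried on a tiny tree. This is a sketch on made-up markup with no whitespace between tags, so that every sibling is a tag rather than a newline string; `html.parser` is assumed:

```python
from bs4 import BeautifulSoup

markup = "<div><a id='one'>1</a><a id='two'>2</a><a id='three'>3</a></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div
two = soup.find(id="two")

children = [child["id"] for child in div.children]   # direct children only
descendant_count = len(list(div.descendants))         # 3 tags plus their 3 strings
parent_name = two.parent.name
ancestors = [node.name for node in two.parents]       # ends at the soup object
next_id = two.next_sibling["id"]
prev_id = two.previous_sibling["id"]
following = [node["id"] for node in two.next_siblings]
preceding = [node["id"] for node in two.previous_siblings]

print(children, descendant_count, parent_name, ancestors)
print(next_id, prev_id, following, preceding)
```

In real pages the whitespace between tags shows up as sibling strings, so next_sibling is often a newline rather than the next tag.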

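Searching the document tree

Filters
- String filter: pass a tag name, e.g. name='a'
- Regular expression filter: name=re.compile(r'[\w]{5}') selects the tags whose names match the compiled pattern
- List filter: name=['p', 'a'] matches any tag named in the list
- True filter: find(True) matches any tag
- Function filter: lambda tag: tag.has_attr('class') and not tag.has_attr('id') selects tags by their attributes; the function must return a bool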
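
Each filter type can be passed to find_all() as the name argument. A sketch on made-up markup, assuming `html.parser` (the helper `names` is only there to make the results easy to read):

```python
import re

from bs4 import BeautifulSoup

markup = (
    '<html><body><p class="story" id="p1">x</p>'
    '<p class="story">y</p><a id="l1">z</a><b>w</b></body></html>'
)
soup = BeautifulSoup(markup, "html.parser")


def names(tags):
    """Return the tag names of a result list."""
    return [tag.name for tag in tags]


by_string = names(soup.find_all("p"))               # exact tag name
by_regex = names(soup.find_all(re.compile(r"^b")))  # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))          # any name in the list
by_true = names(soup.find_all(True))                # every tag, no strings
by_func = soup.find_all(
    lambda tag: tag.has_attr("class") and not tag.has_attr("id")
)

print(by_string, by_regex, by_list, by_true, by_func)
```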
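find(name=None, attrs={}, recursive=True, text=None, **kwargs) returns the first matching tag, or None if nothing matches
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) returns all matching tags
- Parameters
  - name: any of the filters above
  - attrs={}: attributes passed as a dictionary
    - When attributes are passed as keyword arguments (attribute='str') instead, class must be written as class_
  - limit: maximum number of results to return (greater than 0); all results are returned by default
  - text=str: finds the NavigableStrings that match str and returns them as a list; a plain string must match exactly, so use a regular expression for partial matches (newer releases name this argument string)
  - kwargs: id='', class_=''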
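
The parameters above can be compared in one sketch. It uses a cut-down version of the three-sisters markup, assumes `html.parser`, and uses the newer `string=` spelling of the `text` argument, which assumes Beautiful Soup 4.4.0 or later:

```python
from bs4 import BeautifulSoup

markup = """
<p class="story">
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a")                                       # first match only
all_links = soup.find_all("a")                               # every match
by_attrs = soup.find_all("a", attrs={"id": "link2"})         # dictionary form
by_kwargs = soup.find_all("a", class_="sister", id="link3")  # keyword form
limited = soup.find_all("a", limit=2)                        # stop after two
direct = soup.find_all("a", recursive=False)                 # direct children only
texts = soup.find_all(string="Elsie")                        # exact string match
missing = soup.find("table")                                 # no match

print(first["id"], len(all_links), by_attrs, by_kwargs)
print(len(limited), direct, texts, missing)
```

The recursive=False call returns an empty list here because the a tags sit inside p, not directly under the soup object.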
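Other search methods
- find_parent(), find_parents(): search for the first matching ancestor and for all matching ancestors
- find_next_siblings(), find_next_sibling(): search for all following siblings and for the first following sibling
- find_previous_siblings(), find_previous_sibling(): search for all preceding siblings and for the first preceding sibling
- find_all_next(), find_next(): search the nodes that come after this one; the first returns every match, the second returns the first match
- find_all_previous(), find_previous(): search the nodes that come before this one; the first returns every match, the second returns the first match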
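
These methods take the same filters as find_all() but start from a given node. A sketch on made-up markup, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

markup = (
    "<div><p id='a'>A</p><p id='b'>B</p>"
    "<span id='c'>C</span><p id='d'>D</p></div>"
)
soup = BeautifulSoup(markup, "html.parser")
b = soup.find(id="b")

parent = b.find_parent("div").name
parents = [tag.name for tag in b.find_parents("div")]
next_p = b.find_next_sibling("p")["id"]               # skips the <span>
next_all = [tag["id"] for tag in b.find_next_siblings()]
prev_p = b.find_previous_sibling("p")["id"]
prev_all = [tag["id"] for tag in b.find_previous_siblings()]
after_first = b.find_next("span")["id"]
after_all = [tag["id"] for tag in b.find_all_next("p")]
before_first = b.find_previous("p")["id"]
before_all = [tag["id"] for tag in b.find_all_previous("p")]

print(parent, parents, next_p, next_all, prev_p, prev_all)
print(after_first, after_all, before_first, before_all)
```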

Modifying the document tree

Changing a tag's name and attribute values
- tag.name = str, tag['attr'] = str
Changing the string
- tag.string = str
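
A short sketch of these assignments on a one-tag document. The markup and the attribute values are made up, and `html.parser` is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="bold">old text</b>', "html.parser")
tag = soup.b

tag.name = "strong"        # rename the tag
tag["class"] = "emphasis"  # overwrite an existing attribute
tag["id"] = "headline"     # add a new attribute
del tag["class"]           # remove an attribute
tag.string = "new text"    # replace the tag's contents with one string

result = str(soup)
print(result)
```

Assigning to tag.string replaces everything inside the tag, including any child tags.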
Adding content to a tag
- tag.append(str)
- s = NavigableString(str), tag.append(s)
- Adding a comment
 - ```
 from bs4 import Comment
 new_comment = soup.new_string("Nice to see you.", Comment)
 tag.append(new_comment)
```
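- Creating a tag inside a tag and giving it attributes
 - ```
 from bs4 import BeautifulSoup

 # The source markup was lost in extraction; an empty <b> tag is assumed here
 soup = BeautifulSoup("<b></b>", "html.parser")
 original_tag = soup.b
 # The attribute was also lost; this href is a placeholder value
 new_tag = soup.new_tag("a", href="http://www.example.com")
 original_tag.append(new_tag)
 original_tag
 # <b><a href="http://www.example.com"></a></b>
 
 new_tag.string = "Link text."
 original_tag
 # <b><a href="http://www.example.com">Link text.</a></b>
```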
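- insert(position, content) inserts content at the given position
- insert_before() inserts content immediately before the current tag or text node
- insert_after() inserts content immediately after the current tag or text node
clear() removes the contents of the tag
extract() removes the current tag from the document tree and returns it
decompose() removes the current node from the document tree and destroys it completely
replace_with() removes a piece of content from the document tree and replaces it with a new tag or text node
wrap() wraps the element in the tag you specify and returns the new wrapper
unwrap() is the opposite of wrap(): it replaces the tag with its contents, removing the tag itself but keeping its children; it is often used to strip markup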
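
The methods above can be chained on one small tree to see what each does. A sketch on made-up markup, assuming `html.parser`; the comments show the tree after each step:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")
p = soup.p

removed = p.i.extract()          # <p><b>one</b><u>three</u></p>
p.u.decompose()                  # <p><b>one</b></p>
p.b.insert_after("!")            # <p><b>one</b>!</p>
p.insert(0, "zero ")             # <p>zero <b>one</b>!</p>

em = soup.new_tag("em")
em.string = "uno"
p.b.replace_with(em)             # <p>zero <em>uno</em>!</p>

p.em.wrap(soup.new_tag("span"))  # <p>zero <span><em>uno</em></span>!</p>
p.em.unwrap()                    # <p>zero <span>uno</span>!</p>
after_unwrap = str(soup)

p.span.clear()                   # <p>zero <span></span>!</p>
final = str(soup)

print(removed, after_unwrap, final)
```

extract() hands back the removed node so it can be reused elsewhere, while decompose() is the choice when the node is no longer needed.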
