Web Scraping, Lecture 5: BeautifulSoup, an HTML Parsing Library

BeautifulSoup

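BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers, so you can extract information from a page without writing regular expressions.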

Installing BeautifulSoup

pip3 install beautifulsoup4

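Using BeautifulSoup

  • Parsers

    Parser | Usage | Advantages | Disadvantages
    Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; lenient with malformed markup | Less lenient in versions before Python 2.7.3 and Python 3.2.2
    lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient with malformed markup | Requires an external C library
    lxml XML parser | BeautifulSoup(markup, "xml") | Very fast; the only XML parser supported | Requires an external C library
    html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires installing the html5lib package (pure Python, no C extension)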
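
Which of these parsers you can use depends on what is installed next to BeautifulSoup. As a minimal sketch (the helper name make_soup and the preference order are assumptions made for illustration, not part of the library), one way to prefer lxml and fall back to the built-in parser is:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised when the requested parser library is not installed;
            # move on to the next candidate.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.string)
```

Because html.parser ships with Python, the loop should always find a parser unless you pass a custom list without it.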

Basic usage

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.baidu.com').text
soup = BeautifulSoup(response,'lxml')
print(soup.prettify())  # prettify() pretty-prints the parse tree and closes any unclosed tags
print(soup.title.string)  # print the text of the <title> tag inside <head>
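
requests guesses the character encoding when you read .text, and a wrong guess produces garbled characters. BeautifulSoup also accepts raw bytes (for example response.content) and can work out the encoding itself. The sketch below uses a hard-coded byte string instead of a network call; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# UTF-8 encoded bytes, standing in for response.content from requests.
raw = "<html><head><title>你好</title></head></html>".encode("utf-8")

# from_encoding is optional: it names the encoding explicitly instead of
# leaving the choice to BeautifulSoup's auto-detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.title.string)
print(soup.original_encoding)
```

html.parser is used here only because it needs no extra installation; the same call works with 'lxml'.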

Tag selectors
Selecting elements

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # the <title> element, printed together with its tags
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # the <head> element
print(soup.p)            # only the first <p> found in the document
print(soup.p.name)       # the name of the tag, which here is simply 'p'
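
Two details of attribute-style access are easy to miss: it stops at the first matching tag, and it returns None when the tag does not exist. A small sketch with made-up markup (html.parser is chosen only because it ships with Python):

```python
from bs4 import BeautifulSoup

html = "<div><p class='title'>first</p><p class='story'>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p      # attribute access stops at the first <p>
missing = soup.table  # there is no <table>, so this is None

print(first_p.string)
print(missing)
```

Checking for None before chaining further (for example soup.table.tr) avoids an AttributeError on pages that lack the tag.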

Getting the name
See soup.p.name in the example above.

Getting attributes
This works somewhat like jQuery.

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# soup.p.attrs is the attribute dictionary of the first <p>:
# {'class': ['title'], 'name': 'dropmouse'}
print(soup.p.attrs['name'])  # value of the name attribute: dropmouse
print(soup.p['name'])        # also dropmouse; shorthand for the line above
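
Indexing a tag with a missing attribute raises KeyError, while get() returns None or a default. The sketch below also shows that class comes back as a list, because it is a multi-valued attribute. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dropmouse">x</p>', "html.parser")
p = soup.p

print(p["class"])            # multi-valued attribute: a list of class names
print(p.get("name"))         # same as p["name"] when the attribute exists
print(p.get("id"))           # missing attribute: None instead of an error
print(p.get("id", "no-id"))  # or a default of your choice

try:
    p["id"]
except KeyError:
    print("p has no id attribute")
```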

Getting content
For example, to get the text inside a p tag:

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # .string gives the text inside the selected tag, without any HTML markup

Nested selection
A bs4.element.Tag can itself be used to select the tags nested inside it. For example:

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.body.p.string)  # chained access, again much like jQuery

Children and descendants

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # a list of everything directly inside the first <p>, including newline strings
print(soup.p.string)    # None: this <p> has more than one child, so there is no single string to return
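
When .string is None because a tag has several children, get_text() and stripped_strings still give you the text. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<p>Names: <a>Lacie</a> and <a>Tille</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.string)                  # None, because <p> has several children
print(p.get_text())              # all text below <p>, concatenated
print(list(p.stripped_strings))  # each text fragment, whitespace-trimmed
```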

Another way to get the children

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1"><!---Elsa---></a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator over the direct children
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x7fda5c186c88>
0 Once upon a time there were three little sisters;and their names lll
    
1 <a class="sister" id="link1"><!---Elsa---></a>
2 
    
3 <a class="sister" id="link2">Lacie</a>
4  and
    
5 <a class="sister" id="link3">Tille</a>
6 ;
    and They lived at the bottom of a well.
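
As the output shows, children mixes text nodes with tags. If you only want the tags, one option is to filter by type. This is a sketch on shortened, made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>text <a id="link1">A</a> and <a id="link2">B</a>;</p>'
soup = BeautifulSoup(html, "html.parser")

# .children yields strings and tags alike; keep only the Tag objects.
child_tags = [child for child in soup.p.children if isinstance(child, Tag)]
ids = [tag["id"] for tag in child_tags]
print(ids)
```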

Descendants

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

This yields every descendant of the first p found, in document order:

<generator object descendants at 0x7f0b04eceaf0>
0 Once upon a time there were three little sisters;and their names lll
    
1 <a class="sister" id="link1">
        <span>Elsle</span>
    </a>
2 
        
3 <span>Elsle</span>
4 Elsle
5 
    
6 
    
7 <a class="sister" id="link2">Lacie</a>
8 Lacie
9  and
    
10 <a class="sister" id="link3">Tille</a>
11 Tille
12 ;
    and They lived at the bottom of a well.
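
To make the difference between the two properties concrete, the sketch below counts both on a tiny made-up fragment: children stops at the first level, while descendants keeps going down to the text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a><span>Elsle</span></a></p>", "html.parser")

children = list(soup.p.children)        # direct children only: the <a>
descendants = list(soup.p.descendants)  # <a>, then <span>, then the string

# Label each descendant by tag name, or "text" for a string node.
kinds = [node.name if isinstance(node, Tag) else "text" for node in descendants]
print(len(children), len(descendants))
print(kinds)
```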

Parents and ancestors

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output: the first a tag is located, then its parent, and the whole p tag is printed together with everything it contains.

<p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>

Ancestors

soup.a.parents  # a generator over every ancestor of the first <a>, innermost first: p, body, html, and finally the BeautifulSoup object itself
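
A quick way to see the ancestor chain is to collect the name of each parent. In this sketch (made-up markup, html.parser so that no wrapper tags are added) the last entry is the soup object, whose name is the special string "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a>, innermost ancestor first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```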

Siblings

from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" id="link1">
        <span>Elsle</span>
    </a>
    <a class="sister" id="link2">Lacie</a> and
    <a class="sister" id="link3">Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # every sibling after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # every sibling before it
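
Siblings include text nodes, so the node right after a tag is often a string rather than the next tag. The sketch below, on made-up markup, contrasts next_sibling with find_next_sibling(), which skips ahead to the next sibling that matches a filter.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">A</a> and <a id="link2">B</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # the text node right after the tag
print(first.find_next_sibling("a")["id"])  # the next sibling that is an <a>
print(first.previous_sibling)              # None: nothing comes before it
```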

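The selectors above make it hard to pick out one specific element, because attribute-style access only ever returns the first match. BeautifulSoup therefore also provides standard selectors which, like CSS selectors, search the document by tag name, attributes, or text content.

Standard selectors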

find_all(name,attrs,recursive,text,**kwargs)

name: the tag name
import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))  # find_all returns a list; here, every <ul> together with everything inside it
print(type(soup.find_all('ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>]
<class 'bs4.element.Tag'>

Each item in the list returned by find_all is a bs4.element.Tag, so you can call find_all on it again and search level by level:

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output: the li elements under each ul

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
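
find_all also takes limit and recursive arguments, which the examples above do not use. A short sketch on simplified, made-up markup: limit caps the number of results, and recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Baz</li></ul>
<ul id="list-2"><li>FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)          # stop after two matches
top_level = soup.find_all("ul", recursive=False)  # direct children of soup only
nested = soup.find_all("li", recursive=False)     # no <li> is a direct child

print([li.string for li in first_two])
print(len(top_level), len(nested))
```

html.parser is used so that the two ul elements stay direct children of the soup; lxml would wrap them in html and body tags.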

attrs: find_all(attrs={'name': 'element'}) finds every element whose name attribute has the value element

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={"class": "list"}))  # common attributes such as class and id can also be passed as keywords: class_="list", id="list-1"
print(soup.find_all(attrs={"id":"list-1"}))
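
The keyword form mentioned in the comment can be sketched as follows, on shortened made-up markup. Note that a class filter matches any element that carries that class, even when it has other classes too.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, hence the trailing underscore.
by_class = soup.find_all(class_="list")  # matches both <ul> elements
small = soup.find_all("ul", class_="list-small")
by_id = soup.find_all(id="list-1")

print([ul["id"] for ul in by_class])
print([ul["id"] for ul in small], [ul["id"] for ul in by_id])
```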

text: find_all(text="FOO") finds the strings whose content matches

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text="Foo"))

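Return value: ['Foo']
This returns the matching strings, not the elements that contain them, so on its own it mostly tells you whether the text is present and is of limited use for locating elements. (Newer releases of BeautifulSoup call this argument string; text is the older name.)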
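
Text matching becomes more useful once you step from the matched string to its parent, or combine it with a tag name. The sketch below assumes a BeautifulSoup release recent enough to accept the string= keyword (4.4.0 or later) and uses made-up markup.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>FOO</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A compiled regular expression works as a filter, here case-insensitive.
matches = soup.find_all(string=re.compile(r"^foo$", re.IGNORECASE))
print(matches)

# Each match is a NavigableString; .parent is the tag that contains it.
parents = [m.parent.name for m in matches]
print(parents)

# Combining a tag name with string= returns the tags themselves.
tags = soup.find_all("li", string="Bar")
print(tags)
```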

find(name,attrs,recursive,text,**kwargs)

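find() returns the first matching element, or None if nothing matches, whereas find_all() returns a list of every match.
It takes the same arguments as find_all(), so it is otherwise used in the same way.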
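
Because find() can return None, it is worth guarding the result before using it. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first_li = soup.find("li")  # first match only
table = soup.find("table")  # no match: None, not an empty list

print(first_li.string)
# Guard against None before touching attributes of the result.
caption = table.get_text() if table is not None else "(no table)"
print(caption)
```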

find_parents(), find_parent()

These take the same arguments as find_all() and find(). They return all matching ancestors and the nearest matching ancestor, respectively.

find_next_siblings(), find_next_sibling()

Return all matching siblings after the node, and the first matching sibling after it.

find_previous_siblings(), find_previous_sibling()

Return all matching siblings before the node, and the first matching sibling before it.

find_all_next(), find_next()

Return all matching nodes that come after the node in the document, and the first such node.

find_all_previous(), find_previous()

Return all matching nodes that come before the node in the document, and the first such node.
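
The following sketch exercises several of these methods from one starting node. The markup and the ids are made up for illustration; the starting point is the middle li.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="box"><ul>'
    '<li id="a">Foo</li><li id="b">Bar</li><li id="c">Baz</li>'
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("li", id="b")

parent_div = middle.find_parent("div")["id"]                  # nearest <div> ancestor
ancestors = [t.name for t in middle.find_parents(["ul", "div"])]
after = middle.find_next_sibling("li")["id"]                  # next <li> sibling
before = [li["id"] for li in middle.find_previous_siblings("li")]
next_li = middle.find_next("li")["id"]                        # next <li> in the document
earlier = [li["id"] for li in middle.find_all_previous("li")]

print(parent_div, ancestors)
print(after, before)
print(next_li, earlier)
```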

CSS selectors

Pass a CSS selector string to select() to make a selection.

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.pannel .pannel-heading'))  # elements with class pannel-heading inside an element with class pannel
print(soup.select('ul li'))                    # every <li> inside a <ul>, with its content
print(soup.select('#list-2 .element'))         # elements with class element inside the element whose id is list-2

Output:

[<div class="pannel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>, <li class="element">FOO</li>, <li class="element">BAR</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
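
select() always returns a list, which is empty when nothing matches. For a single element there is also select_one(). The sketch below, on shortened made-up markup, adds the child combinator > and a compound class selector.

```python
from bs4 import BeautifulSoup

html = """
<div class="pannel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("ul li")                # first match, or None
small = soup.select("ul.list.list-small > li")  # direct <li> children only
nothing = soup.select("table td")               # no match: an empty list

print(first.get_text())
print([li.get_text() for li in small])
print(len(nothing))
```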

Getting attributes


import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # the id attribute of each <ul>
    print(ul.attrs['id'])  # the same value; any attribute can be read either way

Getting text with get_text()

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
That's ok
FOO
BAR
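
On a tag with nested elements, get_text() keeps the surrounding newlines and indentation. Its separator and strip arguments tidy that up. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li class="element">Foo</li>
    <li class="element"> Bar </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.ul.get_text()                              # whitespace kept as is
joined = soup.ul.get_text(separator="|", strip=True)  # trimmed pieces joined by "|"

print(repr(raw))
print(joined)
```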

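Summary

  • Prefer the lxml parser; fall back to html.parser or html5lib when necessary
  • Tag selectors are fast but offer little filtering
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()
  • Remember the common ways to read attributes and text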