A Detailed Guide to XPath

Getting started with XPath

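When a Python crawler scrapes web pages, it has to parse HTML- or XML-structured data. If you use regular expressions for this, just writing the patterns is enough to put many people off.

As the old joke goes: "This problem can be solved with a regular expression," and now one problem has become two.

For those of us who dislike writing regular expressions, XPath offers a more convenient way to parse data.

XPath stands for XML Path Language. As the name suggests, it is a language designed for locating nodes in structured documents such as XML and HTML.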

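Common XPath rules

The rules used most often, each covered in the sections that follow, are // (descendants at any depth), / (direct children), .. (the parent node), and @ (attributes).

Install the lxml library before using XPath: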

pip install lxml

A first example:

from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a>
        <li class="item-2"><a href="link3.html">third</li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>
'''

html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

Note the HTML fragment in the code: the second li is never closed, and the a tag inside the third li is never closed.

Output:

<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a>
        </li><li class="item-2"><a href="link3.html">third</a></li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>
</body></html>

As you can see, lxml's HTML parser not only closed the missing tags but also added the html and body nodes.
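To check the repair without reading the serialized output, you can query the parsed tree directly. The following is a minimal sketch; the variable names (broken, root, li_count, link_texts) are only illustrative.

```python
from lxml import etree

# The same malformed fragment: the second <li> and the third <a> are never closed.
broken = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a>
        <li class="item-2"><a href="link3.html">third</li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>
'''

root = etree.HTML(broken)                 # root element of the repaired tree
li_count = len(root.xpath('//li'))        # number of li elements after repair
link_texts = root.xpath('//li/a/text()')  # text of each a that is a child of an li

print(root.tag, li_count, link_texts)
```

If the parser repaired the fragment as shown above, every li ends up with exactly one a child.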

You can also parse HTML read from a file.

Create test.html:

<div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a></li>
        <li class="item-2"><a href="link3.html">third</a></li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>

Then parse the file:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a></li>
        <li class="item-2"><a href="link3.html">third</a></li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>
</body></html>
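The two entry points return different objects, which probably explains why the DOCTYPE line appears only in the second output: etree.HTML returns the root element, while etree.parse returns a whole document tree. A small sketch of the difference follows. It uses an in-memory StringIO in place of test.html so that it runs on its own, and the variable names are illustrative.

```python
from io import StringIO
from lxml import etree

page = '<div><ul><li class="item-0"><a href="link1.html">first</a></li></ul></div>'

# etree.parse accepts a file name or a file-like object and returns an ElementTree.
tree = etree.parse(StringIO(page), etree.HTMLParser())
root_from_parse = tree.getroot()

# etree.HTML takes the markup itself and returns the root Element directly.
root_from_string = etree.HTML(page)

# Both objects support .xpath(), so the queries in this article work on either.
texts_from_tree = tree.xpath('//a/text()')
texts_from_root = root_from_string.xpath('//a/text()')
print(root_from_parse.tag, texts_from_tree, texts_from_root)
```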

Selecting all nodes: //

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Output:

[<Element html at 0x1085a5e88>, <Element body at 0x1085a5f88>, <Element div at 0x1085a5fc8>, <Element ul at 0x1085c9048>, <Element li at 0x1085c9088>, <Element a at 0x1085c9108>, <Element li at 0x1085c9148>, <Element a at 0x1085c9188>, <Element li at 0x1085c91c8>, <Element a at 0x1085c90c8>, <Element li at 0x1085c9208>, <Element a at 0x1085c9248>]

//* matches every element node in the document.

To match specific nodes, for example all li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)  # all li nodes
print(result[0])  # the first li node

Output:

[<Element li at 0x110115f88>, <Element li at 0x110115fc8>, <Element li at 0x110139048>, <Element li at 0x110139088>]
<Element li at 0x110115f88>
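Each item in the result list is an lxml Element, so once a node is selected you can read its name and attributes through the element API. A short sketch, with a one-item page inlined so it does not depend on test.html:

```python
from lxml import etree

root = etree.HTML('<ul><li class="item-0"><a href="link1.html">first</a></li></ul>')
first_li = root.xpath('//li')[0]

tag_name = first_li.tag             # element name
css_class = first_li.get('class')   # a single attribute, or None if it is absent
attributes = dict(first_li.attrib)  # all attributes as a plain dict

print(tag_name, css_class, attributes)
```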

Child nodes: /

Select the direct children of li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
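result = html.xpath('//li/a')  # all a nodes that are direct children of li nodes
print(result)
# [<Element a at 0x103c02f88>, <Element a at 0x103c02fc8>, <Element a at 0x103c26048>, <Element a at 0x103c26088>]

With // instead of /, descendants at any depth are matched, so the query can be written like this: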

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//div//a')  # all a nodes that are descendants of div
print(result)
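The difference between / and // is easiest to see when the target is not a direct child. In this sketch (page content inlined, names illustrative) the a nodes are grandchildren of ul, so the single slash finds nothing:

```python
from lxml import etree

page = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first</a></li>
        <li class="item-1"><a href="link2.html">second</a></li>
        <li class="item-2"><a href="link3.html">third</a></li>
        <li class="item-3"><a href="link4.html">fourth</a></li>
    </ul>
</div>
'''
root = etree.HTML(page)

direct = root.xpath('//ul/a')   # a is not a direct child of ul, so this is empty
nested = root.xpath('//ul//a')  # a at any depth below ul

print(len(direct), len(nested))
```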

Parent node: ..

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# get the class of the parent of the a tag whose href is link2.html
result = html.xpath('//a[@href="link2.html"]/../@class')

print(result)
# ['item-1']
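The .. shorthand is equivalent to the parent:: axis, and lxml elements also expose getparent() for the same step in Python. A brief sketch comparing the three, with an inlined page and illustrative names:

```python
from lxml import etree

page = '''
<ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
</ul>
'''
root = etree.HTML(page)

by_dots = root.xpath('//a[@href="link2.html"]/../@class')         # shorthand
by_axis = root.xpath('//a[@href="link2.html"]/parent::*/@class')  # explicit axis

link = root.xpath('//a[@href="link2.html"]')[0]
by_api = link.getparent().get('class')                            # element API

print(by_dots, by_axis, by_api)
```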

Attribute matching: @

Match nodes by attribute value:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# get the li whose class attribute is item-0
result = html.xpath('//li[@class="item-0"]')

print(result)
# [<Element li at 0x10c2b1f88>]
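When the attribute value comes from a variable, lxml lets you pass it as an XPath variable through a keyword argument instead of building the expression with string formatting. The helper li_by_class below is a hypothetical name used only for this sketch.

```python
from lxml import etree

page = '''
<ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
</ul>
'''
root = etree.HTML(page)


def li_by_class(node, class_name):
    """Return the li elements whose class attribute equals class_name."""
    # $cls is an XPath variable; lxml fills it from the keyword argument.
    return node.xpath('//li[@class=$cls]', cls=class_name)


matches = li_by_class(root, 'item-0')
no_matches = li_by_class(root, 'no-such-class')
print(len(matches), len(no_matches))
```

Passing the value this way avoids quoting problems when it contains quote characters.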

Getting attribute values

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# get the href attribute of the a children of all li nodes
result = html.xpath('//li/a/@href')

print(result)
# ['link1.html', 'link2.html', 'link3.html', 'link4.html']
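If you need each link's address together with its text, one option is to select the a elements once and read both values from every element. This is a sketch with an inlined page; links is an illustrative name.

```python
from lxml import etree

page = '''
<ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
</ul>
'''
root = etree.HTML(page)

# Build (href, text) pairs so each address stays attached to its own label.
links = [(a.get('href'), a.text) for a in root.xpath('//li/a')]
print(links)
```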

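Matching attributes with multiple values

When an attribute holds several values, an exact comparison fails; use the contains function instead: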

from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''

html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
# []

result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
# ['first item']
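Keep in mind that contains is a plain substring test, so contains(@class, "li") would also match a class such as "list". A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below adds a second, hypothetical li with class "list" to show the difference.

```python
from lxml import etree

text = '''
<ul>
    <li class="li li-first"><a href="link.html">first item</a></li>
    <li class="list"><a href="other.html">other item</a></li>
</ul>
'''
root = etree.HTML(text)

# Substring test: "list" contains "li", so both items match.
by_substring = root.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: pad the class list with spaces and search for " li ".
by_token = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(by_substring, by_token)
```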

Matching multiple attributes

To match nodes that satisfy several attribute conditions at once, use the and operator:

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''

html = etree.HTML(text)
# match on both the class and name attributes
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
# ['first item']

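XPath operators

XPath 1.0 provides the boolean operators or and and, the comparison operators =, !=, <, <=, >, and >=, the arithmetic operators +, -, *, div, and mod, and | for the union of two node sets.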
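A few of the operators can be tried against the same list used throughout this article. The following sketch inlines the page so that it runs on its own; the variable names are illustrative.

```python
from lxml import etree

page = '''
<ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
    <li class="item-2"><a href="link3.html">third</a></li>
    <li class="item-3"><a href="link4.html">fourth</a></li>
</ul>
'''
root = etree.HTML(page)

# or: either condition may hold
either = root.xpath('//li[@class="item-0" or @class="item-3"]/a/text()')
# !=: every li whose class differs from item-0
not_first = root.xpath('//li[@class != "item-0"]/a/text()')
# mod: li nodes at even positions
even = root.xpath('//li[position() mod 2 = 0]/a/text()')
# |: union of two node sets, returned in document order
ends = root.xpath('//li[1]/a/text() | //li[last()]/a/text()')

print(either, not_first, even, ends)
```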

Getting text

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# get the text of the a child of the li whose class is item-0
result = html.xpath('//li[@class="item-0"]/a/text()')

print(result)
# ['first']

To get all the text inside descendant nodes as well, use //text():

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# get the text of all descendant nodes of every li
result = html.xpath('//li//text()')

print(result)
# ['first', 'second', 'third', 'fourth']
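In test.html both forms return the same strings because each a contains only plain text. The difference shows up once an element has nested markup. The sketch below uses a hypothetical fragment with a b tag inside the link, and also shows the XPath string() function, which joins all descendant text into one value.

```python
from lxml import etree

root = etree.HTML(
    '<ul><li class="item-0"><a href="link1.html">first <b>item</b></a></li></ul>'
)

direct_text = root.xpath('//li/a/text()')  # only text directly inside a
all_text = root.xpath('//li//text()')      # text at any depth below li
joined = root.xpath('string(//li)')        # a single concatenated string

print(direct_text, all_text, repr(joined))
```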

Selecting by position

Select nodes according to their order:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())

# select by index (XPath positions start at 1, not 0)
result = html.xpath('//li[1]/a/text()')
print(result)
# ['first']

# last() selects the last one
result = html.xpath('//li[last()]/a/text()')
print(result)
# ['fourth']

# position() filters by position
result = html.xpath('//li[position()<3]/a/text()')
print(result)
# ['first', 'second']

# the - operator: last()-2 is the third from the end
result = html.xpath('//li[last()-2]/a/text()')
print(result)
# ['second']
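One subtlety: a position predicate counts within each parent, not across the whole document. With a single ul this makes no difference, but on a page with several lists //li[1] returns the first li of every list. Wrapping the path in parentheses applies the index to the combined result instead. The sketch below uses a hypothetical page with two lists.

```python
from lxml import etree

root = etree.HTML(
    '<div>'
    '<ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul>'
    '</div>'
)

first_in_each = root.xpath('//li[1]/text()')       # first li of each ul
first_overall = root.xpath('(//li)[1]/text()')     # first li in the document
last_overall = root.xpath('(//li)[last()]/text()')  # last li in the document

print(first_in_each, first_overall, last_overall)
```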

Node axes

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())

# all ancestor nodes
result = html.xpath('//li[1]/ancestor::*')
print(result)
# [<Element html at 0x106e4be88>, <Element body at 0x106e4bf88>, <Element div at 0x106e4bfc8>, <Element ul at 0x106e6f048>]

# the div among the ancestor nodes
result = html.xpath('//li[1]/ancestor::div')
print(result)
# [<Element div at 0x106ce4fc8>]

# all attribute values of the node
result = html.xpath('//li[1]/attribute::*')
print(result)
# ['item-0']

# child nodes: the a child whose href is link1.html
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# [<Element a at 0x107941fc8>]

# the a nodes among the descendants
result = html.xpath('//li[1]/descendant::a')
print(result)
# [<Element a at 0x10eeb7fc8>]

# the 2nd of all nodes that follow this node in document order (counting from 1)
result = html.xpath('//li[1]/following::*[2]')
print(result)
# [<Element a at 0x10f188f88>]

# all sibling nodes that come after this node
result = html.xpath('//li[1]/following-sibling::*')
print(result)
# [<Element li at 0x104b7f048>, <Element li at 0x104b7f088>, <Element li at 0x104b7f0c8>]
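The examples above use forward axes only. On reverse axes such as preceding-sibling, a position predicate counts backwards from the context node, so [1] means the nearest earlier sibling. The sketch below, with an inlined page and illustrative names, shows this.

```python
from lxml import etree

page = '''
<ul>
    <li class="item-0"><a href="link1.html">first</a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
    <li class="item-2"><a href="link3.html">third</a></li>
    <li class="item-3"><a href="link4.html">fourth</a></li>
</ul>
'''
root = etree.HTML(page)

# All earlier siblings of the third li; results come back in document order.
earlier = root.xpath('//li[@class="item-2"]/preceding-sibling::li/a/text()')
# On a reverse axis, [1] is the closest node, i.e. the li just before.
nearest_before = root.xpath(
    '//li[@class="item-2"]/preceding-sibling::li[1]/a/text()'
)
# On a forward axis, [1] is the first node after the context node.
nearest_after = root.xpath(
    '//li[@class="item-2"]/following-sibling::li[1]/a/text()'
)

print(earlier, nearest_before, nearest_after)
```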