1. Fetch the web page
import requests

url = "https://www.douban.com/group/explore"
headers = {
"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
}
response = requests.get(url, headers=headers)
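The request above can be wrapped in a small helper that adds a timeout and fails loudly on HTTP errors. This is only a sketch: the function name fetch_html and the 10-second timeout are assumptions, not part of the original notes.

```python
import requests

# URL and User-Agent are taken from the notes; the helper name is hypothetical.
URL = "https://www.douban.com/group/explore"
HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
}


def fetch_html(url=URL, headers=HEADERS, timeout=10):
    """Return the page body as text, raising on HTTP error statuses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text

# Usage (needs network access):
# html = fetch_html()
```

Without a timeout, requests.get can wait indefinitely on an unresponsive server, so passing one is usually worthwhile in a crawler.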
2. Parse the page with XPath (from lxml import etree)
2.1 First convert the HTML text into an etree_html object
etree_html = etree.HTML(response.text)
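To try this step without a network request, etree.HTML can be fed an inline string. The sample markup below is made up for illustration and deliberately leaves some tags unclosed, to show that the parser repairs them and wraps the fragment in html and body elements.

```python
from lxml import etree

# Hypothetical sample markup: the second <li> and the <ul> are never closed properly.
html = (
    '<div><ul><li class="item"><a href="link1.html">first</a></li>'
    '<li><a href="link2.html">second</a></ul></div>'
)

etree_html = etree.HTML(html)          # returns the root <html> element
root_tag = etree_html.tag
repaired = etree.tostring(etree_html, encoding="unicode")

print(root_tag)
print(repaired)
```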
2.2 Match all nodes
# Match all nodes: //*
result = etree_html.xpath('//*')
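A small sketch of what //* returns, using a made-up snippet: every element in the parsed tree, in document order, including the html and body wrappers that the parser adds.

```python
from lxml import etree

# Hypothetical sample markup.
html = "<div><p>one</p><p>two <span>three</span></p></div>"
etree_html = etree.HTML(html)

result = etree_html.xpath("//*")       # list of Element objects
tags = [el.tag for el in result]
print(tags)  # ['html', 'body', 'div', 'p', 'p', 'span']
```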
2.3 Match all descendant nodes
# Match every <a> node anywhere in the document and get its text: text()
result = etree_html.xpath('//a/text()')
print(result)
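As an illustration on made-up markup: /text() only returns text nodes that are direct children of the matched a element, while //text() also descends into nested tags.

```python
from lxml import etree

# Hypothetical sample markup; the third link keeps its text inside a <b> tag.
html = (
    '<div><a href="/a">First</a>'
    '<p><a href="/b">Second</a></p>'
    '<a href="/c"><b>Bold</b></a></div>'
)
etree_html = etree.HTML(html)

direct_text = etree_html.xpath("//a/text()")   # text directly inside <a>
all_text = etree_html.xpath("//a//text()")     # text at any depth below <a>

print(direct_text)  # ['First', 'Second']
print(all_text)     # ['First', 'Second', 'Bold']
```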
2.4 Find an element's child nodes
# Find direct child nodes: /
result = etree_html.xpath('//div/p/text()')
print(result)
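The difference between / (direct children only) and // (any depth) can be sketched on a made-up snippet:

```python
from lxml import etree

# Hypothetical sample markup: one <p> directly under the div, one nested deeper.
html = "<div><p>direct</p><blockquote><p>nested</p></blockquote></div>"
etree_html = etree.HTML(html)

children = etree_html.xpath("//div/p/text()")      # only <p> that are children of <div>
descendants = etree_html.xpath("//div//p/text()")  # <p> at any depth below <div>

print(children)     # ['direct']
print(descendants)  # ['direct', 'nested']
```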
2.5 Get the parent of the current node
Parent node: .. works like cd ../ and selects the parent of the current node
result = etree_html.xpath('//span[@class="pubtime"]/../span/a/text()')
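A sketch of the same expression on made-up markup that only loosely imitates the page structure (the class names source and from are assumptions). It also shows lxml's getparent() as an alternative once an element is in hand.

```python
from lxml import etree

# Hypothetical sample markup.
html = '''
<div class="channel-item">
  <div class="source">
    <span class="from">from <a href="/group/1">Group One</a></span>
    <span class="pubtime">yesterday 20:00</span>
  </div>
</div>
'''
etree_html = etree.HTML(html)

# Step up from the pubtime span to its parent, then down to a sibling span's link.
result = etree_html.xpath('//span[@class="pubtime"]/../span/a/text()')
print(result)  # ['Group One']

# The same parent can be reached from Python with getparent().
span = etree_html.xpath('//span[@class="pubtime"]')[0]
parent_class = span.getparent().get("class")
print(parent_class)  # source
```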
2.6 Attributes
Single-attribute match: [@class="xxx"]
# Text match: text(); use //text() to get all nested text
result = etree_html.xpath('//div[@class="article"]//text()')
Multi-valued attribute match: contains(@class, 'xx'). Note: when the class attribute holds several values, use contains() instead of an exact match.
result = etree_html.xpath('//div[contains(@class, "grid-16-8")]//div[@class="likes"]/text()[1]')
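A sketch on made-up markup showing why the exact match fails on a multi-valued class, and one caveat: contains() is a plain substring test, so a short value can match more than intended. The concat/normalize-space idiom at the end is a common XPath 1.0 workaround for matching a whole class token; it is not from the original notes.

```python
from lxml import etree

# Hypothetical sample markup.
html = '''
<div class="grid-16-8 clearfix">
  <div class="likes">12<br/>likes</div>
</div>
<div class="grid">
  <div class="likes">3<br/>likes</div>
</div>
'''
etree_html = etree.HTML(html)

# Exact match compares the whole attribute string, so it finds nothing here.
exact = etree_html.xpath('//div[@class="grid-16-8"]')

# contains() matches, and text()[1] picks the first text node of each likes div.
likes = etree_html.xpath(
    '//div[contains(@class, "grid-16-8")]//div[@class="likes"]/text()[1]'
)

# Substring caveat: "grid" is found in both "grid-16-8 clearfix" and "grid".
loose = etree_html.xpath('//div[contains(@class, "grid")]')

# Whole-token match: pad with spaces and look for " grid ".
token = etree_html.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " grid ")]'
)

print(len(exact), likes, len(loose), len(token))  # 0 ['12'] 2 1
```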
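Multi-condition match
# Operators usable in expressions: or, and, mod, | (union, e.g. //book | //cd), + - * div = != < > <= >=
# Positional selection: [1] [last()] [position() < 3] [last() - 2]. These predicates go inside the XPath expression, like the text()[1] in the multi-valued match above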
result = etree_html.xpath('//span[@class="pubtime" and contains(text(), "昨天")]/text()')
print(result)
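The operators and positional predicates listed above can be tried on a small made-up list. Positions in XPath start at 1, and they count among the matching children of each parent.

```python
from lxml import etree

# Hypothetical sample markup.
html = '''
<ul>
  <li><span class="pubtime">yesterday 20:00</span></li>
  <li><span class="pubtime">today 09:30</span></li>
  <li><span class="pubtime">yesterday 08:15</span></li>
  <li><span class="note">yesterday</span></li>
</ul>
'''
etree_html = etree.HTML(html)

both = etree_html.xpath(
    '//span[@class="pubtime" and contains(text(), "yesterday")]/text()'
)
either = etree_html.xpath(
    '//span[@class="note" or contains(text(), "today")]/text()'
)
first = etree_html.xpath("//li[1]/span/text()")
last = etree_html.xpath("//li[last()]/span/text()")
first_two = etree_html.xpath("//li[position() < 3]/span/text()")
third_from_end = etree_html.xpath("//li[last() - 2]/span/text()")

print(both)            # ['yesterday 20:00', 'yesterday 08:15']
print(either)          # ['today 09:30', 'yesterday']
print(first)           # ['yesterday 20:00']
print(last)            # ['yesterday']
print(first_two)       # ['yesterday 20:00', 'today 09:30']
print(third_from_end)  # ['today 09:30']
```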
Getting attribute values
result = etree_html.xpath('//div[@class="article"]/div/div/@class')[0]
print(result)
result = etree_html.xpath('//div[@class="bd"]/h3/a/@href')
print(result)
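A sketch of attribute retrieval on made-up markup (the example.com URLs are placeholders). Indexing with [0] raises IndexError when nothing matches, so a guard is shown for that case, together with Element.get() as an alternative to @href.

```python
from lxml import etree

# Hypothetical sample markup; the third block has no link.
html = '''
<div class="bd"><h3><a href="https://example.com/topic/1">Topic one</a></h3></div>
<div class="bd"><h3><a href="https://example.com/topic/2">Topic two</a></h3></div>
<div class="bd"><h3>No link here</h3></div>
'''
etree_html = etree.HTML(html)

# @href returns the attribute values as a list of strings.
hrefs = etree_html.xpath('//div[@class="bd"]/h3/a/@href')
print(hrefs)

# An expression that matches nothing returns an empty list, so guard before [0].
missing = etree_html.xpath('//div[@class="missing"]/@class')
first_missing = missing[0] if missing else None
print(first_missing)  # None

# Selecting the elements and calling .get() gives the same values.
via_get = [a.get("href") for a in etree_html.xpath('//div[@class="bd"]/h3/a')]
print(via_get == hrefs)  # True
```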
2.7 Node axes (for reference)
//li/ancestor::* all ancestor nodes
//li/ancestor::div only the div ancestors
//li/attribute::* attribute axis: all attribute values of the li nodes
//li/child::a[@href="link1.html"] child axis: direct child nodes
//li/descendant::span all span descendants
//li/following::* all nodes in the document after the closing tag of the current node
//li/following-sibling::* all sibling nodes after the current node (the current node itself is not included)
result = etree_html.xpath('//div[@class="channel-item"]/following-sibling::*')
print(result)
print(len(result))
result = etree_html.xpath('//div[@class="channel-item"][1]/following-sibling::*')
print(result)
print(len(result))
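A sketch on made-up markup of how the two following-sibling expressions behave. Without a positional predicate the result is the union of the siblings following every matched node, so with a single parent it equals the result for the first item; narrowing to the last item leaves only what follows that one.

```python
from lxml import etree

# Hypothetical sample markup.
html = '''
<div id="list">
  <div class="channel-item">A</div>
  <div class="channel-item">B</div>
  <div class="channel-item">C</div>
  <p class="paginator">next</p>
</div>
'''
etree_html = etree.HTML(html)

after_any = etree_html.xpath('//div[@class="channel-item"]/following-sibling::*')
after_first = etree_html.xpath('//div[@class="channel-item"][1]/following-sibling::*')
after_last = etree_html.xpath('//div[@class="channel-item"][last()]/following-sibling::*')

texts_any = [el.text for el in after_any]
texts_first = [el.text for el in after_first]
texts_last = [el.text for el in after_last]

print(texts_any)    # ['B', 'C', 'next']
print(texts_first)  # ['B', 'C', 'next']
print(texts_last)   # ['next']

# Two more axes from the list above.
attrs = etree_html.xpath('//div[@id="list"]/attribute::*')
div_ancestors = etree_html.xpath('//p[@class="paginator"]/ancestor::div')
print(attrs, len(div_ancestors))  # ['list'] 1
```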