Basics review
- Characteristics of web-page HTML: it is a markup language built from tags
- What requests does and what it returns
BeautifulSoup
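- BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands the user the data to be scraped. Because it is simple, a complete application takes very little code.
- BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and BeautifulSoup cannot detect it. In that case you only need to state the original encoding.
- BeautifulSoup has become as capable a Python parsing tool as lxml and html5lib, letting the user choose between different parsing strategies or raw speed.
- BeautifulSoup is a Python library whose main job is extracting data from web pages.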
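The encoding behaviour described above can be sketched offline. This is a minimal illustration, assuming a made-up page served as GBK bytes with no charset declaration; the markup and variable names are hypothetical, and html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical page: GBK-encoded bytes with no charset declaration.
raw = "<html><body><p>编码测试</p></body></html>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string        # decoded to a Unicode string
utf8_bytes = soup.encode()  # serialised output, UTF-8 by default

print(repr(text))
print(utf8_bytes)
```

Without from_encoding, BeautifulSoup has to guess the encoding, and the guess can be wrong for short documents like this one.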
Installing BeautifulSoup
- Install from the command prompt by entering the following (this installs lxml as well):
pip install beautifulsoup4
pip install lxml
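Parsers supported by BeautifulSoup
- BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. The standard-library parser (html.parser) needs no extra installation, but the lxml parser is recommended because it is more powerful and faster. If no parser is named, BeautifulSoup picks the best one it finds installed, so it is safer to name one explicitly.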
| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, 'html.parser') | (1) Built into Python (2) Moderate speed (3) Tolerant of malformed documents | (1) Poor tolerance of malformed documents before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | (1) Fast (2) Tolerant of malformed documents | Requires a C library |
| lxml XML parser | BeautifulSoup(markup, ['lxml','xml']) or BeautifulSoup(markup, 'xml') | (1) Fast (2) The only parser that supports XML | Requires a C library |
| html5lib | BeautifulSoup(markup, 'html5lib') | (1) Best tolerance of malformed documents (2) Parses documents the way a browser does (3) Produces HTML5 documents | (1) Slow (2) Requires installing an extra Python package |
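Since the third-party parsers in the table may not be installed, a script can fall back to the built-in one. The sketch below is one possible way to do that; `make_soup` and its `preferred` parameter are hypothetical names, not part of BeautifulSoup. It relies on BeautifulSoup raising `FeatureNotFound` when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


# Deliberately unclosed markup: both parsers repair it.
soup, used = make_soup("<p>Hello")
print(used, soup.p.string)
```

Different parsers can build different trees from the same broken markup, so results that depend on the exact tree shape should pin one parser.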
Importing the BeautifulSoup module and basic usage
- Page to parse: https://news.qq.com/a/20170205/019837.htm
- Parser: lxml
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
print(r)
print(r.text)
print(type(r.text))
- Pretty-printing: output the result indented the way HTML is usually laid out
soup.prettify()
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup)
print(soup.prettify())
- Extracting HTML tags (Tag)
This approach returns only the first matching tag in the whole document.
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.head)
print(soup.title)
print(soup.a)
print(soup.p)
print(type(soup.title))
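The "first match only" behaviour is easy to check offline. This is a small sketch on a made-up HTML string (the markup is an assumption for illustration), using html.parser so it runs without lxml or a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <p> tags and no <table>.
html = ("<html><head><title>Demo</title></head>"
        "<body><p>first</p><p>second</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p      # only the first <p> is returned
missing = soup.table  # a tag that does not exist gives None

print(first_p)
print(missing)
```

Because a missing tag yields None rather than an error, chained access such as `soup.table.tr` fails with AttributeError, so it is worth checking for None first.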
- Two attributes of a Tag: name and attrs.
soup.name is a special case: its name is [document]. For any other tag, the value is the name of the tag itself.
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.title.name)
print(type(soup.title.name))
print(soup.a.name)
print(soup.p.name)
print(soup.title.attrs)
print(type(soup.title.attrs))
print(soup.a.attrs)
print(soup.p.attrs)
# Look up one specific attribute; .get() returns None if the first <a> has no style
print(soup.a.attrs.get('style'))
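As an offline illustration of name and attrs, the sketch below uses a made-up snippet (the markup and attribute values are assumptions). It also shows that the class attribute comes back as a list and that a missing attribute can be read safely with `Tag.get`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = ('<html><body><a class="nav top" href="/home">Home</a>'
        '<p>text</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

doc_name = soup.name         # the soup object itself is named '[document]'
a_name = soup.a.name         # 'a'
a_attrs = soup.a.attrs       # dict of attributes; class is a list of values
style = soup.a.get("style")  # None, because <a> has no style attribute
p_attrs = soup.p.attrs       # empty dict: <p> has no attributes

print(doc_name, a_name, a_attrs, style, p_attrs)
```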
- Extracting the text of an HTML tag: NavigableString, a string that can be traversed
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.title.string)
print(type(soup.title.string))
print(soup.a.string)
print(soup.p.string)
print(soup.head)
print(soup.head.string)
print(soup.head.text)  # .text returns a plain str and is not limited to a single tag
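The difference between .string and .text shows up as soon as a tag has more than one child. The following sketch uses a made-up document (an assumption for illustration) to show that .string is None in that case while .text joins the text of every descendant.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: <div> has two children, <title> has one.
html = ("<html><head><title>Demo</title></head>"
        "<body><div><p>a</p><p>b</p></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

title_string = soup.title.string  # NavigableString, since <title> holds one string
div_string = soup.div.string      # None, since <div> has two children
div_text = soup.div.text          # text of all descendants, joined together
spaced = soup.div.get_text(" ")   # same, with a separator between the pieces

print(type(title_string), div_string, div_text, spaced)
```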
Navigating the document tree
- Direct children
.contents returns a list; .children returns an iterator
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.head.contents)
print(type(soup.head.contents))
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
#print(soup.head.children)
print(type(soup.head.children))
for i in soup.head.children:
    print(i)
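A practical consequence of the list versus iterator distinction is that .children can only be walked once per access. The sketch below demonstrates this on a made-up head element (the markup is an assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two direct children under <head>.
html = ("<html><head><title>Demo</title><meta charset='utf-8'/></head>"
        "<body></body></html>")
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents  # a real list: can be indexed and reused
children = soup.head.children  # an iterator: consumed as it is read

names = [child.name for child in children]  # first pass reads everything
leftover = list(children)                   # second pass finds nothing left

print(contents)
print(names, leftover)
```

Accessing `soup.head.children` again produces a fresh iterator, so this only matters when the same iterator object is stored and reused.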
- All descendants
.descendants returns a generator
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.body.descendants)
print(type(soup.body.descendants))
for i in soup.body.descendants:
    print(i)
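To make the contrast with direct children concrete, here is a small sketch on a made-up nested document (the markup is an assumption). The body has one direct child but three descendants, because nested tags and their strings are all included.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <body> contains <div>, which contains <p>.
html = "<html><body><div><p>a</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct = len(soup.body.contents)          # only <div>
everything = list(soup.body.descendants)  # <div>, <p>, and the string 'a'
tag_names = [n.name for n in everything if isinstance(n, Tag)]

print(direct, len(everything), tag_names)
```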
- Parent node
.parent
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.parent)
print(type(soup.title.parent))
print(soup.title.parent.name)
print(soup.title.parent.attrs)
- All ancestors
.parents returns a generator
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
a = soup.body.a
for i in a.parents:
    print(i.name)
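The order in which .parents walks upward can be seen on a small made-up document (the markup is an assumption). The walk starts at the immediate parent and ends at the soup object itself, whose name is [document].

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a link nested three levels deep.
html = "<html><body><div><a href='#'>x</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect ancestor names from the nearest parent outward.
ancestors = [parent.name for parent in soup.a.parents]

print(ancestors)
```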
- Sibling nodes
.next_sibling .previous_sibling
Siblings are the nodes at the same level as the current node
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.p.next_sibling)
print(soup.p.previous_sibling)
- All siblings
.next_siblings and .previous_siblings return generators
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
for i in soup.p.next_siblings:
    print(i)
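On real pages the next sibling of a tag is often just the whitespace between two tags. The sketch below shows this on a made-up fragment (the markup is an assumption) and uses `find_next_sibling` and `find_next_siblings` to skip straight to tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: newlines separate the three <p> tags.
html = "<div><p>one</p>\n<p>two</p>\n<p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw_next = soup.p.next_sibling            # the newline string, not a tag
next_tag = soup.p.find_next_sibling("p")  # the next <p> tag
following = [s.string for s in soup.p.find_next_siblings("p")]
previous = soup.p.previous_sibling        # None: <p>one</p> is the first child

print(repr(raw_next), next_tag, following, previous)
```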
- Next and previous elements
.next_element .previous_element
Unlike the sibling attributes, these are not restricted to siblings: they cover every node, regardless of level, in the order the document was parsed
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.head.previous_element.name)
print(soup.head.previous_element)
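A side-by-side comparison makes the difference from siblings clearer. In this sketch on a made-up fragment (the markup is an assumption), the next element of the first paragraph is the string inside it, while its next sibling is the second paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two adjacent paragraphs.
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p
next_el = first.next_element      # the string 'one', parsed right after <p>
next_sib = first.next_sibling     # the second <p>, at the same level
prev_el = first.previous_element  # the enclosing <div>, parsed just before

print(repr(next_el), next_sib, prev_el.name)
```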
Methods for searching all tags
- find_all()
Searches all descendant tags of the current tag and returns those that match the filter conditions
find_all(name, attrs, recursive, string, limit, **kwargs)
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.find_all('title'))
print(soup.find_all('meta'))
print(soup.find_all('img'))
print(soup.find_all('img','sspLogo'))
print(soup.find_all('img',limit=2)) # limit caps how many results are returned.
print(soup.find_all('img',height='20')) # keyword argument: images whose height attribute is '20'.
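The filters used above can be tried offline on a small made-up fragment. The class names, file names, and heights below are assumptions for illustration. The sketch also shows the recursive parameter from the signature, which restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three images inside one <div>.
html = """
<div>
  <img class="logo" src="a.png" height="20">
  <img class="photo" src="b.png" height="40">
  <img class="photo" src="c.png" height="20">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("img", "photo")           # second positional argument is a CSS class
by_class_kw = soup.find_all("img", class_="photo") # same filter, spelled as a keyword
first_two = soup.find_all("img", limit=2)          # stop after two matches
h20 = soup.find_all("img", height="20")            # filter on an attribute value
top_level = soup.find_all("img", recursive=False)  # direct children of soup only

print([i["src"] for i in by_class], [i["src"] for i in h20], len(top_level))
```

The keyword is spelled `class_` because `class` is a reserved word in Python.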
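- Keyword arguments: using a regular expression from the re module to pick out specific links
Setting the details of regular expressions aside for now, you can query directly with href=re.compile('...').
soup.find_all returns a list in which every element is a Tag, so you can extract text, attrs, and so on from each one.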
import requests
import re
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
for i in soup.find_all('a',href=re.compile('news.qq.com/a/201605')):
    print(i,type(i))
    print(i.text)
    print(i.attrs['href'])
    print('\n')
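The regular-expression filter can be checked offline on a few made-up links. The first URL and the link texts below are hypothetical. The dots in the pattern are escaped here so that they match literal dots only, which is a small tightening of the pattern used above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical links: only the first one falls in 2016-05.
html = """
<a href="https://news.qq.com/a/20160501/000001.htm">May story</a>
<a href="https://news.qq.com/a/20170205/019837.htm">Feb story</a>
<a name="top">anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The pattern is searched for anywhere inside each href value.
pattern = re.compile(r"news\.qq\.com/a/201605")
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", href=pattern)]

print(links)
```

Tags without an href attribute are skipped by this filter, so indexing `a["href"]` on the results is safe.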
- find()
Returns a single result: the first match
find(name, attrs, recursive, string, **kwargs)
import requests
from bs4 import BeautifulSoup
r = requests.get(url='https://news.qq.com/a/20170205/019837.htm')
soup = BeautifulSoup(r.text,'lxml')
print(soup.find('a'))
print(type(soup.find('a')))
print(soup.find('a').text)
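The relationship between find() and find_all() can be sketched on a made-up fragment (the markup is an assumption). find() gives the same tag as the first element of find_all() with limit=1, and the two differ in what they return when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links and no images.
html = "<div><a href='/1'>one</a><a href='/2'>two</a></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                 # a single Tag
via_list = soup.find_all("a", limit=1) # a one-element list holding the same Tag
missing = soup.find("img")             # None when nothing matches
empty = soup.find_all("img")           # an empty list when nothing matches

print(first, via_list, missing, empty)
```

Because find() can return None, code such as `soup.find('a').text` is only safe when the tag is known to exist.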