Simple Notes on Python Web Scraping: Basic Usage of BeautifulSoup4

Originally published at: http://blog.orisonchan.cc/2018/08/16/44

These notes record the basic usage of BeautifulSoup4, a library commonly used for web scraping in Python. All demos use my own blog as the target.

1 urllib and urllib2

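Python 2 ships with two networking modules, urllib and urllib2. urllib2 is the more capable of the two. urllib lets us read HTTP and FTP resources much as we would read a file, while urllib2 builds on that with more interfaces, covering features such as cookies, proxies, and authentication. In Python 3 the two were reorganized into the single urllib package (urllib.request, urllib.parse, urllib.error).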

To borrow the points made in article one and article two:

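urllib accepts only a URL; it cannot create a Request instance with custom headers.
However, urllib provides the urlencode() method for building GET query strings, while urllib2 does not (this is the main reason urllib and urllib2 are so often used together).
Encoding is done with urllib's urlencode() function, which converts key:value pairs into a string of the form 'key=value'. Decoding can be done with urllib's unquote().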
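The encoding and decoding described above can be sketched as follows. This assumes Python 3, where urlencode() and unquote() live in urllib.parse (in Python 2.7 both are imported from urllib). The parameter names and values are made up for illustration.

```python
from urllib.parse import urlencode, unquote

# Hypothetical query parameters, chosen only for illustration
params = {"q": "beautiful soup", "page": 2}

# urlencode() turns key/value pairs into a GET query string
query = urlencode(params)
print(query)  # q=beautiful+soup&page=2

# unquote() reverses percent-encoding (here: the UTF-8 bytes of one character)
decoded = unquote("%E6%A0%91")
print(decoded)  # 树
```

Note that urlencode() encodes spaces as "+", so a query string produced this way is usually decoded with unquote_plus() or parse_qs() rather than plain unquote().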

Both libraries ship with Python, and a simple scraper needs very little from them, so I won't go into more detail here.

2 BeautifulSoup

Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id13

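On current Macs you can no longer install pip with

$ sudo easy_install pip

Instead, you have to use the get-pip.py script, for which there are plenty of tutorials online. Once pip is installed, use it to install beautifulsoup4.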

After installation, here is a simple example (the environment is Python 2.7; the code is much the same in Python 3):

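# Python 2.7; on Python 3 use: from urllib.request import urlopen
from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://blog.orisonchan.cc")
# Name the parser explicitly so Beautiful Soup does not have to guess one
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)  # the first <h1> tag on the page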

How to tell when the page does not exist

Method 1: check for None

if html is None:
    print("URL not found")
else:
    pass  # continue processing the page here

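Method 2: try/except

# Python 3 imports; on Python 2.7 use: from urllib2 import urlopen, HTTPError
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://blog.orisonchan.cc")
except HTTPError as e:
    print(e)
    # handle the error here (for example, skip this page)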

Here is a complete example excerpted from the O'Reilly book Web Scraping with Python (this one is written for Python 3):

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The page could not be retrieved (e.g. 404)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so accessing .h1 on it failed
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
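The AttributeError branch can be tried without any network access. The following is a minimal Python 3 sketch, not the book's code: first_heading is a hypothetical helper name, and the HTML strings are made up. It shows that a missing body raises AttributeError (because None has no h1 attribute), while a missing h1 inside an existing body simply yields None.

```python
from bs4 import BeautifulSoup


def first_heading(markup):
    """Return the first <h1> inside <body>, or None if it cannot be found."""
    soup = BeautifulSoup(markup, "html.parser")
    try:
        # soup.body is None when there is no <body>; None.h1 raises AttributeError
        return soup.body.h1
    except AttributeError:
        return None


with_h1 = first_heading("<html><body><h1>Hello</h1></body></html>")
no_h1 = first_heading("<html><body><p>no heading</p></body></html>")
no_body = first_heading("<p>fragment without a body tag</p>")

print(with_h1)   # <h1>Hello</h1>
print(no_h1)     # None
print(no_body)   # None
```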

2.1 Common methods

2.1.1 find()

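Signature:

find(name, attrs, recursive, text, **kwargs)

Parameters

name: the tag name, such as a or p.
attrs: a dictionary of attributes and the values they must have.
recursive: whether to search recursively. If True, all descendants of the tag are searched. Defaults to True.
text: match against the tag's text content rather than its attributes (since Beautiful Soup 4.4.0 this argument is also available under the name string).
kwargs: keyword arguments that select tags having the given attributes.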

find() example:

from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://blog.orisonchan.cc")
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.find(name='h1', attrs={'class': 'post-title'}))

Result:

<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/08/14/43/" itemprop="url">常見“樹”概念解析（1）</a></h1>

2.1.2 find_all()

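find_all(name, attrs, recursive, text, limit, **kwargs)

Parameters

name: the tag name, such as a or p.
attrs: a dictionary of attributes and the values they must have.
recursive: whether to search recursively. If True, all descendants of the tag are searched. Defaults to True.
text: match against the tag's text content rather than its attributes.
limit: the maximum number of results. find() is effectively find_all() with limit=1, as the find() source code shows.
kwargs: keyword arguments that select tags having the given attributes.

bsObj.find_all("a") can be abbreviated as bsObj("a").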
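The parameters above can be tried offline on an inline HTML snippet. This is a minimal Python 3 sketch; the markup is made up and only loosely modeled on the blog's post titles.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<h1 class="post-title"><a href="/a/">First</a></h1>
<h1 class="post-title"><a href="/b/">Second</a></h1>
<h1 class="other"><a href="/c/">Third</a></h1>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the same tag as the single result of find_all(..., limit=1)
first = soup.find("h1", attrs={"class": "post-title"})
limited = soup.find_all("h1", attrs={"class": "post-title"}, limit=1)
print(first is limited[0])  # True

# Keyword form: class is a reserved word in Python, so bs4 uses class_
titles = [h.a.get_text() for h in soup.find_all("h1", class_="post-title")]
print(titles)  # ['First', 'Second']

# Calling the soup object directly is shorthand for find_all()
links = soup("a")
print(len(links))  # 3
```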

find_all() example:

from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://blog.orisonchan.cc")
bsObj = BeautifulSoup(html.read(), 'html.parser')
# Python 2 only: decode the escaped list repr so the Chinese titles print readably
print(str(bsObj.find_all(name='h1', attrs={'class': 'post-title'})[1:3]).decode('unicode-escape'))

Result:

[<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/08/13/42/" itemprop="url">Spark聚合下推思路以及demo</a></h1>, <h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/08/09/41/" itemprop="url">寫一個Spark DataSource的隨手筆記</a></h1>]

2.2 Common objects

2.2.1 Tag object

A Tag corresponds to a tag in the HTML. Two of its attributes are name and attrs. Usage:

from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://blog.orisonchan.cc")
bsObj = BeautifulSoup(html.read(), 'html.parser')
tag = bsObj.h1
print(tag)
print(tag.name)
print(tag.attrs)
print(tag['class'])

Result:

<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/08/14/43/" itemprop="url">常見“樹”概念解析（1）</a></h1>
h1
{u'class': [u'post-title'], u'itemprop': u'name headline'}
[u'post-title']

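2.2.2 NavigableString object

Represents the text contained in a tag, accessed through tag.string. Note: if the tag has more than one child (for example, a child tag alongside text or whitespace), tag.string is None!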
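A small Python 3 sketch of this behavior, using made-up markup: a tag whose only child is text yields a NavigableString, while a tag with a newline plus a child tag yields None, in which case get_text() is one way to collect the text anyway.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup: <p> holds only text, <div> holds a newline plus a child tag
soup = BeautifulSoup('<p>plain text</p><div>\n<a href="/x/">link</a></div>',
                     "html.parser")

text = soup.p.string
print(isinstance(text, NavigableString))  # True
print(text)                               # plain text

# <div> has two children ("\n" and <a>), so .string is None
print(soup.div.string)                    # None

# get_text() gathers the text of all descendants instead
collected = soup.div.get_text(strip=True)
print(collected)                          # link
```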

2.2.3 Comment object

Represents comments in the HTML. It is a special kind of NavigableString, so it shares NavigableString's behavior.
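One way to pick out comments is to filter string nodes by type. The sketch below assumes Python 3 and Beautiful Soup 4.4.0 or later (for the string argument); the markup is made up.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Made-up markup containing one comment and one ordinary text node
soup = BeautifulSoup("<p><!-- hidden note -->visible</p>", "html.parser")

# Pass a function as the string filter and keep only Comment nodes
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)  # [' hidden note ']

# A Comment is also a NavigableString
print(isinstance(comments[0], NavigableString))  # True
```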

2.3 Navigating the tree

2.3.1 Child nodes

Accessing the children attribute on a tag yields all of that tag's direct children:

from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://blog.orisonchan.cc")
bsObj = BeautifulSoup(html.read(), 'html.parser')
for child in bsObj.find(name='h1', attrs={'class': 'post-title'}).children:
    print(child)

Result:

<a class="post-title-link" href="/2018/08/14/43/" itemprop="url">常見“樹”概念解析（1）</a>

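2.3.2 Sibling nodes

The next_siblings and previous_siblings attributes iterate over all following or preceding siblings.
next_sibling and previous_sibling return the single next or previous sibling.

Note that the immediately adjacent sibling is quite likely not a Tag at all, but a NavigableString (a string node) such as a newline or a punctuation mark.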
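The newline pitfall can be seen in a small Python 3 sketch with made-up markup. Filtering by type, or using find_next_sibling(), are two ways to skip over the string nodes.

```python
from bs4 import BeautifulSoup, Tag

# Made-up list; the "\n" between items becomes string nodes in the tree
html = "<ul>\n<li>one</li>\n<li>two</li>\n<li>three</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.li

# The direct next sibling is the newline string, not the second <li>
direct = first.next_sibling
print(isinstance(direct, Tag))  # False

# Keep only Tag siblings
names = [s.get_text() for s in first.next_siblings if isinstance(s, Tag)]
print(names)  # ['two', 'three']

# find_next_sibling() skips string nodes and returns the next matching tag
nearest = first.find_next_sibling("li")
print(nearest.get_text())  # two
```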

2.3.3 Parent nodes

The parent attribute returns an element's parent node.
The parents attribute walks upward and yields all of its ancestors in turn.
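A short Python 3 sketch on made-up markup: parent gives the enclosing tag, and parents climbs all the way up to the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Made-up nested markup
soup = BeautifulSoup(
    "<html><body><div><p><a href='/x/'>link</a></p></div></body></html>",
    "html.parser",
)
link = soup.a

print(link.parent.name)  # p

# parents yields every ancestor, ending with the BeautifulSoup object
ancestors = [p.name for p in link.parents]
print(ancestors)  # ['p', 'div', 'body', 'html', '[document]']
```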

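2.4 Filters

2.4.1 Strings

This is the bsObj.find_all("a") and bsObj("a") form mentioned above.

2.4.2 Lists of tag names

A call such as bsObj.find_all(["a", "p"]) looks for both a tags and p tags.

2.4.3 True

True matches any tag, but does not return string nodes.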
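The list and True filters can be compared side by side in a small Python 3 sketch on made-up markup. Results come back in document order.

```python
from bs4 import BeautifulSoup

# Made-up markup with four tags and three text nodes
html = "<div><p>intro</p><a href='/x/'>link</a><span>extra</span></div>"
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names
either = [t.name for t in soup.find_all(["a", "p"])]
print(either)  # ['p', 'a']

# True matches every tag, but none of the text nodes
everything = [t.name for t in soup.find_all(True)]
print(everything)  # ['div', 'p', 'a', 'span']
```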

2.4.4 Regular expressions

Import re (regular expressions), which ships with Python.

Example:

from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://blog.orisonchan.cc/2018/08/14/43")
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.find_all(name='img', attrs={'src': re.compile(r"[a-z]*\.png")})[1])

Result:

<img alt="red-black-tree" src="/uploads/2018/08/red-black-tree.png"/>

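2.4.5 Custom functions as arguments

This one is hard to put into words: you pass find_all() a function that takes a tag and returns a truthy value for the tags you want to keep. The official documentation has the details if you are interested; here is a demo: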

from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://blog.orisonchan.cc/2018/08/14/43")
bsObj = BeautifulSoup(html.read(), 'html.parser')


def png_image(tag):
    # Use .get() so <img> tags without a src attribute do not raise KeyError
    return tag.name == "img" and re.search(r"[a-z]*tree\.png", tag.get("src", ""))


for img in bsObj.find_all(png_image):
    print(img)

The result contains 3 of the 4 img tags in that post:

<img alt="binary-search-tree" src="/uploads/2018/08/binary-search-tree.png"/>
<img alt="red-black-tree" src="/uploads/2018/08/red-black-tree.png"/>
<img alt="sentinel-in-list" src="/uploads/2018/08/b-tree.png"/>
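Both the regular-expression filter and the function filter can be checked offline. The following Python 3 sketch uses made-up image paths, and is_tree_png is a hypothetical name. It also includes an img tag without a src attribute to show why the function reads the attribute with .get().

```python
import re
from bs4 import BeautifulSoup

# Made-up markup; the last <img> deliberately has no src attribute
html = """
<img alt="tree" src="/uploads/red-black-tree.png"/>
<img alt="list" src="/uploads/linked-list.png"/>
<img alt="photo" src="/uploads/photo.jpg"/>
<img alt="no source"/>
"""
soup = BeautifulSoup(html, "html.parser")

# Regex filter: src must end in ".png" (note the escaped dot)
png_images = soup.find_all("img", attrs={"src": re.compile(r"\.png$")})
png_alts = [img["alt"] for img in png_images]
print(png_alts)  # ['tree', 'list']


def is_tree_png(tag):
    """Function filter: keep <img> tags whose src ends with 'tree.png'."""
    return tag.name == "img" and tag.get("src", "").endswith("tree.png")


tree_alts = [img["alt"] for img in soup.find_all(is_tree_png)]
print(tree_alts)  # ['tree']
```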
