Web page parsing methods

1. BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

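Getting attributes:

soup.p.attrs  # returns all attributes as a dict

soup.p.attrs['name']  # gets the specified attribute

soup.p['name']  # same lookup, using dictionary-style access on the tag

Getting text:

soup.p.string  # via the property

soup.p.get_text()  # via the method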
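The attribute and text accessors above can be tried on a small document. This is a minimal sketch: the markup is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
html = '<p name="intro" class="lead big">Hello <b>world</b></p>'

# Assumption: 'html.parser' is used here; the notes above use 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

all_attrs = soup.p.attrs        # dict of every attribute on the first <p>
name_attr = soup.p['name']      # same value as soup.p.attrs['name']
plain_text = soup.p.get_text()  # text of the tag and all of its descendants
single_string = soup.p.string   # None here, because <p> has more than one child
bold_string = soup.b.string     # <b> has exactly one text child

print(all_attrs)
print(name_attr, plain_text, single_string, bold_string)
```

Note that class comes back as a list because it is a multi-valued attribute, and that .string only returns text when the tag has a single child.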

Getting direct children

soup.p.contents  # returns all child nodes as a list

soup.p.children  # returns an iterator over the child nodes; loop over it to output them

Getting descendants

soup.p.descendants  # returns a generator over all descendants: the children and everything below them

Getting parent nodes

soup.a.parent

soup.a.parents  # gets all ancestors

Getting sibling nodes

soup.a.next_sibling

soup.a.previous_sibling

soup.a.next_siblings

soup.a.previous_siblings
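A short sketch of the navigation properties listed above. The markup is a made-up example written without whitespace between tags, so that the siblings are tags rather than whitespace strings; 'html.parser' is again an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between the tags.
html = ('<div id="box"><p>'
        '<a href="/a">one</a><a href="/b">two</a><a href="/c">three</a>'
        '</p></div>')
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
child_names = [child.name for child in p.children]   # direct children only
descendant_count = len(list(p.descendants))          # tags plus their text nodes

first_link = soup.a
parent_name = first_link.parent.name                 # immediate parent
ancestor_names = [tag.name for tag in first_link.parents]  # up to the document root

next_href = first_link.next_sibling['href']          # the sibling right after
later_hrefs = [a['href'] for a in first_link.next_siblings]
previous = first_link.previous_sibling               # nothing comes before the first <a>

print(child_names, descendant_count, parent_name, ancestor_names)
print(next_href, later_hrefs, previous)
```

With real pages, whitespace between tags usually shows up as text nodes among the siblings, so checking the node type before indexing is a sensible precaution.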

Finding elements

def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):  # findAll is the same function

Here name is the tag name, such as ul or li; an attrs argument can also be passed.

Use the text argument to match a node's text; pass either a string or a regular expression.

soup.find_all(text=re.compile('link'))  # returns a list of the text nodes that contain 'link'
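The find_all arguments described above can be sketched as follows. The markup and class names are made up for illustration. The sketch passes string= rather than text=, since newer Beautiful Soup releases use string as the name of that argument; this assumes a reasonably recent bs4 install.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list markup used only for this example.
html = '''
<ul class="list" id="list-1">
  <li class="element">first link</li>
  <li class="element">second link</li>
  <li class="other">plain item</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find_all('li')                                # match on tag name
by_attrs = soup.find_all('li', attrs={'class': 'element'})   # match on attributes
limited = soup.find_all('li', limit=1)                       # stop after one result
by_text = soup.find_all(string=re.compile('link'))           # match on node text
first = soup.find('li')                                      # first match only

print(len(by_name), [li.get_text() for li in by_attrs])
print(len(limited), by_text, first.get_text())
```

When only text is matched, the result is a list of strings rather than tags, which is why by_text prints as plain text.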

Other methods

find()

find_parents()

find_parent()

find_next_sibling()

find_next_siblings()

find_previous_siblings()

find_previous_sibling()

find_all_next()

find_next()

find_all_previous()

find_previous()

BeautifulSoup also provides CSS selectors

soup.select('ul li')
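A minimal sketch of CSS selection with select() and select_one(). The ids and class names are hypothetical, and 'html.parser' is used as an assumption to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the selector examples.
html = ('<div id="container"><ul class="list">'
        '<li class="item active">a</li><li class="item">b</li>'
        '</ul></div>')
soup = BeautifulSoup(html, 'html.parser')

items = soup.select('ul li')                         # every <li> under a <ul>
active = soup.select('#container .list li.active')   # id, class and tag combined
first_item = soup.select_one('ul li')                # first match, or None

print([li.get_text() for li in items])
print(len(active), first_item.get_text())
```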

2. Using pyquery

from pyquery import PyQuery as pq

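A document can be initialized in the following three ways:

doc = pq(html)  # from an HTML string

doc = pq(url=url)  # from a URL

doc = pq(filename=filename)  # from a file name

print(doc('#container .list li'))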

Getting child nodes:

items = doc('.list')

items.find('li')  # find() returns all matching descendants; to get only direct children, use children()

items.children()  # a selector can be passed to filter the children, for example:

items.children('.active')

Getting parent nodes

items = doc('.list')

items.parent() and items.parents()

To get a specific ancestor, use items.parents('.wrap')

Getting sibling nodes

items.siblings()  # gets all sibling nodes

items.siblings('.active')  # gets the siblings that match the selector

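Outputting elements

1. A single element can be printed directly or converted to a string:

li = doc('.item1')

print(li) or print(str(li))

2. If there are multiple nodes, iterate over them to output each one:

doc('li').items()  # returns a generator. Note that items() here is the PyQuery method name; it is unrelated to the variable items in the code above, which was just an arbitrary name.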

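Getting attributes

a = doc('.item')

There are two ways to get an attribute:

print(a.attr('href'))

print(a.attr.href)

If the selection matches several elements, attr() only returns the value for the first one, so iterate to output each:

links = doc('li').items()

for a in links:
    print(a.attr('href'))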

Getting text

a = doc('.item')

print(a.text())  # gets the inner text

print(a.html())  # gets the HTML inside the node

Note that html() returns the inner HTML of the first matched node only, while text() returns the plain text of all matched nodes as a single string joined by spaces.

Other methods

addClass()  removeClass()  remove()

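Modifying attribute values

li = doc('.item')

li.attr('name', 'link')

li.text('changed')

li.html('<span> ahndagdaf</span>')

If attr() is called with one argument (the attribute name), it gets that attribute's value; with two arguments it sets the value. Likewise, text() and html() get the content when called without arguments and assign it when called with one.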

http://www.w3school.com.cn/css/index.asp

Pseudo-class selectors are supported

doc = pq(html)

li = doc('li:first-child')

first-child  last-child  nth-child(2)  gt(2)  nth-child(2n)  contains(second)

3. lxml (xpath)

from lxml import etree

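A document can be created in the following two ways:

html = etree.HTML(text)  # use when you already have the HTML as a string

html = etree.parse('./test.html', etree.HTMLParser())  # use when you have a file name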

Getting attribute values and text values

Now XPath can be used for matching: @ selects an attribute and text() selects text.

result = html.xpath('//li/a/@href')

html.xpath('//li[@class="item"]')

html.xpath('//li[@class="item"]/text()')

When an attribute holds several values, use the contains() function:

result = html.xpath('//li/a[contains(@class, "li")]/p/text()')

To match on several attributes, combine the conditions with and:

result = html.xpath('//li[contains(@class, "li") and @name="tang"]/a/text()')

Selecting by position

li[1]

li[last()]

li[position()<3]

li[last()-2]
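The XPath patterns above, including the positional predicates, can be tried together in a sketch like the following. The markup, class names and the name attribute are made up for illustration.

```python
from lxml import etree

# Hypothetical markup with three list items.
text = '''
<ul>
  <li class="item" name="tang"><a href="link1.html">first</a></li>
  <li class="item li-second"><a href="link2.html">second</a></li>
  <li class="other"><a href="link3.html">third</a></li>
</ul>
'''
html = etree.HTML(text)

hrefs = html.xpath('//li/a/@href')                         # attribute values
exact = html.xpath('//li[@class="item"]/a/text()')         # class must equal "item"
partial = html.xpath('//li[contains(@class, "item")]/a/text()')  # class contains "item"
both = html.xpath('//li[contains(@class, "item") and @name="tang"]/a/text()')

first = html.xpath('//li[1]/a/text()')                     # positions start at 1
last = html.xpath('//li[last()]/a/text()')
first_two = html.xpath('//li[position()<3]/a/text()')

print(hrefs, exact, partial, both)
print(first, last, first_two)
```

The exact match misses the second item because its class attribute is "item li-second", which is the case contains() is meant to handle.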

http://www.w3school.com.cn/xpath/xpath_functions.asp

4. Parsing with Selector

In scrapy, Selector can be imported in two ways:

One is from scrapy.selector import Selector

The other is from scrapy import Selector

Using a selector takes three steps:

Import the selector: from scrapy.selector import Selector

Create a selector instance: selector = Selector(response=response)

Use the selector: selector.xpath() or selector.css()

A Scrapy selector is a Selector instance constructed from text or from a TextResponse. It automatically picks the best parsing rules (XML vs HTML) based on the input type:

>>> from scrapy.selector import Selector

>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'

>>> Selector(text=body).xpath('//span/text()').extract()
['good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

>>> Selector(response=response).xpath('//span/text()').extract()
['good']

Other methods

Selectors also have a .re() method, which applies a regular expression and returns a list of unicode strings.


2018-06-24
