Using the page http://blog.jobbole.com/110691/ as an example, we extract the following:

Goal

XPath basics:

XPath node relationships:

Parent node: the node one level up
Child node: the node one level down
Sibling node: a node that shares the same parent
Ancestor node: the parent, the grandparent, and so on
Descendant node: the children, the grandchildren, and so on

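XPath syntax

Expression	Description
article	Selects all article elements that are children of the current node
/article	Selects the root element article
article/a	Selects all a elements that are children of article
//div	Selects all div elements, wherever they appear in the document
article//div	Selects all div elements that are descendants of article, at any depth below it
//@class	Selects all attributes named class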
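
The path expressions in the table can be tried out without Scrapy. Below is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath. The sample document and the variable names are invented for illustration, and relative paths from the root element stand in for absolute ones such as /article.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the target page.
DOC = """
<article>
  <a href="/home">home</a>
  <div class="header"><a href="/title">title</a></div>
  <div class="body">
    <section><div class="note">nested</div></section>
  </div>
</article>
"""

root = ET.fromstring(DOC)  # root is the <article> element

# article/a: only the a elements that are direct children of article
child_links = [a.get("href") for a in root.findall("a")]

# article//a: a elements at any depth below article
all_links = [a.get("href") for a in root.findall(".//a")]

# article//div: div elements at any depth below article
div_classes = [d.get("class") for d in root.findall(".//div")]

# //@class is not supported by ElementTree, so collect the attribute by hand
class_values = [e.get("class") for e in root.iter() if "class" in e.attrib]

print(child_links)
print(all_links)
print(div_classes)
print(class_values)
```

In the Scrapy shell the same expressions can be passed directly to response.xpath(), which implements full XPath 1.0.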

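Expression	Description
/article/div[1]	Selects the first div element that is a child of article
/article/div[last()]	Selects the last div element that is a child of article
/article/div[last()-1]	Selects the second-to-last div element that is a child of article
//div[@lang]	Selects all div elements that have a lang attribute
//div[@lang='eng']	Selects all div elements whose lang attribute equals eng
/div/*	Selects all child elements of the root div element
//*	Selects all elements
//div[@*]	Selects all div elements that have at least one attribute
//div/a | //div/p	Selects all a and p elements that are children of div elements
//span | //ul	Selects all span and ul elements in the document
article/div/p | //span	Selects all p elements that are children of a div child of article, plus all span elements in the document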
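
The predicates in this table can be checked the same way. The sketch below again uses the standard-library xml.etree.ElementTree on an invented document; the helper texts() and all variable names are hypothetical. ElementTree has no | operator, so the union is only approximated by concatenating two queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the target page.
DOC = """
<article>
  <div lang="eng">first</div>
  <div lang="fr">second</div>
  <div>third</div>
  <p>para</p>
</article>
"""

root = ET.fromstring(DOC)  # root is the <article> element


def texts(elements):
    """Return the text of each element, for compact printing."""
    return [e.text for e in elements]


first_div = texts(root.findall("div[1]"))              # /article/div[1]
last_div = texts(root.findall("div[last()]"))          # /article/div[last()]
second_last = texts(root.findall("div[last()-1]"))     # /article/div[last()-1]
with_lang = texts(root.findall(".//div[@lang]"))       # //div[@lang]
eng_only = texts(root.findall(".//div[@lang='eng']"))  # //div[@lang='eng']
child_tags = [e.tag for e in root.findall("*")]        # /article/*

# Approximation of //div | //p: two queries joined together. Unlike a real
# XPath union, the result is not merged into document order.
div_or_p = texts(root.findall(".//div") + root.findall(".//p"))

for name in ("first_div", "last_div", "second_last", "with_lang",
             "eng_only", "child_tags", "div_or_p"):
    print(name, globals()[name])
```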

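Extracting the content

In the project from the previous section, change start_urls to the address above.
To make debugging easier, run the following command in cmd to enter the Scrapy debugging shell:
scrapy shell http://blog.jobbole.com/110691/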

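Getting the title

Enter either
response.xpath('//*[@id="post-110691"]/div[1]/h1/text()').extract()
or
response.xpath('//*[@class="entry-header"]/h1/text()').extract()
Both return a list containing the article title.
The first expression locates the element by its id, which is specific to this post; the second uses the class of the title's container and is therefore more general. Indexing the list (for example with [0]) then gives the title itself. See the tables above for the syntax.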

Getting the date

response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()
Note that this returns only the text directly under the p tag; text inside child nodes is not included. Clean the result with strip() and replace().
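
As a rough illustration of that cleaning step, the sketch below applies replace() and strip() to a made-up raw string. The sample value, its date, and the %Y/%m/%d format are assumptions; the text node on the real page may look different.

```python
from datetime import datetime

# Assumed shape of the first text node under the p tag; the exact
# whitespace and date on the real page may differ.
raw_date = "\r\n\r\n            2017/03/18 ·  "

# Drop the separator dot, then trim the surrounding whitespace.
create_date = raw_date.replace("·", "").strip()
print(create_date)

# Optional: convert to a date object, falling back to None if the
# assumed format does not match.
try:
    parsed_date = datetime.strptime(create_date, "%Y/%m/%d").date()
except ValueError:
    parsed_date = None
print(parsed_date)
```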


Getting the upvote count

When a class attribute holds several class names, for example class=" btn-bluet-bigger href-style vote-post-up register-user-only ", and only one of them is needed to locate the element, use contains().
response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
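
One thing worth knowing is that contains() is a plain substring test. The sketch below mimics the three matching strategies in plain Python on invented markup modelled on the class string quoted above. The second span and both counts are hypothetical, and ElementTree is used only because it ships with Python (it does not implement contains() itself).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the vote counts are made up.
DOC = """
<div>
  <span class=" btn-bluet-bigger href-style vote-post-up register-user-only "><h10>2</h10></span>
  <span class="vote-post-upper"><h10>99</h10></span>
</div>
"""
root = ET.fromstring(DOC)

# [@class='vote-post-up'] compares the whole attribute value, so it finds nothing.
exact = root.findall(".//span[@class='vote-post-up']")

# Same idea as contains(@class, 'vote-post-up'): a substring test, which
# also matches the unrelated class name vote-post-upper.
substring = [s for s in root.iter("span")
             if "vote-post-up" in s.get("class", "")]

# Stricter alternative: split the value on whitespace and compare whole names.
by_token = [s for s in root.iter("span")
            if "vote-post-up" in s.get("class", "").split()]

print(len(exact), len(substring), len(by_token))
print(by_token[0].find("h10").text)
```

On a page where no other class name shares the prefix, the substring test is sufficient.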

Getting the bookmark count

fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
The result is a string such as 7 收藏 ("7 bookmarks"), so a regular expression is needed to extract the number.

import re

# Lazy .*? so that (\d+) captures the whole number, not just its last digit
match_re = re.match(r".*?(\d+).*", fav_nums)
if match_re:
    fav_nums = match_re.group(1)
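
Whether the leading .* is greedy or lazy matters as soon as the count has more than one digit. A small sketch with a made-up two-digit sample string shows the difference:

```python
import re

# Hypothetical two-digit sample; "收藏" means "bookmarks".
sample = " 70 收藏"

greedy = re.match(r".*(\d+).*", sample)
lazy = re.match(r".*?(\d+).*", sample)

# The greedy .* swallows as much as it can and leaves a single digit for the group.
print(greedy.group(1))

# The lazy .*? stops at the first digit, so \d+ takes the whole number.
print(lazy.group(1))
```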

Getting the comment count

comment_nums = response.xpath("//a[@href='#article-comment']/text()").extract()[0]
Once the text has been extracted, clean it with the same regular expression.

Getting the body

content = response.xpath("//div[@class='entry']").extract()[0]
Note that text() is not appended here, so the result is the element's full HTML rather than only its text.
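
The difference between selecting a node and selecting its text can be illustrated with the standard library as well. This is only an analogy on an invented fragment, not what Scrapy does internally, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the article body.
DOC = '<div class="entry"><p>Hello <b>world</b></p></div>'
entry = ET.fromstring(DOC)

# Selecting the node itself: the serialized element, tags included.
as_markup = ET.tostring(entry, encoding="unicode")

# Roughly what text() looks at: only text sitting directly inside the
# element. Here the div has no text of its own, so this is None.
direct_text = entry.text

# All text below the element, with the tags dropped.
all_text = "".join(entry.itertext())

print(as_markup)
print(direct_text)
print(all_text)
```

Keeping the markup preserves the article's formatting, which is presumably why the body is stored as HTML here.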

Getting the tags

All the tags sit inside a elements, so the approach is the same as for the date with one extra a step added to the path.
tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
This returns:
['其他', ' 3 評論 ', '創業', '程序員']


The entry 3 評論 ("3 comments") is not a tag and has to be filtered out:

tag_list = [e for e in tag_list if not e.strip().endswith("評論")]
tags = ",".join(tag_list)

The cleaned result can then be stored in the tags field.

        # Extraction code for the spider's callback; requires `import re` at the top of the module
        title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
        create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].replace('·', '').strip()
        fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
        # Lazy .*? so that (\d+) captures the whole number
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)
        comment_nums = response.xpath("//a[@href='#article-comment']/text()").extract()[0]
        # No text() here: keep the full HTML of the article body
        content = response.xpath("//div[@class='entry']").extract()[0]
        tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        # Drop the "N 評論" (comments) link, which is not a tag
        tag_list = [e for e in tag_list if not e.strip().endswith("評論")]
        tags = ",".join(tag_list)
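
The code above raises IndexError when an expression matches nothing, and it leaves fav_nums as the raw string when no digits are found. One possible way to harden it is sketched below. The helpers first() and extract_int() are hypothetical names, not part of Scrapy, and re.search is swapped in for re.match so that no leading .*? is needed.

```python
import re


def first(values, default=""):
    """Return the first item of a list, or default if the list is empty."""
    return values[0] if values else default


def extract_int(text, default=0):
    """Return the first run of digits in text as an int, or default."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


# Sample strings modelled on the values shown earlier ("收藏" = bookmarks,
# "評論" = comments); the last two cases simulate missing data.
print(extract_int(" 7 收藏"))
print(extract_int(" 3 評論"))
print(extract_int(" 收藏"))
print(extract_int(first([])))
```

With these helpers, a line such as fav_nums = extract_int(first(response.xpath(...).extract())) would yield 0 instead of failing when the element is missing.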


Python crawler study notes 7: using XPath
