1. Scrapy Crawlers, Static Page Scraping, Part 3: spider.py Exercises

Exercise 1: Scrape the content of a single page
URL: http://stackoverflow.com/questions?sort=votes

Note: run a standalone spider with `scrapy runspider stackoverflow.py`.
To write the scraped items to a file, use `scrapy runspider stackoverflow.py -o stackoverflow.csv`.

```python
# -*- coding: utf-8 -*-
import scrapy

class StackOverFlowSpider(scrapy.Spider):
    name = "stackoverflow"  # the name you refer to when running the spider in a project
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    # parse() is the default callback that parses each downloaded response
    def parse(self, response):
        for question in response.xpath('//div[@class="question-summary"]'):
            title = question.xpath('.//div[@class="summary"]/h3/a/text()').extract_first()
            # hrefs here are relative, so make them absolute against the page URL
            links = response.urljoin(question.xpath('.//div[@class="summary"]/h3/a/@href').extract_first())
            # default='' guards against extract_first() returning None before .strip()
            content = question.xpath('.//div[@class="excerpt"]/text()').extract_first(default='').strip()
            votes = question.xpath('.//span[@class="vote-count-post high-scored-post"]/strong/text()').extract_first()
            # votes = question.xpath('.//strong/text()').extract_first()  # looser alternative
            answers = question.xpath('.//div[@class="status answered-accepted"]/strong/text()').extract_first()

            yield {
                'title': title,
                'links': links,
                'content': content,
                'votes': votes,
                'answers': answers,
            }
```
The output written to the file looks like this:

![2](http://upload-images.jianshu.io/upload_images/5076126-29c8906471d5346a.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
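Besides `scrapy runspider`, a spider like this can also be driven from a plain Python script. The sketch below is not part of the original exercise; it assumes Scrapy ≥ 2.1 (for the `FEEDS` setting) and that `StackOverFlowSpider` from above is in scope:

```python
from scrapy.crawler import CrawlerProcess

# assumption: StackOverFlowSpider is the class defined above
process = CrawlerProcess(settings={
    # FEEDS replaces the older FEED_URI/FEED_FORMAT pair (Scrapy >= 2.1)
    'FEEDS': {'stackoverflow.csv': {'format': 'csv'}},
})
process.crawl(StackOverFlowSpider)
process.start()  # blocks until the crawl finishes
```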

**Exercise 2: Give the spider a list of URLs**
This is the page-number kind of pagination (i.e., you hand the spider a list of URLs to crawl): every page number appears in the URL, so the full list of pages can be generated up front.
URL: http://www.cnblogs.com/pick/#p1

![3](http://upload-images.jianshu.io/upload_images/5076126-0f1730c18b9ddb41.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

```python
# -*- coding: utf-8 -*-
import scrapy

class CnblogSpider(scrapy.Spider):
    name = "cnblogs"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 3)]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            title = article.xpath('.//div[@class="post_item_body"]/h3/a/text()').extract_first()
            # if a link is relative, complete it with response.urljoin()
            # (see the urljoin sketch after this block)
            title_link = article.xpath('.//div[@class="post_item_body"]/h3/a/@href').extract_first()
            content = article.xpath('.//p[@class="post_item_summary"]/text()').extract_first()
            author = article.xpath('.//div[@class="post_item_foot"]/a/text()').extract_first()
            author_link = article.xpath('.//div[@class="post_item_foot"]/a/@href').extract_first()
            # default='' guards against extract_first() returning None before .strip()
            comment = article.xpath('.//span[@class="article_comment"]/a/text()').extract_first(default='').strip()
            view = article.xpath('.//span[@class="article_view"]/a/text()').extract_first()

            print(title)
            print(title_link)
            print(content)
            print(author)
            print(author_link)
            print(comment)
            print(view)

            yield {
                'title': title,
                'title_link': title_link,
                'content': content,
                'author': author,
                'author_link': author_link,
                'comment': comment,
                'view': view,
            }
```
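A note on the `response.urljoin()` comment above: Scrapy's `Response.urljoin(href)` is simply `urllib.parse.urljoin(response.url, href)`, so a relative href is resolved against the URL of the page it was found on. A small stand-alone sketch with hypothetical paths:

```python
from urllib.parse import urljoin  # response.urljoin(href) == urljoin(response.url, href)

base = 'http://www.cnblogs.com/pick/'
print(urljoin(base, '/u/12345'))            # -> http://www.cnblogs.com/u/12345
print(urljoin(base, 'article.html'))        # -> http://www.cnblogs.com/pick/article.html
print(urljoin(base, 'http://example.com'))  # absolute URLs pass through unchanged
```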
The output written to the file looks like this (screenshots omitted):

Key technique: put an attribute predicate on the outermost element for precise, id-like targeting; on nested child tags a predicate is often unnecessary, and whether you select an attribute or text() depends on whether you need the attribute value or the text content.
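As a concrete (hypothetical) illustration of that tip, using the cnblogs markup from Exercise 2 as if inside a `parse()` callback:

```python
# precise, id-like targeting: the attribute predicate pins down the outer node
item = response.xpath('//div[@class="post_item_body"]')

# child tags often need no predicate; pick @href or text() depending on
# whether you want the attribute value or the text content
href = item.xpath('.//h3/a/@href').extract_first()   # attribute value
text = item.xpath('.//h3/a/text()').extract_first()  # text content
```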

Exercise 3: Pagination again, but this time there is only a "next" link, for when the URL contains no page numbers such as 1, 2, and so on.

```python
# -*- coding: utf-8 -*-
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            content = quote.xpath('.//span[@class="text"]/text()').extract_first()
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()

            yield {
                'content': content,
                'author': author,
            }

        # extract the link to the next page
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # request the next page, reusing parse() as the callback
            yield scrapy.Request(next_page, callback=self.parse)
```
To move on to the next page, next_page holds the extracted link and a new Request is yielded with parse() itself as the callback, so each response is handled by the same method and the crawl proceeds recursively until no "next" link remains.
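On Scrapy 1.4 and later, the same tail of `parse()` can be written with `response.follow()`, which accepts the relative href directly and performs the urljoin step itself; a minimal sketch:

```python
# inside parse(), after the for-loop over quotes
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
if next_page is not None:
    # response.follow() resolves the relative link itself (Scrapy >= 1.4)
    yield response.follow(next_page, callback=self.parse)
```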



