Exercise 1: Scrape the content of a single page
URL: http://stackoverflow.com/questions?sort=votes
The page looks like this:

Note: run a standalone spider with `scrapy runspider stackoverflow.py`
To write the output to a file: `scrapy runspider stackoverflow.py -o stackoverflow.csv`
```python
# -*- coding: utf-8 -*-
import scrapy


class StackOverFlowSpider(scrapy.Spider):
    # The spider's name is what you use to run it inside a project
    name = "stackoverflow"
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    # parse() is the default callback that extracts data from the response
    def parse(self, response):
        for question in response.xpath('//div[@class="question-summary"]'):
            title = question.xpath('.//div[@class="summary"]/h3/a/text()').extract_first()
            links = response.urljoin(question.xpath('.//div[@class="summary"]/h3/a/@href').extract_first())
            content = question.xpath('.//div[@class="excerpt"]/text()').extract_first().strip()
            votes = question.xpath('.//span[@class="vote-count-post high-scored-post"]/strong/text()').extract_first()
            # alternatively: votes = question.xpath('.//strong/text()').extract_first()
            answers = question.xpath('.//div[@class="status answered-accepted"]/strong/text()').extract_first()
            yield {
                'title': title,
                'links': links,
                'content': content,
                'votes': votes,
                'answers': answers,
            }
```
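The `links` field above is built with `response.urljoin()`, which completes a relative `href` against the page's own URL. Its behaviour matches the standard library's `urllib.parse.urljoin`; a minimal sketch (the example paths are made up):

```python
from urllib.parse import urljoin

base = "http://stackoverflow.com/questions?sort=votes"

# A relative href scraped from the page becomes an absolute URL:
print(urljoin(base, "/questions/123/example-question"))
# -> http://stackoverflow.com/questions/123/example-question

# A link that is already absolute is returned unchanged:
print(urljoin(base, "http://example.com/page"))
# -> http://example.com/page
```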
The output written to the file looks like this:

**Exercise 2: Scrape a given list of URLs**
This is the "numbered pages" type: every page number appears in the URL, so you can build a list of URLs up front and scrape each page.
URL: http://www.cnblogs.com/pick/#p1

```python
# -*- coding: utf-8 -*-
import scrapy


class CnblogSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["www.cnblogs.com"]  # domains only, without the http:// scheme
    start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 3)]

    def parse(self, response):
        for article in response.xpath('//div[@class="post_item"]'):
            title = article.xpath('.//div[@class="post_item_body"]/h3/a/text()').extract_first()
            # if a link is incomplete, complete it with response.urljoin()
            title_link = article.xpath('.//div[@class="post_item_body"]/h3/a/@href').extract_first()
            content = article.xpath('.//p[@class="post_item_summary"]/text()').extract_first()
            author = article.xpath('.//div[@class="post_item_foot"]/a/text()').extract_first()
            author_link = article.xpath('.//div[@class="post_item_foot"]/a/@href').extract_first()
            comment = article.xpath('.//span[@class="article_comment"]/a/text()').extract_first().strip()
            view = article.xpath('.//span[@class="article_view"]/a/text()').extract_first()
            print(title)
            print(title_link)
            print(content)
            print(author)
            print(author_link)
            print(comment)
            print(view)
            yield {
                'title': title,
                'title_link': title_link,
                'content': content,
                'author': author,
                'author_link': author_link,
                'comment': comment,
                'view': view,
            }
```

The output written to the file looks like this:

Key technique: at the top of the XPath, add a class (or similar) attribute predicate for precise, id-like targeting; further down, the child tags don't always need attribute predicates — it depends on whether you are after an attribute or the text content.
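The same "precise at the top, loose below" idea can be shown without Scrapy, using the limited XPath subset in the standard library's `xml.etree.ElementTree` (the markup here is a made-up miniature of the cnblogs page; ElementTree has no `text()` step, so `.text` is used instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the cnblogs structure above.
html = """
<body>
  <div class='post_item'>
    <div class='post_item_body'>
      <h3><a href='/post/1'>First post</a></h3>
    </div>
  </div>
  <div class='sidebar'>
    <h3><a href='/ad'>Not a post</a></h3>
  </div>
</body>
"""

root = ET.fromstring(html)
# Precise at the top: the class attribute pins down the right container;
# below it, plain h3/a is enough because the context is already narrow.
for item in root.findall(".//div[@class='post_item']"):
    link = item.find(".//div[@class='post_item_body']/h3/a")
    print(link.text, link.get("href"))  # the sidebar link is never matched
```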
Exercise 3: Pagination again, but this time there is only a "next" link, and the URL contains no page numbers like 1, 2, and so on.


```python
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            content = quote.xpath('.//span[@class="text"]/text()').extract_first()
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            yield {
                'content': content,
                'author': author,
            }

        # parse the "next" link
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # request the next page, with parse() as the callback again
            yield scrapy.Request(next_page, callback=self.parse)
```
To parse the next page: `next_page` holds the link's URL, a new `Request` is yielded for it, and its callback is `parse` itself — so each response triggers the same parsing again, making pagination effectively a recursive process.
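Because the callback of the new `Request` is `parse` itself, the recursion is driven by Scrapy's engine rather than by a direct function call. A toy, Scrapy-free sketch of the same control flow (the `PAGES` dict is a made-up stand-in for real HTTP responses):

```python
# A toy model of Scrapy's request/callback loop: each "page" yields its
# items plus, optionally, a follow-up request handled by the same parse
# function. PAGES maps a URL to (items on that page, next-page URL).
PAGES = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], None),  # no "next" link on the last page
}

def parse(url):
    items, next_page = PAGES[url]
    yield from items
    if next_page is not None:
        # Scrapy would schedule a Request here; we simply recurse directly.
        yield from parse(next_page)

print(list(parse("/page/1")))  # -> ['quote A', 'quote B', 'quote C']
```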