Install Python
Python 2.7 was originally recommended; Python 3.7 was tested at the time and did not work.
Update: the latest Scrapy release (v1.6) supports Python 3.4+.
Install Scrapy
pip install scrapy
On macOS the installation may fail with an error saying the six package cannot be installed; this can be worked around by downloading the six source directly and importing it.
Create a new Scrapy project
scrapy startproject demo01
Understand the Scrapy project structure
demo01 => project root
│ ├── demo01 => main module
│ │ ├── spiders => spider (crawling) logic
│ │ │ ├── __init__.py
│ │ │ ├── demo01Spider.py => spider file created manually or via the CLI
│ │ ├── items.py => containers for the scraped data
│ │ ├── middlewares.py => middleware
│ │ ├── pipelines.py => item pipelines (post-processing of scraped items)
│ │ ├── settings.py => project settings
│ ├── scrapy.cfg => deploy/global configuration file
Define the Item
import scrapy

class Demo01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # article title
    title = scrapy.Field()
    # article link
    link = scrapy.Field()
    # author
    author = scrapy.Field()
    # note / summary
    memo = scrapy.Field()
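A scrapy.Item behaves much like a Python dict that only accepts its declared fields. The sketch below emulates that contract with a plain dict subclass so it can be run without Scrapy installed; it mimics the behavior, it is not Scrapy's actual implementation:

```python
# Minimal stand-in illustrating how an Item restricts assignment
# to declared fields (a sketch, not scrapy.Item itself).
class DictItem(dict):
    fields = ('title', 'link', 'author', 'memo')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super(DictItem, self).__setitem__(key, value)

item = DictItem()
item['title'] = 'Hello Scrapy'   # OK: declared field
print(item['title'])             # Hello Scrapy
```

Assigning to an undeclared key (e.g. `item['foo'] = 1`) raises a KeyError, which is exactly how Scrapy catches typos in field names early.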
The main spider logic
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from demo01.items import Demo01Item

# inherit from CrawlSpider
class Demo01Spider(CrawlSpider):
    # spider name, used by "scrapy crawl <name>"
    name = "demo01"
    # initial URLs to fetch
    start_urls = ['http://www.itdecent.cn/trending/monthly']
    url = 'http://www.itdecent.cn'

    # override the parent class's parse method
    def parse(self, response):
        selector = Selector(response)
        articles = selector.xpath('//ul[@class="note-list"]/li')
        for article in articles:
            # create a fresh item per article, otherwise every yield
            # would mutate and re-emit the same object
            item = Demo01Item()
            title = article.xpath('div[@class="content"]/a/text()').extract()
            print(title)
            link = article.xpath('a[@class="wrap-img"]/@href').extract()
            author = article.xpath(
                'div[@class="content"]/div/a[@class="nickname"]/text()').extract()
            memo = article.xpath('div[@class="content"]/p/text()').extract()
            item['title'] = title
            item['link'] = link
            item['author'] = author
            item['memo'] = memo
            # hand the populated item to the engine (pipelines / exporters)
            yield item
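The XPath expressions above assume a page layout like the hypothetical snippet below. As a quick sanity check that runs without Scrapy, the same path can be exercised with the standard library's limited XPath support in xml.etree (the markup here is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the structure the spider's XPaths assume.
sample = """<ul class="note-list">
  <li><div class="content"><a>Title One</a></div></li>
  <li><div class="content"><a>Title Two</a></div></li>
</ul>"""

root = ET.fromstring(sample)  # root is the <ul class="note-list"> element
# equivalent in spirit to //ul[@class="note-list"]/li
# followed by div[@class="content"]/a/text()
titles = [a.text for a in root.findall("li/div[@class='content']/a")]
print(titles)  # ['Title One', 'Title Two']
```

In a real session you would verify such expressions interactively with `scrapy shell <url>` and `response.xpath(...)` instead.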
Configure the export location in the settings file
# FEED export settings (note the lowercase format name)
FEED_URI = '/Users/gsp/Documents/jianshu-hot.csv'
FEED_FORMAT = 'csv'
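In newer Scrapy releases (2.1+) the single FEED_URI/FEED_FORMAT pair is deprecated in favor of the FEEDS dictionary, which supports several output targets at once. An equivalent setting would look like this:

```python
# settings.py (Scrapy 2.1+): one dict entry per output target
FEEDS = {
    '/Users/gsp/Documents/jianshu-hot.csv': {'format': 'csv'},
}
```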
Run the spider
scrapy crawl demo01
Error handling
At this point the run fails with a log message like:
HTTP status code is not handled or not allowed
Fix it as follows:
# add to settings.py
HTTPERROR_ALLOWED_CODES = [404, 403]
Run the command again
scrapy crawl demo01
It still fails: the log shows the requests being rejected by the server. This is caused by the target site's anti-scraping measures; the fix is to attach a random User-Agent header to each request:
1. Declare the user-agent pool in settings.py
# pool of random User-Agent strings
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
2. Enable the middleware in settings.py
# register the custom middleware and disable the built-in User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'demo01.middlewares.Demo01SpiderMiddleware': 403,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
3. Modify middlewares.py to set the request header
from demo01.settings import USER_AGENT_LIST
import random

class Demo01SpiderMiddleware(object):
    # ... other generated methods omitted ...
    def process_request(self, request, spider):
        # pick one User-Agent at random for every outgoing request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
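To see what process_request does without starting a crawl, the snippet below simulates it against a dummy request object. The FakeRequest class and the two-entry UA list are stand-ins invented for this sketch; a real Scrapy Request exposes a richer headers object:

```python
import random

# Stand-in for the pool declared in settings.py.
USER_AGENT_LIST = ['UA-one', 'UA-two']

class FakeRequest(object):
    """Dummy request exposing only the headers dict the middleware touches."""
    def __init__(self):
        self.headers = {}

def process_request(request):
    # same logic as Demo01SpiderMiddleware.process_request above
    ua = random.choice(USER_AGENT_LIST)
    if ua:
        # setdefault: do not overwrite a User-Agent set explicitly elsewhere
        request.headers.setdefault('User-Agent', ua)

req = FakeRequest()
process_request(req)
print(req.headers['User-Agent'])  # one of 'UA-one' / 'UA-two'
```

Because `setdefault` is used, a spider that sets its own User-Agent on a specific request keeps it; only requests without one get a random value.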
Run it again
scrapy crawl demo01

The resulting data set