Python learning notes 1 -- writing a web crawler with Scrapy

Scrapy documentation

Install Python

Python 2.7 was originally recommended here; Python 3.7 was tested and did not work at the time.
Update: the latest release, v1.6, supports Python 3.4+.

Install Scrapy

pip install scrapy

On macOS the install may fail with an error saying the six package cannot be installed; one fix is to download the six source directly and import it by hand.
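The usual cause is that pip tries to replace the copy of six bundled with the system Python, which it is not allowed to touch. Besides importing the source directly, a commonly suggested workaround (assuming pip is being used outside a virtualenv) is to skip the uninstall step:

```shell
# Tell pip not to try uninstalling the system-owned six first
pip install scrapy --ignore-installed six
```

Installing inside a virtualenv avoids the conflict entirely.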

Create a new Scrapy project

scrapy startproject demo01

Getting to know the Scrapy project structure

demo01 => project root
│ ├── demo01 => main module
│ │ ├── spiders => crawling logic
│ │ │ ├── __init__.py => package init file
│ │ │ ├── demo01Spider.py => file created by hand or via the CLI
│ │ ├── items.py => containers for the scraped data
│ │ ├── middlewares.py => middlewares
│ │ ├── pipelines.py => item pipelines
│ │ ├── settings.py => project settings
│ ├── scrapy.cfg => deployment configuration file

Define the Item

import scrapy

class demo01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title
    title = scrapy.Field()
    # link
    link = scrapy.Field()
    # author
    author = scrapy.Field()
    # memo
    memo = scrapy.Field()

The spider: main crawling logic

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from demo01.items import demo01Item

# inherit from CrawlSpider
class Demo01Spider(CrawlSpider):
    # spider name, used by the `scrapy crawl` command
    name = "demo01"
    # URLs the crawl starts from
    start_urls = ['http://www.itdecent.cn/trending/monthly']
    url = 'http://www.itdecent.cn'

    # override the parent class's parse method
    def parse(self, response):
        selector = Selector(response)
        articles = selector.xpath('//ul[@class="note-list"]/li')

        for article in articles:
            # create a fresh item per article so earlier yields are not overwritten
            item = demo01Item()
            title = article.xpath('div[@class="content"]/a/text()').extract()
            print(title)
            link = article.xpath('a[@class="wrap-img"]/@href').extract()
            author = article.xpath(
                'div[@class="content"]/div/a[@class="nickname"]/text()').extract()
            memo = article.xpath('div[@class="content"]/p/text()').extract()

            item['title'] = title
            item['link'] = link
            item['author'] = author
            item['memo'] = memo

            # hand the populated item over to the exporter/pipelines
            yield item
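The relative-XPath extraction above can be previewed without running the spider. The sketch below drives the same kind of paths with the standard library's xml.etree.ElementTree (which supports simple attribute predicates) against hypothetical markup shaped like what the spider expects -- the sample HTML and values are illustrative, not taken from the real site:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the list the spider's XPaths target.
SAMPLE = """
<ul class="note-list">
  <li>
    <a class="wrap-img" href="/p/abc123"></a>
    <div class="content">
      <a>Sample title</a>
      <div><a class="nickname">Sample author</a></div>
      <p>Sample memo</p>
    </div>
  </li>
</ul>
"""

def parse_articles(xml_text):
    root = ET.fromstring(xml_text)
    items = []
    # mirror the spider's loop: one dict per <li> article
    for article in root.findall('li'):
        items.append({
            'title': article.find('div[@class="content"]/a').text,
            'link': article.find('a[@class="wrap-img"]').get('href'),
            'author': article.find('div[@class="content"]/div/a[@class="nickname"]').text,
            'memo': article.find('div[@class="content"]/p').text,
        })
    return items

print(parse_articles(SAMPLE))
```

Note that ElementTree's `find` cannot select attributes the way `@href` does in Scrapy's XPath, so the sketch reads the attribute with `.get('href')` instead.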

Configure the export location in the settings file

# FEED export configuration
FEED_URI = '/Users/gsp/Documents/jianshu-hot.csv'
FEED_FORMAT = 'csv'
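If you would rather not hard-code the path in settings, the same export can be requested per run on the command line; Scrapy infers the format from the file extension:

```shell
scrapy crawl demo01 -o jianshu-hot.csv
```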

Run the crawler

 scrapy crawl demo01

Error handling

Running the crawl at this point fails with:

http status code is not handled or allowed

The fix is as follows:

# add to the settings file
HTTPERROR_ALLOWED_CODES = [404, 403]
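Setting this globally affects every spider in the project. If that is too broad, the scope can be narrowed to a single spider with the handle_httpstatus_list attribute, which has the same effect for that spider only:

```python
from scrapy.spiders import CrawlSpider

class Demo01Spider(CrawlSpider):
    name = "demo01"
    # per-spider alternative to the global HTTPERROR_ALLOWED_CODES setting
    handle_httpstatus_list = [404, 403]
```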

Run the command again:

  scrapy crawl demo01

It still fails, and the log shows the requests being rejected by the server. This is the target site's anti-crawling policy at work; the workaround is to attach a random User-Agent header to every request:

1. Declare the User-Agent pool in the settings file

# random User-Agent pool
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

2. Enable the middleware in the settings file

# register the custom middleware and disable the built-in User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'demo01.middlewares.Demo01SpiderMiddleware': 403,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

3. Modify the middlewares file to add the request header

from demo01.settings import USER_AGENT_LIST
import random

class Demo01SpiderMiddleware(object):
    # ... (generated boilerplate omitted)
    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
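The header logic can be sanity-checked without Scrapy by driving process_request with a minimal stand-in request object. FakeRequest below is purely illustrative, and a short list stands in for the USER_AGENT_LIST declared in settings.py:

```python
import random

# stand-in for the list declared in settings.py
USER_AGENT_LIST = ["UA-one", "UA-two"]

class FakeRequest(object):
    """Minimal stand-in for scrapy's Request: just a dict of headers."""
    def __init__(self):
        self.headers = {}

class Demo01SpiderMiddleware(object):
    def process_request(self, request, spider):
        # choose a random User-Agent and set it if none is present yet
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

request = FakeRequest()
Demo01SpiderMiddleware().process_request(request, spider=None)
print(request.headers['User-Agent'])
```

Because setdefault only fills the header when it is absent, a User-Agent set explicitly on a request would survive this middleware untouched.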

Run it once more:

 scrapy crawl demo01

The crawl now completes, and the scraped results land in the configured CSV file.
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml 下載即可。 安裝...
    慫恿的大腦閱讀 1,406評(píng)論 0 7
  • 你,現(xiàn)在還好嗎, 或許已經(jīng)不再為他心動(dòng)了。 偶爾這樣任性一次也好, 在深夜因?yàn)橄氲剿罂抟粓?chǎng)也好, 走在路上看到和...
    北林亞離閱讀 121評(píng)論 0 1
  • 當(dāng)你來(lái)到一個(gè)不曾到訪的城市,你所經(jīng)歷的感受,也許與城市無(wú)關(guān),你所看到的風(fēng)景,也許與你設(shè)想的、也許與你曾經(jīng)所了解到的...
    深林小屋閱讀 426評(píng)論 2 8
  • 在時(shí)光隧道中穿梭 只為尋你驚鴻一瞥 風(fēng)說(shuō)那里很美 永不停歇的 是對(duì)本體的追尋 可我留不住風(fēng)眼底的永恒 它如電似光 ...
    慕容歌閱讀 459評(píng)論 10 19

友情鏈接更多精彩內(nèi)容