CrawlSpider notes (a minimal spider sketch follows this list):
1. Creation: generate it with the crawl template: scrapy genspider -t crawl <name> <domain>
2. Inherited class: the spider inherits from CrawlSpider rather than scrapy.Spider
3. The parse method must not be overridden: CrawlSpider uses parse internally to apply its rules
4. parse_start_url: override this to process the responses from start_urls
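A minimal sketch tying the four points together; the spider name, domain, and the link-extractor pattern are placeholders:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DemoCrawlSpider(CrawlSpider):            # 2. inherits CrawlSpider, not scrapy.Spider
    name = "demo"                              # placeholder name
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["https://example.com/"]

    # Rules drive the crawl; CrawlSpider's own parse() stays untouched (point 3)
    rules = (
        Rule(LinkExtractor(allow=r"/page/\d+"), callback="parse_item", follow=True),
    )

    def parse_start_url(self, response):       # 4. handles responses from start_urls
        self.logger.info("start url: %s", response.url)
        return []

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```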
Anti-crawling measures:
- Header-based anti-crawling (build request headers properly): header fields (User-Agent, Referer, Cookie), common status codes, common request methods
- Cookie-based anti-crawling (cookie pool; storage in files or a database): how to obtain cookies, how to validate cookies, how to simulate login
- IP-based anti-crawling (proxies): how do proxies work? how to obtain proxies? how to test proxies? proxy pools
- Dynamically loaded pages (AJAX, JS, jQuery): Selenium? headless vs. headed browsers? Selenium methods (see the headless-browser sketch after this list)
- Data encryption (JS, app, web pages)
Downloader middleware: sits between the engine and the downloader. The generated middleware template (middlewares.py) exposes these hooks:
from scrapy import signals

class MyDownloaderMiddleware:  # the generated name is project-specific

    @classmethod
    def from_crawler(cls, crawler):
        # Used by Scrapy to create the middleware and hook up the spider_opened signal.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Every request passes through this method before it is handed to the downloader.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Every response passes through this method on its way back to the engine.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Handles exceptions raised while processing a request. Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
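A minimal sketch of putting these hooks to use: a downloader middleware that rotates the User-Agent and attaches a proxy inside process_request. The UA strings, proxy address, and module path are made-up placeholders:

```python
import random

class RandomHeadersProxyMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # placeholder UA strings
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Runs before every request reaches the downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        request.meta["proxy"] = "http://127.0.0.1:8888"     # placeholder proxy
        return None  # None: let the request continue down the middleware chain

# Enable it in settings.py (hypothetical module path):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomHeadersProxyMiddleware": 543}
```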
關(guān)于爬蟲斷點爬取:
scrapy crawl 爬蟲名稱 -s JOBDIR=crawls/爬蟲名稱
requests.queue:保存請求的任務(wù)隊列
requests.seen:保存的指紋
spider.status:爬蟲運行狀態(tài)
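When a JOBDIR is set, the spider also gets a state dict that Scrapy serializes into the spider.state file; a minimal sketch with a hypothetical counter kept across pauses:

```python
import scrapy

class ResumableSpider(scrapy.Spider):       # hypothetical spider
    name = "resumable"
    start_urls = ["https://example.com/"]   # placeholder

    def parse(self, response):
        # self.state only exists when JOBDIR is set; it survives pause/resume.
        self.state["pages_done"] = self.state.get("pages_done", 0) + 1
        yield {"url": response.url, "pages_done": self.state["pages_done"]}
```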
scrapy settings.py configuration file (relevant parameters; a few common ones are sketched below)
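A sketch of settings commonly tuned for the anti-crawling measures above; the values and the commented module path are illustrative, not recommendations:

```python
# settings.py (illustrative values only)
ROBOTSTXT_OBEY = False                   # whether to respect robots.txt
CONCURRENT_REQUESTS = 16                 # global concurrency limit
DOWNLOAD_DELAY = 1                       # seconds between requests to the same site
COOKIES_ENABLED = True                   # toggle built-in cookie handling
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (placeholder UA)",
    "Referer": "https://example.com/",   # placeholder
}
DOWNLOADER_MIDDLEWARES = {
    # "myproject.middlewares.RandomHeadersProxyMiddleware": 543,  # hypothetical path
}
RETRY_ENABLED = True                     # retry failed requests
```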