Scrapy deep crawling
1. Overview of deep crawling
2. Deep crawling implemented with scrapy.Spider
3. Deep crawling implemented with scrapy.CrawlSpider
Overview of deep crawling
A crawler is a network program used mainly for data collection: it requests specified url addresses and extracts the required data from the responses. In real projects, however, the number of target url addresses is often unknown up front. Take the Zhilian recruitment project from an earlier chapter: different keyword searches return different numbers of postings, so the number of result-list pages differs from search to search; a search for crawler jobs might yield 10 pages while a search for Django jobs yields 25. How do we crawl all of that data? The answer: deep crawling.
Deep crawling: collect data starting from an initial url address, filter the response for the next batch of url addresses that need collecting, append those addresses to the collection queue for a second pass, and so on, until the data of every page has been collected. The "depth" here refers to the retrieval depth of the url addresses.
Deep crawling can be implemented in several ways. With the urllib2 and requests modules you filter out the target url addresses round after round and crawl them in a loop; in Scrapy there are two main approaches:
Manual deep collection through the addresses extracted from Response objects and re-issued Request objects (a minimal sketch follows this list)
Automatic deep collection through the link-extraction rules of the CrawlSpider type
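To make the first approach concrete before diving into the details, here is a minimal, hedged sketch of manual link-following; the start url and the XPath are hypothetical placeholders, not part of the original project:
import scrapy

class PagingSpider(scrapy.Spider):
    """Sketch of approach 1: follow pagination links by hand.
    The URL and XPath below are hypothetical placeholders."""
    name = "paging"
    start_urls = ["http://example.com/list?p=1"]

    def parse(self, response):
        # ... extract items from the current page here ...
        # Pull the next-page links out of the response and re-queue them;
        # Scrapy's scheduler de-duplicates requests it has already seen.
        for href in response.xpath("//a[@class='next']/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)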
Deep data collection with Spider, Request and Response
Before writing a deep crawler, it helps to understand how the Scrapy framework operates underneath. When Scrapy runs a crawler project it calls the parse() function by default to parse the data, but by that point the framework has already handled request scheduling and downloading. So what exactly has Scrapy done?
Let's first look at the scrapy.Spider source code:
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """
    name = None
    custom_settings = None

    # Initializer: sets up the spider name, start addresses and other data
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
    ...
    ...
    # Called on startup to send the initial requests
    def start_requests(self):
        cls = self.__class__
        # This default runs when not overridden; a subclass redefines how
        # requests are dispatched by overriding it.
        # By default GET requests are sent; to send POST requests,
        # override the start_requests function instead.
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            # Not overridden: wrap each start address in a Request and send it
            for url in self.start_urls:
                yield Request(url, dont_filter=True)
As the source shows, the scrapy.Spider type our spider class inherits from initializes the name and the start_urls addresses, then start_requests() is called automatically to wrap each address in a Request object, which is yielded (generator style) to the scheduler for further processing.
This raises the question: what exactly does the request object do?
(1) The Request object
The Request object is a core object of the Scrapy framework: a string url address is wrapped in a request object and handed to the scheduler for management, after which it is passed to the downloader module that performs the actual data collection.
Part of the underlying Request source code:
# The Request object in scrapy
class Request(object_ref):

    # By default method="GET", i.e. a GET request is built
    # url: request address string
    # callback: callback function for the response
    # headers: default request headers
    # body: request body
    # cookies: cookies sent with the request
    # encoding: request encoding
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
        if errback is not None and not callable(errback):
            raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)
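To make the parameters above concrete, here is a minimal, hedged sketch of wrapping urls into Request objects inside a callback; the start url, XPath and meta key are hypothetical placeholders:
import scrapy

class DetailSpider(scrapy.Spider):
    """Sketch only: the start URL and XPath below are placeholders."""
    name = "detail"
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            # callback names the method that receives the downloaded
            # response; meta carries arbitrary data along to it
            yield scrapy.Request(
                url=response.urljoin(href),
                callback=self.parse_detail,
                meta={"from_url": response.url},
            )

    def parse_detail(self, response):
        # meta set on the Request is available again on the Response
        self.logger.info("came from %s", response.meta["from_url"])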
In practice, three points are worth spelling out:
How to send a GET request
Just write the spider class, define the initial addresses in start_urls and the spider's name, then override the parent's parse() function; by default requests are sent as GET:
import scrapy

# Define our own spider class
class MySpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Initial list of url addresses
    start_urls = ("http://www.baidu.com", )
    # Domain restriction
    allowed_domains = ["baidu.com"]

    # Data handling
    def parse(self, response):
        # process the response data here
        pass
How to send a POST request
Because Scrapy's default Request sends a GET request, collecting data via POST requires overriding the start_requests() function to replace the parent's request wrapping:
import scrapy

class MySpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Initial list of url addresses
    start_urls = ("http://www.baidu.com", )
    # Domain restriction
    allowed_domains = ["baidu.com"]

    # Override the parent's initial request dispatch
    def start_requests(self):
        # Loop over the initial addresses and send POST requests
        for url in self.start_urls:
            yield scrapy.FormRequest(
                url=url,
                formdata={},  # fill in your POST parameter dict
                callback=self.parse_response,
            )

    # Response handler
    def parse_response(self, response):
        # process the collected response data here
        pass
Alternatively, a POST request can be built from a response object and re-sent, like this:
import scrapy

class MySpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Initial list of url addresses
    start_urls = ("http://www.baidu.com", )
    # Domain restriction
    allowed_domains = ["baidu.com"]

    # Build a new POST request out of the received response and resend it
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={},  # fill in your POST parameter dict
            callback=self.parse_response
        )

    # Response handler
    def parse_response(self, response):
        # process the collected response data here
        pass
(2) The Response object
The Response object is rarely manipulated directly in projects. For reference, part of its source code:
# partial source
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)   # status code
        self._set_body(body)        # response body
        self._set_url(url)          # response url
        self.request = request      # the originating Request object
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")
(3) Case study: simulating a CSDN login
Create the crawler project:
scrapy startproject csdnspider
In the csdnspider/csdnspider/spiders/ directory create a csdnspider.py file containing the following spider class:
# coding:utf-8
import scrapy

class CsdnSpider(scrapy.Spider):
    '''
    CSDN login spider class
    '''
    # Spider name
    name = "cs"
    # Initial login address
    start_urls = ["https://passport.csdn.net/account/login"]

    def parse(self, response):
        # Extract the login flow token
        lt = response.xpath("//form[@id='fm1']/input[@type='hidden']/@value").extract()[1]
        # Send a POST request to complete the login
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                "username": "15682808270",
                "password": "DAMUpython2016",
                "lt": lt,
                # "execution": "e2s1",
                # "_eventId": "submit"
            },
            callback=self.parse_response
        )

    def parse_response(self, response):
        # The logged-in page data arrives here for further processing
        with open("csdn.html", "w") as f:
            f.write(response.body)
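The spider is then run from the project directory with scrapy crawl cs; after a successful login the resulting page is written to csdn.html.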
(4) Deep data collection: crawling every result page of a Zhilian job search
Create the crawler project:
scrapy startproject zlspider
Analyze the request and define the Item class:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ZhilianItem(scrapy.Item):
    '''
    The type describing the collected data; collected values are wrapped
    in instances of it. It must inherit scrapy.Item so the framework can
    drive its built-in machinery automatically.
    '''
    # scrapy.Field() declares one attribute per piece of collected data
    job_name = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
Create the database and define the table that will store the data (a pipeline sketch follows the SQL):
# create the database
DROP DATABASE IF EXISTS py1709_spider;
CREATE DATABASE py1709_spider DEFAULT CHARSET 'utf8';
USE py1709_spider;

# create the table
CREATE TABLE jobs(
    id INT AUTO_INCREMENT PRIMARY KEY,
    job_name VARCHAR(200),
    company VARCHAR(200),
    salary VARCHAR(50)
);

# handy checks
SELECT COUNT(1) FROM jobs;
SELECT * FROM jobs;
TRUNCATE TABLE jobs;
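The pipeline that writes the yielded items into this table is not shown in these notes; a minimal sketch, assuming the pymysql driver and the local connection settings below, could look like this:
# zlspider/zlspider/pipelines.py -- a minimal sketch; the pymysql driver
# and the connection parameters are assumptions, adjust to your setup.
import pymysql

class ZhilianPipeline(object):

    def open_spider(self, spider):
        # connection parameters are assumptions
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="", db="py1709_spider",
                                    charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # insert one collected job record per item
        self.cursor.execute(
            "INSERT INTO jobs(job_name, company, salary) VALUES (%s, %s, %s)",
            (item["job_name"], item["company"], item["salary"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
For the pipeline to receive items it must also be registered in settings.py, e.g. ITEM_PIPELINES = {'zlspider.pipelines.ZhilianPipeline': 300}.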
Develop the spider: by wrapping the Request objects ourselves we follow and crawl the pagination links.
In the zlspider/zlspider/spiders/ folder create a zhilianspider.py file and edit the spider as follows:
# coding:utf-8
# import the scrapy module
import scrapy
from ..items import ZhilianItem

class ZhilianSpider(scrapy.Spider):
    '''
    Zhilian recruitment data-collection spider.
    Must inherit scrapy.Spider so scrapy can schedule it for crawling.
    '''
    # name attribute: spider name
    name = "zl"
    # allowed_domains attribute: restricts which domains are crawled
    allowed_domains = ["zhaopin.com"]
    # starting url addresses
    start_urls = [
        #"http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=%E7%88%AC%E8%99%AB&sm=0&sg=cab76822e6044ff4b4b1a907661851f9&p=1",
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=7cd76e75888443e6b906df8f5cf121c1&p=1",
    ]

    def parse(self, response):
        '''
        Parse function for the collected (response) data.
        Mainly filters the response: target data is wrapped into Item objects.
        :param response:
        :return:
        '''
        # # Extract the next batch of urls to crawl from the response [other pages]
        # next_page = response.xpath("//div[@class='pagesDown']/ul/li/a/@href").extract()
        # # Loop over them and issue requests
        # for page in next_page:
        #     page = response.urljoin(page)
        #     # Re-issue a request for the next url [first arg: the address to
        #     # request; second arg: the function the response is handed to]
        #     yield scrapy.Request(page, callback=self.parse_response)
        url = response.urljoin(self.start_urls[0])
        yield scrapy.Request(url, callback=self.parse_response)

    def parse_response(self, response):
        # Filter out the job list
        job_list = response.xpath("//div[@id='newlist_list_content_table']/table[position()>1]/tr[1]")
        # Loop over the jobs and collect the fields
        for job in job_list:
            # job title
            job_name = job.xpath("td[@class='zwmc']/div/a").xpath("string(.)").extract()[0]
            # company name
            company = job.xpath("td[@class='gsmc']/a").xpath("string(.)").extract()[0]
            # salary
            salary = job.xpath("td[@class='zwyx']").xpath("string(.)").extract()[0]
            # wrap the fields into an item object
            item = ZhilianItem()
            item['job_name'] = job_name
            item['company'] = company
            item['salary'] = salary
            # yield the item to the pipeline for processing
            yield item
        # Extract the next batch of urls to crawl from the response [other pages]
        next_page = response.xpath("//div[@class='pagesDown']/ul/li/a/@href").extract()
        # Loop over them and issue requests
        for page in next_page:
            page = response.urljoin(page)
            # Re-issue a request for the next url [first arg: the address to
            # request; second arg: the function the response is handed to]
            yield scrapy.Request(page, callback=self.parse_response)
Run and test the program
Run the spider in a terminal window:
scrapy crawl zl
Then check the records in the database.
Note: with this kind of deep collection the first page's data would very likely be collected twice, so parsing is split into two steps: parse() first passes the start address through response.urljoin() and re-issues the request, and parse_response() then does the actual data collection, so the first page's data is de-duplicated.
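The de-duplication works because the re-issued request keeps the default dont_filter=False (visible in the Request source above): Scrapy's scheduler fingerprints every request and silently drops urls it has already seen, so dont_filter=True is only needed when a page genuinely must be fetched again.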
Deep data collection with CrawlSpider
For deep crawling, the Scrapy framework provides a packaged type, scrapy.CrawlSpider; our own spider class must inherit it to use the deep-crawling features scrapy has encapsulated.
scrapy.CrawlSpider extends scrapy.Spider with extra functionality: by defining url extraction rules it tracks link addresses, continually pulling addresses that match the rules out of already-collected responses and crawling them in turn.
Part of its source code:
class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    # 1. The overridden parse() handles the responses returned for start_urls
    # 2. parse() hands those response objects on to _parse_response()
    # 2.1. _parse_response() is called with follow=True, which switches on
    #      following of extracted links
    # 3. parse returns items and the followed Request objects
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    # Handles the responses returned for start_urls; meant to be overridden
    def parse_start_url(self, response):
        return []

    # Result filtering hook
    def process_results(self, response, results):
        return results

    # Extracts the links matching any user-defined 'rule' from the response
    # and wraps them into Request objects
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        # Loop over the defined url extraction rules
        for n, rule in enumerate(self._rules):
            # All links matching this rule that have not been seen yet
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            # Run the user-supplied process_links over the links
            if links and rule.process_links:
                links = rule.process_links(links)
            # Add each link to the seen set, build a Request for it, and set
            # the callback to _response_downloaded()
            for link in links:
                seen.add(link)
                # Build the Request; the callback defined on the Rule becomes
                # this Request's callback
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # Call process_request() on each Request. By default it is the
                # identity function: the Request is returned unchanged.
                yield rule.process_request(r)

    # Handles a downloaded link response: extracts via the matching rule and
    # returns items and requests
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    # Parses the response object through the callback and yields Request or
    # Item objects
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # First check whether a callback is set (it may be a rule's parsing
        # function or parse_start_url).
        # If one is set (e.g. parse_start_url()), run the response through it
        # first, then hand the result to process_results; cb_res is the
        # resulting list.
        if callback:
            # Called from parse: resolves to Request objects
            # Called as a rule callback: resolves to Items
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # If following is enabled, extract and yield the Requests produced by
        # the defined Rule objects
        if follow and self._follow_links:
            # yield each Request object
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    # Compiles the rules
    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    # Global link-following configuration
    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
(1) The LinkExtractor object
The LinkExtractor type defines how links are matched and extracted.
Its extract_links() method extracts the links that match the defined rules from a response object.
The type is instantiated only once, but its extract_links() method is called again for every collected response.
class scrapy.linkextractors.LinkExtractor(
    allow = (),              # regexes; matching links are extracted
    deny = (),               # regexes; matching links are excluded
    allow_domains = (),      # allowed domains
    deny_domains = (),       # excluded domains
    deny_extensions = None,  # file extensions to exclude from extraction
    restrict_xpaths = (),    # xpath expressions; combined with allow for precise extraction
    tags = ('a','area'),     # tags to scan for links
    attrs = ('href',),       # attributes to extract links from
    canonicalize = True,
    unique = True,           # uniqueness constraint: de-duplicate or not
    process_value = None
)
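As an illustration, a hedged sketch combining allow with restrict_xpaths so that only links inside a pager element are considered; the regex and the XPath are placeholders for your target page:
from scrapy.linkextractors import LinkExtractor

# Only links whose url matches `allow` AND that sit inside the region
# selected by `restrict_xpaths` are extracted; both values are placeholders.
pager_links = LinkExtractor(
    allow=(r'searchresult\.ashx\?.*p=\d+',),
    restrict_xpaths=("//div[@class='pagesDown']",),
)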
As these parameters show, a single linkextractors.LinkExtractor object can express all sorts of extraction rules, and we never need to worry about duplicate links being added to the address list.
Let's run a quick test with scrapy shell. Open the Zhilian job-list page by executing the following in a terminal:
scrapy shell "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=4"
scrapy automatically collects the data from the given address and binds it to the response variable, then opens a python prompt. Import the LinkExtractor type and define an extraction rule:
# import the LinkExtractor type
>>> from scrapy.linkextractors import LinkExtractor
# extraction rule: links whose url contains the given pattern are extracted
>>> links = LinkExtractor(allow=(r'7624f24&p=\d+',))
Next, extract the matching links from the response data by calling extract_links():
>>> next_urls = links.extract_links(response)
Printing next_urls produces the following result:
[Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=4', text=u'\u767b\u5f55', fragment='', nofollow=True),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=3', text=u'\u4e0a\u4e00\u9875', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=1', text='1', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=2', text='2', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=5', text='5', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=6', text='6', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=7', text='7', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=8', text='8', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=9', text='...', fragment='', nofollow=False)]
As you can see at a glance, every link matching the rule has been extracted.
(2) The Rule object
A Rule object defines how the url addresses extracted by a LinkExtractor are handled; a single spider may define several Rule objects, gathered in one rules list.
class scrapy.spiders.Rule(
    # the LinkExtractor object
    link_extractor,
    # callback invoked once the data has been fetched
    callback = None,
    # keyword arguments passed to the callback
    cb_kwargs = None,
    # whether to keep applying the LinkExtractor to the responses fetched
    # from extracted links; usually True
    follow = None,
    # function automatically called on the links the LinkExtractor
    # produced, used to filter them
    process_links = None,
    # function called on each wrapped request
    process_request = None
)
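Because rules is a list, several Rule objects can cooperate. A hedged sketch mixing a follow-only pagination rule with a callback rule for detail pages; the url patterns are hypothetical:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MultiRuleSpider(CrawlSpider):
    """Sketch only: the URL patterns below are hypothetical."""
    name = "multirule"
    start_urls = ["http://example.com/list?p=1"]

    rules = [
        # keep following pagination links, no callback needed
        Rule(LinkExtractor(allow=(r'list\?p=\d+',)), follow=True),
        # parse each detail page with parse_item and stop following there
        Rule(LinkExtractor(allow=(r'/detail/\d+',)),
             callback='parse_item', follow=False),
    ]

    def parse_item(self, response):
        # extract fields from the detail page here
        pass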
(3) Case study
A Zhilian recruitment deep-crawling example.
Create the crawler project:
scrapy startproject zhilianspider2
Create the spider
In the zhilianspider2/zhilianspider2/spiders/ directory create a zhilianspider.py file and edit it as follows:
# coding:utf-8
# import the CrawlSpider, Rule and LinkExtractor types
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZhilianSpider(CrawlSpider):
    """
    Zhilian recruitment deep-crawling spider class.
    Inherits scrapy.spiders.CrawlSpider.
    """
    # Spider name
    name = "cs2"
    # Domain restriction
    allowed_domains = ["zhaopin.com"]
    # Starting address
    start_urls = ("http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=1",)
    # Extraction rule
    links = LinkExtractor(
        allow=(r"5837624f24&p=\d+",)
    )
    # Handling rules
    rules = [
        # a single handling rule
        Rule(links, follow=True, callback='parse_response'),
    ]

    # Data handling function
    def parse_response(self, response):
        # extract the data
        job_list = response.xpath("//div[@id='newlist_list_content_table']/table[@class='newlist'][position()>1]")
        # loop over and filter the data
        for job in job_list:
            job_name = job.xpath("tr[1]/td[@class='zwmc']/div/a").xpath("string(.)").extract()[0]
            print(job_name)
        print("*************************************************")
Run the spider with the following terminal command:
scrapy crawl cs2
The crawl details appear in the console; every extracted link has been followed and processed:
..
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=13> (referer: http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=9)
....
圖像算法工程師
軟件測(cè)試工程師
********************************************************************
軟件測(cè)試經(jīng)理
高級(jí)軟件測(cè)試工程師
......
'scheduler/enqueued/memory': 17,
'spider_exceptions/IOError': 3,
'spider_exceptions/UnicodeEncodeError': 1,
'start_time': datetime.datetime(2018, 1, 17, 4, 33, 38, 441000)}
2018-01-17 12:35:56 [scrapy.core.engine] INFO: Spider closed (shutdown)
With that, both approaches to deep data collection have been covered.