Deep Crawling

Deep crawling with Scrapy

1. Overview of deep crawling
2. Deep crawling implemented with scrapy Spider
3. Deep crawling implemented with scrapy CrawlSpider

Overview of deep crawling

A crawler is, at its core, a network program used for data collection and processing: it sends requests to specified URLs and extracts whatever data it needs from the responses. In real projects, however, the number of target URLs is often not known up front. In the Zhilian recruitment project from an earlier chapter, for example, different keyword searches return different numbers of results, which means the number of result-list pages also differs from search to search: a search for crawler engineer positions might return 10 pages, while a search for Django positions might return 25. How do we crawl all of that data? The answer is deep crawling.

Deep crawling: collect data starting from the initial URLs, filter each response for the next wave of URLs that need to be collected, add those URLs to the crawl queue for a second pass, and so on, until the data on every page has been collected. The "depth" here refers to the retrieval depth of the URLs.

Deep crawling can be implemented in different ways. With urllib2 or requests you repeatedly filter each response for the next target URLs and crawl them in a loop (a minimal sketch of this manual approach follows the list below). In Scrapy it is handled mainly in two ways:

by driving the deep collection yourself, using the URLs extracted from Response objects to build new Request objects;
by letting the link-extraction rules of the CrawlSpider class perform the deep data collection automatically.
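
To make the manual approach mentioned above concrete, here is a minimal sketch built on the requests library; the start URL and the pagination pattern are hypothetical placeholders, not part of any project in this chapter.

# A rough sketch of "manual" deep crawling with requests (not Scrapy):
# fetch a page, pull the next wave of page links out of it, and keep
# looping until no unseen links remain.
import re
import requests

start_url = "http://www.example.com/jobs?p=1"   # hypothetical start URL
seen, queue = set(), [start_url]

while queue:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url, timeout=10).text
    # ... extract and store the data you need from html here ...
    # filter the response for the next wave of page links (hypothetical pattern)
    for link in re.findall(r'href="(/jobs\?p=\d+)"', html):
        queue.append("http://www.example.com" + link)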

Deep data collection with Spider, Request, and Response

Before implementing a deep crawler, let's first look at how Scrapy works under the hood. When Scrapy runs a crawler project it calls parse() by default to parse the data, but by the time parse() runs the framework has already handled request scheduling and downloading. So what exactly has Scrapy done?

Let's start by looking at the scrapy.Spider source code:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None
    
    
    # Initialization: sets up the spider name, start URLs, and other data
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
    ...
    ...
    # Called when the crawl starts; builds and sends the initial requests
    def start_requests(self):
        cls = self.__class__
        # If this method is not overridden, the default behaviour below runs; a subclass can redefine how requests are built and sent.
        # GET requests are sent by default; to send POST requests, override start_requests.
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            # make_requests_from_url was not overridden: wrap each start URL in a Request and send it
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

As the source shows, the scrapy.Spider class that our spider inherits from initializes name and the start_urls list, then start_requests() is called automatically to wrap each start URL in a Request object, and each request is yielded (generator-style) to the scheduler for further processing.

This raises the question: what exactly does the Request object do?

(1) The Request object
The Request object is one of the core objects in the Scrapy framework: a URL string is wrapped into a Request, handed to the scheduler for scheduling and management, and then passed on to the download module, which actually collects the data.

Part of the underlying Request source code is shown below:

# the Request object in scrapy
class Request(object_ref):

    # With the defaults, method="GET", i.e. the request is a GET request
    # url: the request URL string
    # callback: callback invoked with the response
    # headers: default request headers
    # body: request body
    # cookies: cookies carried with the request
    # encoding: request encoding
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
        if errback is not None and not callable(errback):
            raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)

In practice, the following three points spell out how this is used:

How to send a GET request
Just write the spider: define the start URLs in start_urls and the spider's name, then override the parent class's parse() function; requests are sent as GET by default:

import scrapy

# Define our own spider class
class MySpider(scrapy.Spider):
    # spider name
    name = 'myspider'
    # list of start URLs
    start_urls = ("http://www.baidu.com", )
    # restrict crawling to the listed domains
    allowed_domains = ["baidu.com"]

    # define how the response data is processed
    def parse(self, response):
        # data processing goes here
        pass

How to send a POST request
Because Scrapy's default Request sends GET requests, collecting data with POST means overriding the start_requests() function to replace the parent class's request-building behaviour:

import scrapy

class MySpider(scrapy.Spider):
    # spider name
    name = 'myspider'
    # list of start URLs
    start_urls = ("http://www.baidu.com", )
    # restrict crawling to the listed domains
    allowed_domains = ["baidu.com"]

    # override the parent's initial request handling
    def start_requests(self):
        # loop over the start URLs and send POST requests
        for url in self.start_urls:
            yield scrapy.FormRequest(
                url = url,
                formdata = {"key": "value"},  # dict of POST parameters (placeholder)
                callback = self.parse_response,
            )

    # custom response handling function
    def parse_response(self, response):
        # process the collected response data
        pass

A POST request can also be built from a response object and re-sent, as follows:

import scrapy

class MySpider(scrapy.Spider):

    # spider name
    name = 'myspider'
    # list of start URLs
    start_urls = ("http://www.baidu.com", )
    # restrict crawling to the listed domains
    allowed_domains = ["baidu.com"]

    # build a POST request from the first response and re-send it
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata = {"key": "value"},  # dict of POST parameters (placeholder)
            callback = self.parse_response
        )

    # custom response handling function
    def parse_response(self, response):
        # process the collected response data
        pass

(2) The Response object
The Response object is rarely manipulated directly in a project; part of its source code is shown below for reference:

# partial source
class Response(object_ref):
    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)       # response status code
        self._set_body(body)            # response body
        self._set_url(url)              # response url
        self.request = request          # the request that produced this response
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response " \
                "is not tied to any request")

(3) Case study: simulating a CSDN login

Create the crawler project

scrapy startproject csdnspider

Create a csdnspider.py file in the csdnspider/csdnspider/spiders/ directory and define the spider class as follows:

# coding:utf-8

import scrapy


class CsdnSpider(scrapy.Spider):
    '''
    Spider class that handles the CSDN login
    '''
    # spider name
    name = "cs"
    # initial login URL
    start_urls = ["https://passport.csdn.net/account/login"]

    def parse(self, response):

        # extract the login flow token (lt) from the hidden form fields
        lt = response.xpath("//form[@id='fm1']/input[@type='hidden']/@value").extract()[1]

        # send the POST request that performs the login
        return scrapy.FormRequest.from_response(
            response,
            formdata = {
                "username": "15682808270",
                "password": "DAMUpython2016",
                "lt": lt,
                # "execution": "e2s1",
                # "_eventId": "submit"
            },
            callback=self.parse_response
        )

    def parse_response(self, response):
        # the post-login page data; handle it as needed
        with open("csdn.html", "w") as f:
            f.write(response.body)
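
Run the spider from the project directory; once the login POST completes, whatever page comes back is written to csdn.html:

scrapy crawl cs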

(4) Deep data collection: crawl every result page for a Zhilian job search

Create the crawler project

scrapy startproject zlspider

Analyze the request and define the Item class

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhilianItem(scrapy.Item):
    '''
    Defines the type of data being collected; instances wrap the collected data.
        It must inherit from scrapy.Item so that the scrapy framework calls its
        built-in functions and carries on with the automated processing.
    '''
    # each attribute defined with scrapy.Field() is one field of the collected data
    job_name = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()

Create the database and define the table used to store the data

# create the database
DROP DATABASE IF EXISTS py1709_spider;
CREATE DATABASE py1709_spider DEFAULT CHARSET 'utf8';

USE py1709_spider;

# create the data table
CREATE TABLE jobs(
    id INT AUTO_INCREMENT PRIMARY KEY,
    job_name VARCHAR(200),
    company VARCHAR(200),
    salary VARCHAR(50)
);
SELECT COUNT(1) FROM jobs;
SELECT * FROM jobs;
TRUNCATE TABLE jobs;
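
The tutorial does not show the item pipeline that actually writes the collected items into the jobs table, so here is a minimal sketch of what it could look like, assuming the pymysql driver and the database defined above (host and credentials are placeholders); it still has to be registered in settings.py, e.g. ITEM_PIPELINES = {"zlspider.pipelines.MysqlPipeline": 300}.

# coding:utf-8
# Minimal MySQL pipeline sketch (not shown in the original project).
import pymysql


class MysqlPipeline(object):

    def open_spider(self, spider):
        # open a single connection for the whole crawl (placeholder credentials)
        self.conn = pymysql.connect(host="localhost", user="root", password="root",
                                    db="py1709_spider", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # insert one job record per item yielded by the spider
        self.cursor.execute(
            "INSERT INTO jobs(job_name, company, salary) VALUES (%s, %s, %s)",
            (item["job_name"], item["company"], item["salary"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()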

Develop the spider: wrap the requests yourself so that the pagination links are followed and crawled.
Create a zhilianspider.py file in the zlspider/zlspider/spiders/ directory and edit the spider as follows:

# coding:utf-8

# import the scrapy module
import scrapy

from ..items import ZhilianItem


class ZhilianSpider(scrapy.Spider):
    '''
    Zhilian recruitment data-collection spider.
        Must inherit from scrapy.Spider so that scrapy can schedule the spider
        and drive the data collection.
    '''
    # name attribute: the spider name
    name = "zl"
    # allowed_domains attribute: restrict data collection to these domains
    allowed_domains = ["zhaopin.com"]
    # start URLs
    start_urls = [
        #"http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=%E7%88%AC%E8%99%AB&sm=0&sg=cab76822e6044ff4b4b1a907661851f9&p=1",
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=7cd76e75888443e6b906df8f5cf121c1&p=1",
    ]

    def parse(self, response):
        '''
        Response parsing function.
            Mainly filters the response data: the target data is selected and
            wrapped into Item objects.
        :param response:
        :return:
        '''


        # # Extract the next wave of URLs to crawl from the response [other page requests]
        # next_page = response.xpath("//div[@class='pagesDown']/ul/li/a/@href").extract()
        # # loop over the extracted links
        # for page in next_page:
        #     page = response.urljoin(page)
        #     # re-issue the request [arg 1: the URL to request, arg 2: the callback that handles the downloaded response]
        #     yield scrapy.Request(page, callback=self.parse_response)
        url = response.urljoin(self.start_urls[0])
        yield scrapy.Request(url, callback=self.parse_response)


    def parse_response(self, response):
        # extract the list of jobs
        job_list = response.xpath("//div[@id='newlist_list_content_table']/table[position()>1]/tr[1]")
        # loop over the jobs and pull out the fields we collect
        for job in job_list:
            # job title
            job_name = job.xpath("td[@class='zwmc']/div/a").xpath("string(.)").extract()[0]
            # company name
            company = job.xpath("td[@class='gsmc']/a").xpath("string(.)").extract()[0]
            # salary
            salary = job.xpath("td[@class='zwyx']").xpath("string(.)").extract()[0]

            # wrap the fields into an item object
            item = ZhilianItem()
            item['job_name'] = job_name
            item['company'] = company
            item['salary'] = salary

            # yield the item to the pipeline for further processing
            yield item
        # extract the next wave of URLs to crawl from the response [other page requests]
        next_page = response.xpath("//div[@class='pagesDown']/ul/li/a/@href").extract()
        # loop over the extracted links
        for page in next_page:
            page = response.urljoin(page)
            # re-issue the request [arg 1: the URL to request, arg 2: the callback that handles the downloaded response]
            yield scrapy.Request(page, callback=self.parse_response)

Run and test the program
Run the spider from a terminal:

scrapy crawl zl

Check the data records stored in the database.

Note: with this kind of deep crawl the first page's data can easily be collected twice, so the parsing is split into two steps: parse() only takes the start URL, passes it through response.urljoin(), and re-issues the request, while parse_response() does the actual data collection; this keeps the first page's data from being duplicated.
Deep data collection with CrawlSpider

For deep crawling, the Scrapy framework provides a packaged spider type, scrapy.CrawlSpider. The spider class we write must inherit from it in order to use the deep-crawling features that Scrapy provides.

scrapy.CrawlSpider inherits from scrapy.Spider and extends it: you define URL extraction rules, it follows the matching links, and it keeps extracting rule-matching URLs from the responses already collected so that those pages are crawled as well.

Part of its source code is shown below:

class CrawlSpider(Spider):
    rules = ()
    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    # 1. parse(), overridden from the parent class, handles the responses returned for start_urls
    # 2. parse() then hands those response objects to _parse_response()
    # 2.1. _parse_response() is called with follow=True, which turns link following on
    # 3. parse returns both items and the follow-up Request objects
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    # handles the responses returned for the start URLs; meant to be overridden
    def parse_start_url(self, response):
        return []

    # result filtering hook
    def process_results(self, response, results):
        return results

    # extract the links matching any of the user-defined rules from the response and wrap them into Request objects
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        
        # loop over the defined URL extraction rules
        for n, rule in enumerate(self._rules):
            # collect the links matching this rule that have not been seen yet
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            # run the user-supplied process_links over the extracted links
            if links and rule.process_links:
                links = rule.process_links(links)
            # add each link to the seen set, build a Request for it, and set its callback to _response_downloaded()
            for link in links:
                seen.add(link)
                # build the Request object; the callback defined in the Rule becomes this Request's callback
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # call process_request() on each Request; by default it is the identity function and returns the Request unchanged
                yield rule.process_request(r)

    # handles downloaded responses for followed links: dispatches to the matching rule and returns items and requests
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    # parse the response object via the callback and return Request or Item objects
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # First check whether a callback was set (it may be a rule callback or parse_start_url).
        # If one was set (e.g. parse_start_url()), the response is handed to it first,
        # and the result is then passed through process_results, yielding a list of cb_res values.
        if callback:
            # if called from parse, the results are parsed into Request objects
            # if called from a rule callback, the results are parsed into Items
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # if following is enabled, use the defined Rule objects to extract and return the follow-up Requests
        if follow and self._follow_links:
            # yield each Request object
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    # compile the rules
    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    # reads the global setting that controls whether links are followed
    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

(1) The LinkExtractor link-extraction object

The LinkExtractor class defines how links are matched and extracted.

Its extract_links() method extracts the links that match the defined rules from a response object.

The class is instantiated only once, but extract_links() is called again for every response that is collected.

class scrapy.linkextractors.LinkExtractor(
    allow = (),         # regular expressions; matching links are extracted
    deny = (),          # regular expressions; matching links are excluded
    allow_domains = (), # domains to allow
    deny_domains = (),  # domains to exclude
    deny_extensions = None, # file extensions to skip
    restrict_xpaths = (),   # xpath expressions, combined with allow for precise extraction
    tags = ('a','area'),    # tags scanned for links
    attrs = ('href'),       # attributes the links are read from
    canonicalize = True,
    unique = True,          # whether extracted links are deduplicated
    process_value = None
)

As the parameters above show, a single linkextractors.LinkExtractor object can express a variety of extraction rules, and you do not have to worry about duplicate links being added to the address list.

Let's run a quick test with scrapy shell. Open the Zhilian job-list page by running the following command in a terminal:

scrapy shell "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=4"

Scrapy automatically fetches the given URL and exposes the result in the response variable of the Python shell it opens. Import the LinkExtractor class and define an extraction rule:

# import the LinkExtractor class
>>> from scrapy.linkextractors import LinkExtractor
# define the extraction rule: links containing the given pattern will be extracted
>>> links = LinkExtractor(allow=('7624f24&p=\d+'))

Next, extract the matching links from the response by calling extract_links():

next_urls = links.extract_links(response)

Printing next_urls gives the following result:

[Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=4', text=u'\u767b\u5f55', fragment='', nofollow=True),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=3', text=u'\u4e0a\u4e00\u9875', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=1', text='1', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=2', text='2', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=5', text='5', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=6', text='6', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=7', text='7', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=8', text='8', fragment='', nofollow=False),
 Link(url='http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=9', text='...', fragment='', nofollow=False)]

As you can see, every link that matches the rule has been extracted.

(2) The Rule object

A Rule object defines how the URLs extracted by a LinkExtractor are handled. A spider can define several Rule objects; they just have to be collected in a single rules list.

class scrapy.spiders.Rule(
        # the LinkExtractor object
        link_extractor,
        # callback invoked once the data for a followed link has been collected
        callback = None,
        # extra keyword arguments passed to the callback
        cb_kwargs = None,
        # whether to keep extracting links (via the LinkExtractor) from the returned responses; usually True
        follow = None,
        # called on the links extracted by the LinkExtractor; used to filter the extracted links
        process_links = None,
        # function called when each request is built and wrapped
        process_request = None
)
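
Since a spider can hold several Rule objects in its rules list, here is a minimal hypothetical sketch of how two rules can work side by side; the URL patterns, spider name, and parse_job callback are placeholders and are not part of the case study below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MultiRuleSpider(CrawlSpider):
    # hypothetical spider, only to illustrate multiple rules
    name = "multirule"
    start_urls = ["http://www.example.com/jobs?p=1"]

    rules = [
        # rule 1: follow pagination links; no callback, just keep crawling deeper
        Rule(LinkExtractor(allow=(r"\?p=\d+",)), follow=True),
        # rule 2: hand job detail pages to parse_job and do not follow them further
        Rule(LinkExtractor(allow=(r"/job/\d+",)), callback="parse_job", follow=False),
    ]

    def parse_job(self, response):
        # handle a single job detail page here
        pass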

(3) Case study

Deep crawling Zhilian recruitment with CrawlSpider:

Create the crawler project

scrapy startproject zhilianspider2

Create the spider
Create a zhilianspider.py file in the zhilianspider2/zhilianspider2/spiders/ directory and edit it as follows:

# coding:utf-8

# import the CrawlSpider, Rule, and LinkExtractor classes
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZhilianSpider(CrawlSpider):
    """
    Zhilian recruitment deep-crawling spider
    Inherits from scrapy.spiders.CrawlSpider
    """
    # spider name
    name = "cs2"
    # restrict crawling to the listed domains
    allowed_domains = ["zhaopin.com"]
    # start URLs
    start_urls = ("http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC%2b%E4%B8%8A%E6%B5%B7%2b%E5%B9%BF%E5%B7%9E%2b%E6%B7%B1%E5%9C%B3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=1",)

    # define the link-extraction rule
    links = LinkExtractor(
        allow=("5837624f24&p=\d+")
    )

    # define the crawling rules
    rules = [
        # a single rule: follow the extracted links and hand the responses to parse_response
        Rule(links, follow=True, callback='parse_response'),
    ]

    # define the data-handling function
    def parse_response(self, response):
        # extract the list of jobs
        job_list = response.xpath("//div[@id='newlist_list_content_table']/table[@class='newlist'][position()>1]")
        # loop over the jobs
        for job in job_list:
            job_name = job.xpath("tr[1]/td[@class='zwmc']/div/a").xpath("string(.)").extract()[0]

            print(job_name)

        print("*************************************************")

Run the spider from a terminal with the following command:

scrapy crawl cs2

The crawl details appear in the console; every link matched by the extraction rule is followed:

..
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2
b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=13> (referer: http
://sou.zhaopin.com/jobs/searchresult.ashx?jl=%e5%8c%97%e4%ba%ac%2b%e4%b8%8a%e6%b5%b7%2b%e5%b9%bf%e5%b7%9e%2b%e6%b7%b1%e5%9c%b3&kw=python
&isadv=0&sg=5b827b7808f548ad8261595837624f24&p=9)

....
圖像算法工程師
軟件測(cè)試工程師
********************************************************************
軟件測(cè)試經(jīng)理
高級(jí)軟件測(cè)試工程師

......
 'scheduler/enqueued/memory': 17,
 'spider_exceptions/IOError': 3,
 'spider_exceptions/UnicodeEncodeError': 1,
 'start_time': datetime.datetime(2018, 1, 17, 4, 33, 38, 441000)}
2018-01-17 12:35:56 [scrapy.core.engine] INFO: Spider closed (shutdown)

That covers deep data collection with both approaches.
