Writing Your Own Python Crawler Framework (Part 1)

Preface

As a crawler framework, it differs from other crawlers in two ways: it is flexible to extend and easy to get started with.

Overall Architecture

[Figure: overall architecture diagram]

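Scheduler
The scheduler takes a request from the request manager, calls the downloader to download it, and dispatches the parser to parse the result. Requests found during parsing are added back to the request manager; the parsed data is buffered and finally written to a text file.
The scheduler is the core of the crawler framework and also the crawler's entry point. Its member variables are the downloader, the parser, the request manager, the file writer, and the logger. Each of these interfaces can be implemented differently to suit different crawling needs.
Core methods:
Takes requests from the request manager. This method follows the template method pattern and must not be overridden by subclasses.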

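def run(self, request):
    '''
    Run the crawler: take each available request from the requestManager, hand it to the
    downloader, and let the parser parse the downloaded document.
    An initial request (or group of requests) must be passed in to start the crawl.

        @param :: request : a Request object
        return : None
    '''
    self.__start_icon()
    self.__logger.info('\tStart crawl...')
    # Seed the request manager, then keep crawling until no requests remain
    self.__requestManager.add_new_request(request)
    while self.__requestManager.has_new_request():
        request = self.__requestManager.get_new_request()
        # Log the request's attributes instead of printing them to stdout
        self.__logger.debug(str(request.__dict__))
        self.crawl(request)
    self.__logger.info('\tEnd crawl...')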

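Dispatches the downloader and the parser. Subclasses can inherit and extend this method, for example to run with multiple threads or multiple processes.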

def crawl(self, request):
    '''
    The unit of work that fetches one request and parses the result. Subclasses may override
    this method to run with multiple threads or processes, or to fetch and parse
    asynchronously; the method can also be wrapped with a decorator for the same purpose.
    The downloader downloads the given request and the parser parses the downloaded document.
    Parsed requests are handed to the requestManager so the crawl can go deeper; parsed data
    is handed to the writter, which stores it on disk.

        @param :: request : a Request object
        return None
    '''
    try:
        self.__logger.info('\t' + request.url)
        response = self.__downloader.download(request)
        if response.status_code == 200:
            # Queue the newly discovered requests and buffer the extracted data
            requests, data = self.__parser.parse(response)
            self.__requestManager.add_new_requests(requests)
            self.__writter.write_buffer(data)
        else:
            self.__logger.warn(
                'crawled data is None and response_status is ' + str(response.status_code))
    except Exception as e:
        # Log the failure and continue with the next request
        self.__logger.exception('\tCrawling occurs error\n' + repr(e))
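
To make the flow above concrete, here is a minimal sketch of how a scheduler might hold its components and drive them. The Scheduler constructor, the Request and Response classes, and every stub component below are assumptions made for illustration; the article does not show them. The logger is left out to keep the sketch short.

```python
from collections import deque


class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, status_code, text=''):
        self.url = url
        self.status_code = status_code
        self.text = text


class DictDownloader:
    """Hypothetical downloader: serves pages from a dict instead of the network."""
    def __init__(self, pages):
        self.pages = pages

    def download(self, request):
        if request.url in self.pages:
            return Response(request.url, 200, self.pages[request.url])
        return Response(request.url, 404)


class LineParser:
    """Toy page format: lines starting with '->' are links, other lines are data."""
    def parse(self, response):
        requests, data = [], []
        for line in response.text.splitlines():
            if line.startswith('->'):
                requests.append(Request(line[2:].strip()))
            elif line.strip():
                data.append(line.strip())
        return requests, data


class QueueRequestManager:
    def __init__(self):
        self.queue = deque()

    def add_new_request(self, request):
        self.queue.append(request)

    def add_new_requests(self, requests):
        for request in requests:
            self.add_new_request(request)

    def has_new_request(self):
        return bool(self.queue)

    def get_new_request(self):
        return self.queue.popleft()


class ListWriter:
    def __init__(self):
        self.items = []

    def write_buffer(self, data):
        self.items.extend(data)


class Scheduler:
    def __init__(self, downloader, parser, request_manager, writer):
        self._downloader = downloader
        self._parser = parser
        self._request_manager = request_manager
        self._writer = writer

    def run(self, request):
        # Fixed loop: seed the queue, then crawl until it is empty
        self._request_manager.add_new_request(request)
        while self._request_manager.has_new_request():
            self.crawl(self._request_manager.get_new_request())

    def crawl(self, request):
        response = self._downloader.download(request)
        if response.status_code == 200:
            requests, data = self._parser.parse(response)
            self._request_manager.add_new_requests(requests)
            self._writer.write_buffer(data)


pages = {'a': 'title A\n-> b\n-> c', 'b': 'title B'}
writer = ListWriter()
Scheduler(DictDownloader(pages), LineParser(), QueueRequestManager(), writer).run(Request('a'))
print(writer.items)
```

Page c is missing from the dict, so it comes back as a 404 and contributes no data. Swapping any stub for another object with the same methods changes the crawler's behavior without touching the scheduler.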

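Downloader
Sends the incoming request and downloads the response.
Abstract method:

def download(self, request):
    '''
    Send the request and download the result; the response obtained must be returned.

        @param :: request : the request object
        return response
    '''
    # get__function_name() is a helper that is not shown in this article
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))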
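
The get__function_name() helper used in the error message is never defined in the article. One plausible way to write it, assuming it is meant to return the name of the method that called it, uses the standard inspect module. The Downloader and HalfFinishedDownloader class names below are hypothetical.

```python
import inspect


def get__function_name():
    # inspect.stack()[0] is this helper's own frame; [1] is the caller's frame
    return inspect.stack()[1].function


class Downloader:
    def download(self, request):
        raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
            self.__class__.__name__, get__function_name()))


class HalfFinishedDownloader(Downloader):
    pass  # download() was not overridden


try:
    HalfFinishedDownloader().download(None)
except NotImplementedError as e:
    message = str(e)
    print(message)
```

The message names the concrete subclass and the missing method, which makes an incomplete implementation easy to spot at runtime.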

Parser
Parses the text passed in.
Abstract method:

def parse(self, response):
    '''
    Parse the response; the parsed requests and data must be returned.

        @param :: response : the response object
        return requests,data
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))
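
As an illustration of what a concrete parser could look like, the sketch below extracts links and the page title with the standard html.parser module. It assumes that a response exposes url and text attributes and that a Request is built from a URL; neither is confirmed by the article, and the class names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Request:
    def __init__(self, url):
        self.url = url


class Response:
    def __init__(self, url, text, status_code=200):
        self.url = url
        self.text = text
        self.status_code = status_code


class _LinkAndTitleCollector(HTMLParser):
    """Collects href values of <a> tags and the text inside <title>."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)
        elif tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LinkParser:
    def parse(self, response):
        collector = _LinkAndTitleCollector()
        collector.feed(response.text)
        # Resolve relative links against the page URL before queuing them
        requests = [Request(urljoin(response.url, href)) for href in collector.links]
        data = [{'url': response.url, 'title': collector.title.strip()}]
        return requests, data


html = ('<html><head><title> Demo </title></head><body>'
        '<a href="/next">next</a><a href="http://other.test/x">x</a>'
        '</body></html>')
requests, data = LinkParser().parse(Response('http://example.test/start', html))
print([r.url for r in requests], data)
```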

Request Manager
Manages all requests.
Abstract methods:

def add_new_requests(self, requests):
    '''
    Template method; subclasses do not need to override it.
    Add a sequence of requests to the requestManager.

        @param :: requests : list of request objects
        return None
    '''
    for request in requests:
        self.add_new_request(request)

def add_new_request(self, request):
    '''
    Add a request object to the requestManager.

        @param :: request : the request object
        return None
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def has_new_request(self):
    '''
    Check whether the requestManager still holds new requests; returns a boolean.

        return Bool
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def get_new_request(self):
    '''
    Take a new request from the requestManager.

        return request
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))
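
A minimal in-memory implementation of this interface might look like the sketch below. Deduplicating by URL is an assumption added here so that pages linking to each other do not loop forever; the article does not say how duplicates are handled, and the FifoRequestManager name is hypothetical.

```python
from collections import deque


class Request:
    def __init__(self, url):
        self.url = url


class FifoRequestManager:
    def __init__(self):
        self._pending = deque()     # requests waiting to be crawled
        self._seen_urls = set()     # every URL ever queued

    def add_new_requests(self, requests):
        for request in requests:
            self.add_new_request(request)

    def add_new_request(self, request):
        # Skip URLs that were already queued once
        if request.url in self._seen_urls:
            return
        self._seen_urls.add(request.url)
        self._pending.append(request)

    def has_new_request(self):
        return len(self._pending) > 0

    def get_new_request(self):
        # popleft() gives first-in, first-out order
        return self._pending.popleft()


manager = FifoRequestManager()
manager.add_new_requests([
    Request('http://example.test/a'),
    Request('http://example.test/b'),
    Request('http://example.test/a'),   # duplicate, ignored
])
order = []
while manager.has_new_request():
    order.append(manager.get_new_request().url)
print(order)
```

With a FIFO queue the crawl proceeds breadth-first; taking requests from the right end with pop() instead would make it depth-first.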

Data Writer
Writes data to disk in a specific format.
Abstract methods:

def write(self, data):
    '''
    Write data to disk.

        return None
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def write_buffer(self, data):
    '''
    Buffer the data.

        return None
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def flush_buffer(self):
    '''
    Flush the buffered data to disk.

        return None
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))
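
The three methods could fit together as in the following sketch, which writes one JSON object per line. The file format, the buffer_size threshold, and the assumption that data is a list of dicts are all illustrative choices, not something the article specifies.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    def __init__(self, path, buffer_size=100):
        self._path = path
        self._buffer_size = buffer_size
        self._buffer = []

    def write(self, data):
        # Append each record as one JSON object per line
        with open(self._path, 'a', encoding='utf-8') as f:
            for record in data:
                f.write(json.dumps(record, ensure_ascii=False) + '\n')

    def write_buffer(self, data):
        # Keep records in memory and flush once the buffer is large enough
        self._buffer.extend(data)
        if len(self._buffer) >= self._buffer_size:
            self.flush_buffer()

    def flush_buffer(self):
        if self._buffer:
            self.write(self._buffer)
            self._buffer = []


path = os.path.join(tempfile.mkdtemp(), 'items.jsonl')
writer = JsonLinesWriter(path, buffer_size=2)
writer.write_buffer([{'title': 'A'}])   # buffered only
writer.write_buffer([{'title': 'B'}])   # threshold reached, both records flushed
writer.write_buffer([{'title': 'C'}])   # buffered only
writer.flush_buffer()                   # final flush for the leftover record

with open(path, encoding='utf-8') as f:
    titles = [json.loads(line)['title'] for line in f]
print(titles)
```

With a design like this, a final flush_buffer() call is needed once crawling ends so that a partially filled buffer is not lost.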

Logger
Records the crawler's activity.
Abstract methods:

def debug(self, message):
    '''
    Log at debug level
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def info(self, message):
    '''
    Log at info level
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def warn(self, message):
    '''
    Log at warn level
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def exception(self, message):
    '''
    Log at exception level
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))

def error(self, message):
    '''
    Log at error level
    '''
    raise NotImplementedError("Unimplemented parent-class method: %s.%s" % (
        self.__class__.__name__, get__function_name()))
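
One straightforward way to satisfy this interface is to delegate to Python's standard logging module, as in the sketch below. The StdLogger name, the default level, and the message format are assumptions.

```python
import logging


class StdLogger:
    def __init__(self, name='crawler', level=logging.INFO):
        self._logger = logging.getLogger(name)
        self._logger.setLevel(level)
        # Attach a handler only once, even if StdLogger is created repeatedly
        if not self._logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(
                logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
            self._logger.addHandler(handler)

    def debug(self, message):
        self._logger.debug(message)

    def info(self, message):
        self._logger.info(message)

    def warn(self, message):
        # Logger.warn is deprecated in the standard library; warning() replaces it
        self._logger.warning(message)

    def exception(self, message):
        # Includes the active traceback when called from an except block
        self._logger.exception(message)

    def error(self, message):
        self._logger.error(message)


log = StdLogger('crawler-demo')
log.debug('not shown at the default INFO level')
log.info('\tStart crawl...')
log.warn('response_status is 404')
```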

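Design Ideas

The Abstract Factory pattern assembles "abstract parts" into an "abstract product". Accordingly, the downloader, parser, request manager, data writer, and logger are each constrained only by an abstract interface, and the scheduler calls whatever concrete implementation of each part it has been given.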

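The scheduler follows the template method pattern: the method that starts the crawl (run) is fixed. To extend the scheduler with multithreading or multiprocessing, inherit from it and override the crawl-dispatch method (crawl) with a multithreaded or multiprocess implementation.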
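
The sketch below shows one way such a subclass could work while leaving the run loop untouched: the overridden crawl gathers a batch of pending requests, downloads and parses them in a thread pool, and updates the shared state only after the whole batch has finished. Everything here is an assumption for illustration, including the simplified Scheduler, the stub components, and the use of plain strings as requests. The attributes use a single leading underscore because double-underscore names such as __downloader are name-mangled by Python and cannot be reached under that name from a subclass.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Scheduler:
    """Simplified stand-in for the article's scheduler (assumed shape)."""
    def __init__(self, downloader, parser, request_manager, writer):
        self._downloader = downloader
        self._parser = parser
        self._request_manager = request_manager
        self._writer = writer

    def run(self, request):
        # Template method: this loop stays fixed, subclasses only change crawl()
        self._request_manager.add_new_request(request)
        while self._request_manager.has_new_request():
            self.crawl(self._request_manager.get_new_request())

    def crawl(self, request):
        requests, data = self._parser.parse(self._downloader.download(request))
        self._request_manager.add_new_requests(requests)
        self._writer.write_buffer(data)


class ThreadedScheduler(Scheduler):
    def __init__(self, *components, workers=4):
        super().__init__(*components)
        self._workers = workers

    def _fetch_and_parse(self, request):
        return self._parser.parse(self._downloader.download(request))

    def crawl(self, request):
        # Batch = the request handed in by run() plus other pending requests
        batch = [request]
        while len(batch) < self._workers and self._request_manager.has_new_request():
            batch.append(self._request_manager.get_new_request())
        # Download and parse the batch concurrently; map() keeps input order
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            results = list(pool.map(self._fetch_and_parse, batch))
        # Shared state is updated on the calling thread, so no lock is needed
        for requests, data in results:
            self._request_manager.add_new_requests(requests)
            self._writer.write_buffer(data)


class DictDownloader:
    def __init__(self, links):
        self.links = links

    def download(self, request):
        return request, self.links.get(request, [])


class PassThroughParser:
    def parse(self, response):
        page, links = response
        return links, [page]


class DedupManager:
    def __init__(self):
        self.queue, self.seen = deque(), set()

    def add_new_request(self, request):
        if request not in self.seen:
            self.seen.add(request)
            self.queue.append(request)

    def add_new_requests(self, requests):
        for request in requests:
            self.add_new_request(request)

    def has_new_request(self):
        return bool(self.queue)

    def get_new_request(self):
        return self.queue.popleft()


class ListWriter:
    def __init__(self):
        self.items = []

    def write_buffer(self, data):
        self.items.extend(data)


links = {'a': ['b', 'c', 'd'], 'b': ['e'], 'c': ['e']}
writer = ListWriter()
ThreadedScheduler(DictDownloader(links), PassThroughParser(),
                  DedupManager(), writer).run('a')
print(writer.items)
```

Waiting for each batch before returning matters: if crawl merely submitted work and returned at once, the run loop could see an empty queue and stop while downloads were still in flight.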

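In the overall design, the main components are the scheduler, downloader, parser, request manager, data writer, and logger, which together cover the needs of a basic crawler. In practice, a crawler also has to set network proxies, use different cookie identities, schedule distributed crawl tasks, and so on. The decorator pattern can be applied to the basic components above to add such features.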
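
As a sketch of the decorator idea applied to the downloader, the wrappers below add a proxy and cookies to a request before passing it on. The proxies and cookies attributes on Request, the class names, and the RecordingDownloader stand-in (which only reports what it was asked to fetch) are all hypothetical.

```python
class Request:
    def __init__(self, url, proxies=None, cookies=None):
        self.url = url
        self.proxies = proxies
        self.cookies = cookies or {}


class RecordingDownloader:
    """Stand-in for a real downloader: returns what it would have sent."""
    def download(self, request):
        return {'url': request.url,
                'proxies': request.proxies,
                'cookies': request.cookies}


class DownloaderDecorator:
    """Same interface as a downloader; forwards to the wrapped instance."""
    def __init__(self, downloader):
        self._downloader = downloader

    def download(self, request):
        return self._downloader.download(request)


class ProxyDownloader(DownloaderDecorator):
    def __init__(self, downloader, proxy_url):
        super().__init__(downloader)
        self._proxy_url = proxy_url

    def download(self, request):
        # Route both schemes through the configured proxy
        request.proxies = {'http': self._proxy_url, 'https': self._proxy_url}
        return super().download(request)


class CookieDownloader(DownloaderDecorator):
    def __init__(self, downloader, cookies):
        super().__init__(downloader)
        self._cookies = cookies

    def download(self, request):
        # Merge the identity cookies into whatever the request already carries
        request.cookies = dict(request.cookies, **self._cookies)
        return super().download(request)


downloader = CookieDownloader(
    ProxyDownloader(RecordingDownloader(), 'http://127.0.0.1:8080'),
    {'session': 'abc'})
result = downloader.download(Request('http://example.test/'))
print(result)
```

Because each wrapper exposes the same download method, the scheduler can use the decorated object without knowing how many layers it has.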

This is a beginner's article, so any advice is very welcome!
