Building a Python Proxy IP Pool

Reposted from: https://blog.csdn.net/qq_42415326/article/details/95044280

Main modules used:

  • requests, lxml, pymongo, Flask

Proxy pool workflow


In words:

  • Proxy IP collection module: scrape proxy IPs -> verify each proxy's availability -> store the usable ones in the database
  • Re-check module: read proxy IPs from the database -> verify their availability -> update or delete them accordingly
  • Proxy API module: serve stable, usable proxy IPs from the database to other crawlers
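The three modules above form one pipeline: collect, validate, store, serve. A minimal runnable sketch of that flow, with stub functions standing in for the real spider and the httpbin check, and a plain list standing in for MongoDB (all names here are illustrative, not from the project):

```python
# Minimal sketch of the pool's pipeline. `fetch_candidates` stands in for the
# spider module, `validate` for the httpbin check, and the list `db` for MongoDB.

def fetch_candidates():
    # stand-in spider: yields (ip, port) pairs scraped from a listing page
    return [("1.1.1.1", "8080"), ("2.2.2.2", "3128"), ("bad", "0")]

def validate(candidate):
    # stand-in availability check: accept anything that looks like an IPv4 address
    ip, _port = candidate
    return ip.count(".") == 3

db = []

# collection module: scrape, check, store only the usable proxies
for candidate in fetch_candidates():
    if validate(candidate):
        db.append(candidate)

# API module: hand a stored proxy to a consuming crawler
def get_proxy():
    return db[0] if db else None

print(len(db), get_proxy())  # -> 2 ('1.1.1.1', '8080')
```

The real modules below flesh out each stub: the spiders produce candidates, httpbin_validator implements the check, MongoPool replaces the list, and the Flask API replaces `get_proxy`.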

Proxy pool project structure:

  • mongo_pool module: CRUD operations on proxy IPs
  • proxy_spider package: collects proxy IPs
  • httpbin_validator module: checks a proxy's availability: speed, protocol type, and anonymity level (the protocol and anonymity labels published on proxy sites are unreliable, so we measure them ourselves)
  • proxy_api module: exposes an interface that hands crawlers stable, usable proxy IPs and lets them flag domains a proxy cannot reach
  • proxy_test module: reads proxy IPs from the database and re-checks their availability on a schedule
  • dbmodle module: the proxy IP data model
  • main module: program entry point
  • http module: supplies request headers with a random User-Agent
  • log module: logging
  • settings module: project configuration

Project implementation approach

Implement the base modules, which depend on nothing else, first; then build the concrete feature modules on top of them.

1. The proxy IP data model class (dbmodle.py)

'''
Proxy IP data model module.
Define a class inheriting from object.
Implement __init__, which initializes the following fields:
    ip: the proxy's IP address
    port: the proxy's port number
    protocol: protocol(s) the proxy supports; 0 = http, 1 = https, 2 = both
    nick_type: anonymity level; 0 = elite (high anonymity), 1 = anonymous, 2 = transparent
    speed: the proxy's response time, in seconds
    area: the region the proxy is located in
    score: the proxy's score; the default is set in the configuration file. During
        availability checks, each failed request subtracts 1 point; at 0 the proxy is
        removed from the pool. When a check succeeds, the score is restored to the default.
    disable_domains: list of unusable domains; some proxies fail for certain domains
        while still working for others
Create the configuration file settings.py and define MAX_SCORE = 50.
'''
from settings import MAX_SCORE


class Proxy(object):

    def __init__(self, ip, port, protocol=-1, nick_type=-1, speed=-1,
                 area=None, score=MAX_SCORE, disable_domains=None):
        # the proxy's IP address
        self.ip = ip
        # the proxy's port number
        self.port = port
        # supported protocol(s): 0 = http, 1 = https, 2 = both
        self.protocol = protocol
        # anonymity level: 0 = elite, 1 = anonymous, 2 = transparent
        self.nick_type = nick_type
        # the proxy's response time
        self.speed = speed
        # the region the proxy is located in
        self.area = area
        # the proxy's score, a measure of its reliability
        self.score = score
        # list of domains this proxy cannot reach
        # (avoid a mutable default argument: create a fresh list per instance)
        self.disable_domains = disable_domains if disable_domains is not None else []

    def __str__(self):
        # return the instance's fields as a string
        return str(self.__dict__)
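The scoring rule described in the docstring (each failed check costs a point, zero means removal from the pool, a successful check restores the default) can be sketched independently of the database layer. `MAX_SCORE` mirrors the settings value; `rescore` is an illustrative helper, not part of the project:

```python
MAX_SCORE = 50  # mirrors settings.MAX_SCORE

def rescore(score, check_ok):
    """Apply the pool's scoring rule to one check result.

    Returns the new score, or None to signal 'delete from the pool'.
    """
    if check_ok:
        return MAX_SCORE           # a working proxy is restored to full score
    score -= 1                     # each failed check costs one point
    return None if score <= 0 else score

assert rescore(50, False) == 49
assert rescore(1, False) is None   # dropped from the pool
assert rescore(7, True) == 50      # restored on success
```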

2. The logging module (log.py)

Goals:

Make the program convenient to debug
Record the program's runtime state
Record error information
Implementation: many ready-made logging modules exist online; in practice we do not write our own from scratch, we simply take one and adapt it:

Move the logging-related configuration into the settings file
Adjust the logging code to read its configuration from there

'''
Logging module.
'''

import sys
# Python's standard logging module
import logging
# add the parent directory to the module search path
sys.path.append("../")

from settings import LOG_LEVEL, LOG_FMT, LOG_DATEFMT, LOG_FILENAME


class Logger(object):

    def __init__(self):
        # get a logger object
        self._logger = logging.getLogger()
        # build the shared formatter
        self.formatter = logging.Formatter(fmt=LOG_FMT, datefmt=LOG_DATEFMT)
        # log output: file handler
        self._logger.addHandler(self._get_file_handler(LOG_FILENAME))
        # log output: console handler
        self._logger.addHandler(self._get_console_handler())
        # set the log level
        self._logger.setLevel(LOG_LEVEL)

    def _get_file_handler(self, filename):
        '''
        Return a file log handler.
        '''
        # handler that writes log records to a file
        filehandler = logging.FileHandler(filename=filename, encoding="utf-8")
        # apply the log format
        filehandler.setFormatter(self.formatter)
        return filehandler

    def _get_console_handler(self):
        '''
        Return a console log handler.
        '''
        # handler that writes log records to stdout
        console_handler = logging.StreamHandler(sys.stdout)
        # apply the log format
        console_handler.setFormatter(self.formatter)
        return console_handler

    # property returning the configured logger object
    @property
    def logger(self):
        return self._logger


# Initialize and configure a single logger object (effectively a singleton);
# callers just import `logger` and use it.
logger = Logger().logger

if __name__ == '__main__':
    print(logger)
    logger.debug("debug message")
    logger.info("info message")
    logger.warning("warning message")
    logger.error("error message")
    logger.critical("critical message")

3. Validating a proxy IP's protocol type, anonymity level, and speed (httpbin_validator.py)

'''
Proxy speed check: the elapsed time between sending the request and receiving the response.
Anonymity check:
        Send a request to http://httpbin.org/get or https://httpbin.org/get.
        If 'origin' contains two IPs separated by ',': transparent proxy.
        Else if the echoed headers contain Proxy-Connection: anonymous proxy.
        Else: elite (high-anonymity) proxy.
Protocol check:
    If a request to http://httpbin.org/get succeeds, the proxy supports http.
    If a request to https://httpbin.org/get succeeds, the proxy supports https.
'''
import sys
import time
import json
import requests
sys.path.append("..")
from proxy_utils import random_headers
from settings import CHECK_TIMEOUT
from proxy_utils.log import logger
from dbmodle import Proxy


def check_proxy(proxy):
    '''
    Check whether http and https requests each succeed through this proxy.
    '''
    # proxy URLs for requests
    proxies = {
        'http': 'http://{}:{}'.format(proxy.ip, proxy.port),
        'https': 'https://{}:{}'.format(proxy.ip, proxy.port),
    }

    http, http_nick_type, http_speed = http_check_proxies(proxies)
    https, https_nick_type, https_speed = http_check_proxies(proxies, False)

    if http and https:
        proxy.protocol = 2  # supports both http and https
        proxy.nick_type = http_nick_type
        proxy.speed = http_speed
    elif http:
        proxy.protocol = 0  # http only
        proxy.nick_type = http_nick_type
        proxy.speed = http_speed
    elif https:
        proxy.protocol = 1  # https only
        proxy.nick_type = https_nick_type
        proxy.speed = https_speed
    else:
        proxy.protocol = -1
        proxy.nick_type = -1
        proxy.speed = -1

    # logger.debug(proxy)

    return proxy


def http_check_proxies(proxies, is_http=True):
    '''
    Send a test request through the proxy and measure it.
    '''
    nick_type = -1  # anonymity level
    speed = -1      # response time
    if is_http:
        test_url = 'http://httpbin.org/get'
    else:
        test_url = 'https://httpbin.org/get'
    try:
        # time the request
        start_time = time.time()
        res = requests.get(test_url, headers=random_headers.get_request_headers(),
                           proxies=proxies, timeout=CHECK_TIMEOUT)
        cost_time = time.time() - start_time

        if res.status_code == 200:
            # response time, rounded to two decimals
            speed = round(cost_time, 2)
            # parse the response body into a dict
            res_dict = json.loads(res.text)
            # the request's origin IP as seen by httpbin
            origin_ip = res_dict['origin']
            # the echoed 'Proxy-Connection' header; if present, the proxy is anonymous
            proxy_connection = res_dict['headers'].get('Proxy-Connection', None)

            if ',' in origin_ip:
                # two comma-separated IPs in origin: transparent proxy
                nick_type = 2
            elif proxy_connection:
                # 'Proxy-Connection' present: anonymous proxy
                nick_type = 1
            else:
                nick_type = 0  # elite (high anonymity)
            return True, nick_type, speed
        else:
            return False, nick_type, speed
    except Exception as e:
        # logger.exception(e)
        return False, nick_type, speed


if __name__ == '__main__':
    proxy = Proxy('60.13.42.94', '9999')
    result = check_proxy(proxy)
    print(result)
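The anonymity rules above depend only on the parsed httpbin JSON, so they can be isolated in a small pure function. The sample responses below are made up for illustration, shaped like httpbin.org/get output:

```python
def classify_anonymity(res_dict):
    """0 = elite (high anonymity), 1 = anonymous, 2 = transparent,
    following the same rules as http_check_proxies."""
    if "," in res_dict["origin"]:
        return 2  # two comma-separated IPs in origin: transparent
    if res_dict["headers"].get("Proxy-Connection"):
        return 1  # Proxy-Connection header leaked through: anonymous
    return 0      # neither signal: elite

# illustrative responses, shaped like httpbin.org/get output
transparent = {"origin": "1.2.3.4, 5.6.7.8", "headers": {}}
anonymous = {"origin": "5.6.7.8", "headers": {"Proxy-Connection": "keep-alive"}}
elite = {"origin": "5.6.7.8", "headers": {}}

assert classify_anonymity(transparent) == 2
assert classify_anonymity(anonymous) == 1
assert classify_anonymity(elite) == 0
```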

4. The database module: CRUD plus the queries the API needs (mongo_pool.py)

Define a MongoPool class, inheriting from object:

  1. In __init__, open the database connection and get the collection to operate on; close the connection in __del__
  2. Provide basic CRUD operations
    1. insert a proxy
    2. update a proxy
    3. delete a proxy, keyed by its IP
    4. query all proxy IPs
  3. Provide the queries the proxy API module needs
    1. conditional query with an optional result limit, sorted by score descending then speed ascending, so the best proxies come first
    2. get a list of proxy IPs filtered by protocol type and target domain
    3. get one random proxy IP filtered by protocol type and target domain
    4. append a given domain to a given IP's disable_domains list
'''
CRUD operations on the proxies collection, plus the queries the proxy API uses.
'''
import sys
import random
import pymongo

sys.path.append("..")
from settings import MONGO_URL
from proxy_utils.log import logger
from dbmodle import Proxy


class MongoPool(object):

    def __init__(self):
        # connect to the database
        self.client = pymongo.MongoClient(MONGO_URL)
        # get the collection to operate on
        self.proxies = self.client['proxy_pool']['proxies']

    def __del__(self):
        # close the database connection
        self.client.close()

    def insert(self, proxy):
        '''
        Insert a proxy IP.
        '''
        count = self.proxies.count_documents({'_id': proxy.ip})
        if count == 0:
            # convert the Proxy object to a dict
            proxy_dict = proxy.__dict__
            # use the IP as the primary key
            proxy_dict['_id'] = proxy.ip
            # insert into the proxies collection
            self.proxies.insert_one(proxy_dict)
            logger.info('inserted new proxy: {}'.format(proxy))
        else:
            logger.warning('proxy already exists: {}'.format(proxy))

    def update(self, proxy):
        '''
        Update a proxy IP in the database.
        '''
        self.proxies.update_one({'_id': proxy.ip}, {'$set': proxy.__dict__})
        logger.info('updated proxy: {}'.format(proxy))

    def delete(self, proxy):
        '''
        Delete a proxy IP from the database.
        '''
        self.proxies.delete_one({'_id': proxy.ip})
        logger.info('deleted proxy: {}'.format(proxy))

    def find_all(self):
        '''
        Query all proxy IPs in the database.
        '''
        cursor = self.proxies.find()

        for item in cursor:
            # drop the _id key before rebuilding the model object
            item.pop('_id')
            proxy = Proxy(**item)
            # yield one Proxy at a time
            yield proxy

    def limit_find(self, conditions={}, count=0):
        '''Conditional query with an optional result limit,
        sorted by score descending then speed ascending,
        so the best proxies come first.'''
        cursor = self.proxies.find(conditions, limit=count).sort([
            ('score', pymongo.DESCENDING), ('speed', pymongo.ASCENDING)])
        # collect the matching proxies
        proxy_list = []

        for item in cursor:
            item.pop('_id')
            proxy = Proxy(**item)
            proxy_list.append(proxy)
        return proxy_list

    def get_proxies(self, protocol=None, domain=None, nick_type=0, count=0):
        '''
        Get a list of proxy IPs filtered by protocol type and target domain.
        '''
        conditions = {'nick_type': nick_type}
        if protocol is None:
            conditions['protocol'] = 2
        elif protocol.lower() == 'http':
            conditions['protocol'] = {'$in': [0, 2]}
        else:
            conditions['protocol'] = {'$in': [1, 2]}

        if domain:
            conditions['disable_domains'] = {'$nin': [domain]}

        return self.limit_find(conditions, count=count)

    def random_proxy(self, protocol=None, domain=None, count=0, nick_type=0):
        '''
        Get one random proxy IP filtered by protocol type and target domain.
        '''
        proxy_list = self.get_proxies(protocol=protocol, domain=domain,
                                      count=count, nick_type=nick_type)

        return random.choice(proxy_list)

    def add_disable_domain(self, ip, domain):
        '''
        Append the domain to the IP's disable_domains list, if not already present.
        '''
        count = self.proxies.count_documents({'_id': ip, 'disable_domains': domain})
        if count == 0:
            self.proxies.update_one({'_id': ip}, {'$push': {'disable_domains': domain}})


if __name__ == '__main__':
    mongo = MongoPool()
    # insert test
    # proxy = Proxy('202.104.113.32', '53281')
    # mongo.insert(proxy)

    # update test
    # proxy = Proxy('202.104.113.32', '8888')
    # mongo.update(proxy)

    # delete test
    # proxy = Proxy('202.104.113.32', '8888')
    # mongo.delete(proxy)

    # find_all test
    # for proxy in mongo.find_all():
    #     print(proxy)
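The filter document that `get_proxies` builds for MongoDB can be inspected in isolation. The helper below mirrors that logic without touching pymongo (a sketch for illustration, not project code; the example domain is made up):

```python
def build_conditions(protocol=None, domain=None, nick_type=0):
    """Build the MongoDB filter used when serving proxies:
    protocol 0 = http, 1 = https, 2 = both."""
    conditions = {"nick_type": nick_type}
    if protocol is None:
        conditions["protocol"] = 2                    # must support both
    elif protocol.lower() == "http":
        conditions["protocol"] = {"$in": [0, 2]}      # http or both
    else:
        conditions["protocol"] = {"$in": [1, 2]}      # https or both
    if domain:
        # exclude proxies already flagged as unusable for this site
        conditions["disable_domains"] = {"$nin": [domain]}
    return conditions

assert build_conditions() == {"nick_type": 0, "protocol": 2}
assert build_conditions("http", "jd.com") == {
    "nick_type": 0,
    "protocol": {"$in": [0, 2]},
    "disable_domains": {"$nin": ["jd.com"]},
}
```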

5. Request headers with a random User-Agent (random_headers.py)


'''
Build request headers with a random User-Agent.
'''
import random

# pool of User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]
 
# build request headers with a randomly chosen User-Agent
def get_request_headers():
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Referer': 'https://www.baidu.com',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    return headers
 
    
 
if __name__ == '__main__':
    # check that the User-Agent really varies between calls
    print(get_request_headers())
    print("------------" * 20)
    print(get_request_headers())

6. A generic spider, the parent class for the concrete spiders (base_spider.py)

'''
Generic spider: given a list of URLs, a group XPath, and per-field (detail) XPaths,
extract proxy IPs from different sites.
Define a BaseSpider class, inheriting from object:
  - three class attributes: urls, group_xpath, detail_xpath (ip, port, area)
  - an initializer taking the URL list, the group XPath, and the detail (in-group) XPath
  - a public method that yields proxy IPs
'''
 
import sys
import time
import random
import requests
from lxml import etree
sys.path.append('..')
from proxy_utils.random_headers import get_request_headers
from dbmodle import Proxy


class BaseSpider(object):
    # class attributes
    # list of page URLs to scrape proxies from
    urls = []
    # group XPath: selects the list of elements, one per proxy
    group_xpath = ''
    # detail XPath within each group, in the form {'ip': 'xx', 'port': 'xx', 'area': 'xx'}
    detail_xpath = {}

    def __init__(self, urls=[], group_xpath='', detail_xpath={}):
        # initializer taking the URL list, group XPath, and detail (in-group) XPath
        if urls:
            self.urls = urls
        if group_xpath:
            self.group_xpath = group_xpath
        if detail_xpath:
            self.detail_xpath = detail_xpath

    def get_proxies(self):
        # fetch and parse each page
        for url in self.urls:
            page_html = self.get_page(url)
            proxies = self.get_html_proxies(page_html)
            # `yield from` re-yields each proxy produced by the parser
            yield from proxies

    def get_page(self, url):
        # request the page
        res = requests.get(url, headers=get_request_headers())
        # sleep 1-5 seconds between requests
        time.sleep(random.uniform(1, 5))
        return res.content

    def get_html_proxies(self, page_html):
        element = etree.HTML(page_html)
        trs = element.xpath(self.group_xpath)
        for tr in trs:
            ip = self.get_list_first(tr.xpath(self.detail_xpath['ip']))
            port = self.get_list_first(tr.xpath(self.detail_xpath['port']))
            area = self.get_list_first(tr.xpath(self.detail_xpath['area']))
            proxy = Proxy(ip, port, area=area)
            yield proxy

    def get_list_first(self, lst):
        # return the first element of the list, or '' if it is empty
        return lst[0] if len(lst) != 0 else ''
 
if __name__ == '__main__':
    config = {
        'urls': ['http://www.ip3366.net/free/?stype=1&page={}'.format(i) for i in range(1, 3)],
        'group_xpath': '//*[@id="list"]/table/tbody/tr',
        'detail_xpath': {
            'ip': './td[1]/text()',
            'port': './td[2]/text()',
            'area': './td[5]/text()',
        },
    }
    spider = BaseSpider(**config)
    for proxy in spider.get_proxies():
        print(proxy)
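The group-XPath / detail-XPath split can be seen on a tiny inline page: the group XPath selects one element per proxy, and the detail XPaths are evaluated relative to each of them. The HTML fragment here is made up, shaped like a free-proxy listing:

```python
from lxml import etree

# a made-up fragment shaped like a free-proxy listing table
HTML = """
<table id="list"><tbody>
  <tr><td>1.1.1.1</td><td>8080</td><td>http</td><td>elite</td><td>Area A</td></tr>
  <tr><td>2.2.2.2</td><td>3128</td><td>http</td><td>anon</td><td>Area B</td></tr>
</tbody></table>
"""

group_xpath = '//*[@id="list"]/tbody/tr'   # one <tr> per proxy
detail_xpath = {'ip': './td[1]/text()',    # evaluated relative to each <tr>
                'port': './td[2]/text()',
                'area': './td[5]/text()'}

def first(lst):
    # same fallback BaseSpider uses: first element, or '' if nothing matched
    return lst[0] if lst else ''

element = etree.HTML(HTML)
rows = [(first(tr.xpath(detail_xpath['ip'])),
         first(tr.xpath(detail_xpath['port'])),
         first(tr.xpath(detail_xpath['area'])))
        for tr in element.xpath(group_xpath)]

print(rows)  # -> [('1.1.1.1', '8080', 'Area A'), ('2.2.2.2', '3128', 'Area B')]
```

This is exactly what makes the parent class reusable: each concrete spider only swaps in its own URLs and XPath strings.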

7. The concrete spider classes (proxy_spiders.py)

'''
Concrete spider classes.
'''
# import time
# import random
import sys
sys.path.append('../')
# import re
# import js2py
from proxy_spider.base_spider import BaseSpider
 
class XiciSpider(BaseSpider):
    '''Xici proxy spider'''
    urls = ['http://www.xicidaili.com/nn/{}'.format(i) for i in range(1, 21)]

    group_xpath = '//*[@id="ip_list"]//tr[position()>1]'
    detail_xpath = {
        'ip': './td[2]/text()',
        'port': './td[3]/text()',
        'area': './td[4]/a/text()',
    }
 
class Ip3366Spider(BaseSpider):
    '''
    ip3366 proxy spider
    '''

    urls = ['http://www.ip3366.net/free/?stype={}&page={}'.format(i, j) for i in range(1, 4, 2) for j in range(1, 8)]

    group_xpath = '//*[@id="list"]/table/tbody/tr'
    detail_xpath = {
        'ip': './td[1]/text()',
        'port': './td[2]/text()',
        'area': './td[5]/text()',
    }
 
 
class kuaiSpider(BaseSpider):
    '''
    Kuaidaili proxy spider
    '''

    urls = ['http://www.kuaidaili.com/free/in{}/{}'.format(i, j) for i in ['ha', 'tr'] for j in range(1, 21)]

    group_xpath = '//*[@id="list"]/table/tbody/tr'
    detail_xpath = {
        'ip': './td[1]/text()',
        'port': './td[2]/text()',
        'area': './td[5]/a/text()',
    }

    '''
    def get_page(self, url):
        # extra random wait between requests
        time.sleep(random.uniform(1, 2))
        return super().get_page(url)
    '''
 
class Free89ipSpider(BaseSpider):
    '''
    89ip proxy spider
    '''
    urls = ['http://www.89ip.cn/index{}.html'.format(i) for i in range(1, 17)]

    group_xpath = '//div[3]//table/tbody/tr'
    detail_xpath = {
        'ip': './td[1]/text()',
        'port': './td[2]/text()',
        'area': './td[3]/text()',
    }

    def get_page(self, url):
        return super().get_page(url).decode()

    def get_proxies(self):
        proxies = super().get_proxies()
        for item in proxies:
            # this site pads its cells with whitespace; strip it out
            item.ip = str(item.ip).replace("\n", "").replace("\t", "")
            item.area = str(item.area).replace("\n", "").replace("\t", "")
            item.port = str(item.port).replace("\n", "").replace("\t", "")
            # yield the cleaned Proxy object
            yield item
 
if __name__ == '__main__':
    spider = Free89ipSpider()
    count = 0
    for proxy in spider.get_proxies():
        count += 1
        print(proxy)

8. The spider runner module (run_spider.py)

'''
Create a RunSpider class:
    run() is the entry point: it collects the spider list, runs each spider, checks every
    scraped proxy IP, writes the usable ones to the database, and handles spider exceptions
    Each spider runs as a coroutine (gevent), so proxies are scraped concurrently
    The schedule module re-runs the whole crawl at a fixed interval
'''
 
from gevent import monkey
monkey.patch_all()
from gevent.pool import Pool

import importlib
import sys
import time
import schedule
sys.path.append('../')
from settings import PROXIES_SPIDERS, SPIDERS_RUN_INTERVAL
from proxy_validate.httpbin_validator import check_proxy
from proxies_db.mongo_pool import MongoPool
from proxy_utils.log import logger


class RunSpider(object):

    def __init__(self):
        self.mongo_pool = MongoPool()
        self.coroutine_pool = Pool()

    def get_spider_from_settings(self):
        '''
        Instantiate each spider class listed in the settings file.
        '''
        for full_class_name in PROXIES_SPIDERS:
            module_name, class_name = full_class_name.rsplit('.', maxsplit=1)
            # import the module dynamically
            module = importlib.import_module(module_name)

            cls = getattr(module, class_name)
            spider = cls()
            yield spider

    def run(self):
        '''
        Walk the spider objects and run each one's get_proxies method.
        '''
        spiders = self.get_spider_from_settings()
        for spider in spiders:
            self.coroutine_pool.apply_async(self.__run_one_spider, args=(spider,))
        # block the current thread until every spider has finished
        self.coroutine_pool.join()

    def __run_one_spider(self, spider):
        try:
            for proxy in spider.get_proxies():
                time.sleep(0.1)
                checked_proxy = check_proxy(proxy)
                if proxy.speed != -1:
                    self.mongo_pool.insert(checked_proxy)
        except Exception as er:
            logger.exception(er)
            logger.exception("spider {} raised an error".format(spider))

    @classmethod
    def start(cls):
        '''
        Class method: run the spiders at the interval (in hours) set in the settings file.
        '''
        rs = RunSpider()
        rs.run()
        schedule.every(SPIDERS_RUN_INTERVAL).hours.do(rs.run)

        while 1:
            schedule.run_pending()
            time.sleep(60)


if __name__ == '__main__':
    # run via the class method
    RunSpider.start()
    # app = RunSpider()
    # app.run()

    # schedule smoke test
    '''def task():
        print("haha")
    schedule.every(10).seconds.do(task)
    while 1:
        schedule.run_pending()
        time.sleep(1)'''

9. The proxy re-check module (proxy_test.py)

'''
Periodically re-check the availability of the proxy IPs in the database,
re-score them, and update the database.
'''
from gevent import monkey
monkey.patch_all()
from gevent.pool import Pool
from queue import Queue
import schedule
import sys
import time
sys.path.append('../')
from proxy_validate.httpbin_validator import check_proxy
from proxies_db.mongo_pool import MongoPool
from settings import TEST_PROXIES_ASYNC_COUNT, MAX_SCORE, TEST_RUN_INTERVAL


class DbProxiesCheck(object):

    def __init__(self):
        # database access object
        self.mongo_pool = MongoPool()
        # queue of proxies waiting to be checked
        self.queue = Queue()
        # coroutine pool
        self.coroutine_pool = Pool()

    # async callback: when one check finishes, schedule the next one
    def __check_callback(self, temp):
        self.coroutine_pool.apply_async(self.__check_one, callback=self.__check_callback)

    def run(self):
        # core logic of one re-check pass
        proxies = self.mongo_pool.find_all()

        for proxy in proxies:
            self.queue.put(proxy)

        # start the async worker tasks
        for i in range(TEST_PROXIES_ASYNC_COUNT):
            # the callback re-schedules __check_one, so each worker keeps
            # pulling proxies until the queue is drained
            self.coroutine_pool.apply_async(self.__check_one, callback=self.__check_callback)
        # block the current thread until every queued proxy has been processed
        self.queue.join()

    def __check_one(self):
        # check one proxy's availability
        # take one proxy off the queue
        proxy = self.queue.get()

        checked_proxy = check_proxy(proxy)

        if checked_proxy.speed == -1:
            checked_proxy.score -= 1
            if checked_proxy.score == 0:
                self.mongo_pool.delete(checked_proxy)
            else:
                self.mongo_pool.update(checked_proxy)
        else:
            checked_proxy.score = MAX_SCORE
            self.mongo_pool.update(checked_proxy)
        # mark one queue task as done
        self.queue.task_done()

    @classmethod
    def start(cls):
        '''
        Class method: re-check the database's proxies at the interval (in hours)
        set in the settings file.
        '''
        test = DbProxiesCheck()
        test.run()
        schedule.every(TEST_RUN_INTERVAL).hours.do(test.run)

        while 1:
            schedule.run_pending()
            time.sleep(60)


if __name__ == '__main__':
    DbProxiesCheck.start()
    # test = DbProxiesCheck()
    # test.run()
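The queue-driven pattern used here (a shared queue of proxies, N concurrent workers, `task_done`/`join` to detect completion) can be sketched with stdlib threads standing in for gevent coroutines; the availability check and the scoring store are stubs, and all names are illustrative:

```python
import queue
import threading

MAX_SCORE = 50            # mirrors settings.MAX_SCORE
q = queue.Queue()
results = {}              # stand-in for MongoDB: proxy -> score
lock = threading.Lock()

def stub_check(proxy):
    # stand-in for check_proxy: treat an even last octet as "still usable"
    return int(proxy.rsplit(".", 1)[1]) % 2 == 0

def worker():
    # each worker keeps pulling proxies until the queue is drained
    while True:
        try:
            proxy = q.get_nowait()
        except queue.Empty:
            return
        # usable proxies are restored to full score; failures lose a point
        new_score = MAX_SCORE if stub_check(proxy) else MAX_SCORE - 1
        with lock:
            results[proxy] = new_score
        q.task_done()     # one queued proxy fully processed

for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]:
    q.put(ip)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
q.join()                  # block until every queued proxy has been checked

assert results == {"10.0.0.1": 49, "10.0.0.2": 50,
                   "10.0.0.3": 49, "10.0.0.4": 50}
```

The real module uses gevent's `apply_async` with a callback instead of a worker loop, but the completion signal is the same: `queue.join()` returns only after every `get` has been matched by a `task_done`.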

10. The proxy pool API module (proxy_api.py)

'''
An interface that supplies crawlers with stable, usable proxy IPs:
    given a protocol type and domain, return one random usable proxy IP
    given a protocol type and domain, return several highly available proxy IPs
    append an unusable domain to a given IP's disable_domains list
'''
from flask import Flask
from flask import request
import json

from proxies_db.mongo_pool import MongoPool

from settings import PROXIES_MAX_COUNT


class ProxyApi(object):

    def __init__(self):

        self.app = Flask(__name__)

        # database access object
        self.mongo_pool = MongoPool()

        # read the query parameters from the request URL
        @self.app.route('/random')
        def random():
            protocol = request.args.get('protocol')
            domain = request.args.get('domain')
            proxy = self.mongo_pool.random_proxy(protocol, domain, count=PROXIES_MAX_COUNT)

            if protocol:
                return '{}://{}:{}'.format(protocol, proxy.ip, proxy.port)
            else:
                return '{}:{}'.format(proxy.ip, proxy.port)

        @self.app.route('/proxies')
        def proxies():
            protocol = request.args.get('protocol')
            domain = request.args.get('domain')
            proxies = self.mongo_pool.get_proxies(protocol, domain, count=PROXIES_MAX_COUNT)
            # `proxies` is a list of Proxy objects; convert it to a list of dicts
            proxies_dict_list = [proxy.__dict__ for proxy in proxies]
            return json.dumps(proxies_dict_list)

        @self.app.route('/disable_domain')
        def disable_domain():
            ip = request.args.get('ip')
            domain = request.args.get('domain')

            if ip is None:
                return 'please provide the ip parameter'
            if domain is None:
                return 'please provide the domain parameter'
            self.mongo_pool.add_disable_domain(ip, domain)
            return 'disabled domain {} for {}'.format(domain, ip)

    def run(self, debug):
        self.app.run('0.0.0.0', port=16888, debug=debug)

    @classmethod
    def start(cls, debug=None):
        proxy_api = cls()
        proxy_api.run(debug=debug)


if __name__ == '__main__':
    ProxyApi.start(debug=True)
    # proxy_api = ProxyApi()
    # proxy_api.run(debug=True)
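A crawler consuming this API only needs to build the right URLs. The helper below does that with the standard library; the host and port match the `run` method above, while `api_url` itself and the example domain are illustrative, not part of the project:

```python
from urllib.parse import urlencode

API_BASE = "http://127.0.0.1:16888"  # matches the host/port in ProxyApi.run

def api_url(endpoint, **params):
    """Build a request URL for the proxy-pool API, dropping unset parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return "{}/{}{}".format(API_BASE, endpoint, ("?" + query) if query else "")

assert api_url("random", protocol="https", domain="jd.com") == \
    "http://127.0.0.1:16888/random?protocol=https&domain=jd.com"
assert api_url("proxies") == "http://127.0.0.1:16888/proxies"
```

A consumer would then call, for example, `requests.get(api_url("random", protocol="http")).text` to obtain an `ip:port` string to plug into its own `proxies` dict.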

11. The proxy pool entry point (main.py)

'''
Unified entry point for the proxy pool:
   start separate processes for the spiders, the proxy re-check, and the web service
'''

from multiprocessing import Process
from proxy_spider.run_spider import RunSpider
from proxy_test import DbProxiesCheck
from proxy_api import ProxyApi


def run():
    process_list = []
    # start the spiders
    process_list.append(Process(target=RunSpider.start))
    # start the re-check
    process_list.append(Process(target=DbProxiesCheck.start))
    # start the web service
    process_list.append(Process(target=ProxyApi.start))

    for process in process_list:
        # run as a daemon process
        process.daemon = True
        process.start()
    # the main process waits for the children to finish
    for process in process_list:
        process.join()


if __name__ == '__main__':
    run()

12. The configuration module (settings.py)

import logging

# default (maximum) score for a proxy IP
MAX_SCORE = 50

# logging defaults:
# default level
LOG_LEVEL = logging.DEBUG
# default log format
LOG_FMT = '%(asctime)s %(filename)s [line:%(lineno)d] %(levelname)s: %(message)s'
# default timestamp format
LOG_DATEFMT = '%Y-%m-%d %H:%M:%S'
# default log file name
LOG_FILENAME = 'log.log'

# request timeout for proxy checks
CHECK_TIMEOUT = 10

# MongoDB connection URL
MONGO_URL = 'mongodb://127.0.0.1:27017/'

# list of concrete spider classes to run
PROXIES_SPIDERS = [
    "proxy_spider.proxy_spiders.XiciSpider",
    "proxy_spider.proxy_spiders.Ip3366Spider",
    "proxy_spider.proxy_spiders.kuaiSpider",
    "proxy_spider.proxy_spiders.Free89ipSpider",
]

# interval between crawl runs, in hours
SPIDERS_RUN_INTERVAL = 4

# number of concurrent re-check workers
TEST_PROXIES_ASYNC_COUNT = 10

# interval between database re-check runs, in hours
TEST_RUN_INTERVAL = 2

# maximum number of proxies returned by the API
PROXIES_MAX_COUNT = 50