Alibaba Cloud Tianchi: Financial Risk Control - Loan Default Prediction (Part 1)

Understanding the problem

The competition is set against the background of personal credit in financial risk control. Contestants must predict, from a loan applicant's data, whether the applicant is likely to default, which in turn determines whether the loan should be approved. This is a typical classification problem. The problem is meant to introduce some of the business background of financial risk control, solve a practical problem, and help competition newcomers practice and improve on their own.

The data here are provided ready-made, but in real risk-control work the definitions of both the features X and the label y take real thought.
In short, the timeline has to be split into observation period, observation point, and performance period.
Observation period: determines X, using telecom-operator data, e-commerce data, financial-institution data, third-party data, etc., plus derived features.
Observation point: usually the credit-granting date.
Performance period (determined with the help of roll-rate analysis and vintage analysis): determines the label y.

Competition overview

Contestants are asked to build a model on the given dataset and predict financial risk. The dataset becomes visible and downloadable after registration; it comes from the loan records of a lending platform, with more than 1.2 million rows and 47 columns, 15 of which are anonymized variables. To keep the competition fair, 800,000 rows are drawn as the training set, 200,000 as test set A, and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode and title are desensitized.

Data overview

In general, the competition page gives a description of each column (except the anonymized features), explaining what the column represents. Knowing what the columns mean helps us understand the data and guides later analysis. Tip: an anonymized feature is a column whose meaning has not been disclosed.
train.csv

  1. id: unique credit identifier assigned to the loan record
  2. loanAmnt: loan amount
  3. term: loan term (years)
  4. interestRate: loan interest rate
  5. installment: installment payment amount
  6. grade: loan grade
  7. subGrade: sub-grade of the loan grade
  8. employmentTitle: employment title
  9. employmentLength: employment length (years)
  10. homeOwnership: home-ownership status provided by the borrower at registration
  11. annualIncome: annual income
  12. verificationStatus: verification status
  13. issueDate: month in which the loan was issued
  14. purpose: loan-purpose category provided by the borrower in the application
  15. postCode: first 3 digits of the postal code provided by the borrower in the application
  16. regionCode: region code
  17. dti: debt-to-income ratio
  18. delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
  19. ficoRangeLow: lower bound of the borrower's FICO range at loan origination
  20. ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
  21. openAcc: number of open credit lines in the borrower's credit file
  22. pubRec: number of derogatory public records
  23. pubRecBankruptcies: number of public-record bankruptcies
  24. revolBal: total revolving credit balance
  25. revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
  26. totalAcc: total number of credit lines currently in the borrower's credit file
  27. initialListStatus: initial listing status of the loan
  28. applicationType: whether the loan is an individual application or a joint application with two co-borrowers
  29. earliesCreditLine: month in which the borrower's earliest reported credit line was opened
  30. title: loan title provided by the borrower
  31. policyCode: publicly available policy code = 1; new products not publicly available policy code = 2
  32. n-series anonymized features: n0-n14, derived from counts of borrower behavior

Evaluation metric

The competition uses AUC as the evaluation metric. AUC (Area Under Curve) is defined as the area under the ROC curve, bounded by the coordinate axes.

Why AUC? (When I was interning, I started by building a model with 90%+ accuracy and proudly took it to my manager, only to be told that the model's input variables were problematic and should be re-screened.)
Later I also learned that AUC alone is not enough; it should be evaluated together with the KS statistic.

Why don't risk-control models use accuracy as the yardstick? Why AUC and KS instead?

Because a risk-control model is not like a cat-vs-dog binary classification problem. Credit risk control pursues a balance between risk and return, so the definition of good and bad customers is often fuzzy. The reason is that bad customers, although they cause charge-off losses, also bring in interest and penalty income. How bad a customer base can we accept? That depends on our risk tolerance. In risk control, the definition of y is not black-and-white (discrete); measuring it with a probability distribution (continuous) may be more reasonable.

Another issue is that sample imbalance is very common in risk-control scenarios; the ratio of positive to negative samples can reach 1:100 or worse. In that case accuracy is unreliable, because predicting every sample as negative already yields very high accuracy. For example, if a dataset contains 95 cats and 5 dogs, a classifier that simply labels everything as a cat achieves 95% accuracy. Evaluating accuracy is therefore meaningless.
(References: Customer-level application scorecard (A-card) models;
Risk-control models: an in-depth understanding and application of the KS discrimination metric)
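A tiny numeric illustration of the point above, with made-up labels (not competition data):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 95 'cats' (label 0) and 5 'dogs' (label 1); always predict the majority class
y_true = np.array([0]*95 + [1]*5)
y_pred = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_pred))   # 0.95: looks great
print(roc_auc_score(y_true, y_pred))    # 0.5: no ranking ability at all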

AUC

True positive rate: TPR = \frac{TP}{TP+FN}
False positive rate: FPR = \frac{FP}{FP+TN}

The business goal: a higher TPR (we caught the right ones) and a lower FPR (we caught the wrong ones).

  1. Pick a threshold T: samples scoring below the threshold are predicted bad, those above are predicted good; then compute TPR and FPR.
  2. Repeat with different thresholds T to obtain many pairs of TPR and FPR.
  3. Plot the ROC curve with FPR on the x-axis and TPR on the y-axis; the area under the curve is the AUC.

What we want is a higher TPR and a lower FPR, so we can define the following objective function (which is exactly KS):

KS = MAX(|TPR - FPR|)

Since TPR is larger than FPR, we have TPR = KS + FPR. In other words, the intercept of the 45-degree (slope-1) line through a point on the ROC curve reflects the size of KS, and KS measures the separation between the cumulative bad rate (TPR = the region left of the threshold that is predicted bad & truly bad, divided by everything truly bad) and the cumulative good rate (FPR = the region left of the threshold that is predicted bad & truly good, divided by everything truly good).
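A minimal sketch of computing AUC and KS from predicted probabilities with sklearn (y_true and y_prob are placeholders, with 1 meaning bad):

from sklearn.metrics import roc_curve, roc_auc_score

def auc_ks(y_true, y_prob):
    # KS is the maximal gap between the cumulative bad rate and the cumulative good rate
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auc = roc_auc_score(y_true, y_prob)
    ks = max(tpr - fpr)
    return auc, ks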

For details, see the articles by 求是旺在路上; it was only after reading several of his posts that I developed a deeper understanding of risk control.

  1. If we want KS to be as large as possible, the cut point should be as close as possible to (0, 1), and in that case AUC generally increases as well.
  2. For the same KS value there are two candidate points on the KS curve, one where TPR and FPR are both large and one where both are small. Although the goal is usually to catch more bad customers (higher TPR) while wrongly flagging as few good customers as possible (lower FPR), the two have to be traded off. Which threshold to choose depends on the business goal: higher recall of bad customers, or less collateral damage to good customers?
  3. Since KS is measured only at the single point of maximal separation, it is not comprehensive on its own; in practice KS is usually considered together with AUC (or Gini).

A basic scorecard model (logistic regression)

Two excellent scorecard-modeling packages are used here: toad and scorecardpy.

If it runs slowly locally, you can upload the data to Kaggle or to Tianchi's online notebook environment.

pip install toad
pip install scorecardpy

References:
https://github.com/ShichenXie/scorecardpy has worked examples (the examples contain a few small mistakes, but nothing that affects usage)
https://toad.readthedocs.io/en/latest/ also has a Chinese tutorial, although the English docs read better

Import the packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
import toad
import scorecardpy as sc

Read the data

data_train = pd.read_csv('../input/fengkong/train.csv', index_col='id')
data_test_a = pd.read_csv('../input/fengkong/testA.csv', index_col='id')

Data cleaning

Separate the numeric columns from the non-numeric columns (object dtype: dates, strings, etc.).

'''
# non-numeric columns
s = data_train.apply(lambda x:x.dtype)
tecols = s[s=='object'].index.tolist()
'''
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)

Encode the non-numeric columns

category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

The initial idea: map 'grade' and 'subGrade' to ordinal codes (LabelEncoder-style), extract the number from the string for 'employmentLength', and convert 'issueDate' and 'earliesCreditLine' to the difference in days from the earliest date.

'''
for data in [data_train, data_test_a]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    data['subGrade'] = data['subGrade'].map({'A1':1.0,'A2':1.2,'A3':1.4,'A4':1.6,'A5':1.8,
                                       'B1':2.0,'B2':2.2,'B3':2.4,'B4':2.6,'B5':2.8,
                                       'C1':3.0,'C2':3.2,'C3':3.4,'C4':3.6,'C5':3.8,
                                       'D1':4.0,'D2':4.2,'D3':4.4,'D4':4.6,'D5':4.8,
                                       'E1':5.0,'E2':5.2,'E3':5.4,'E4':5.6,'E5':5.8,
                                       'F1':6.0,'F2':6.2,'F3':6.4,'F4':6.6,'F5':6.8,
                                       'G1':7.0,'G2':7.2,'G3':7.4,'G4':7.6,'G5':7.8,
                                       })

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])
    
for data in [data_train, data_test_a]:
    data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
    data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    
data_train['employmentLength'].value_counts(dropna=False).sort_index()
'''
'''
# convert to datetime format
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # build a time feature: days since the start date
    data['issueDate'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
data_train['issueDate'].sample(5)
'''
'''
for data in [data_train, data_test_a]:
    data['earliesCreditLine'] = pd.to_datetime(data['earliesCreditLine'])
    startdate = np.min(data['earliesCreditLine'])
    # build a time feature: days since the earliest credit line in the data
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda x: x-startdate).dt.days
data_train['earliesCreditLine'].sample(5)
'''

Later, though, I found a much simpler approach: TargetEncoder
https://zhuanlan.zhihu.com/p/40231966
https://blog.csdn.net/SHU15121856/article/details/102100689

In short: for each value of a categorical feature c, replace the value with the frequency with which the target equals 1:
\frac{\text{number of rows with this value where target}=1}{\text{total number of rows with this value}}
which is simply the bad rate associated with that feature value.

To avoid overfitting, K-fold target encoding splits the samples into K folds; the target encoding of the samples in one fold uses the frequencies computed from the samples of the same category in the other K-1 folds.
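A minimal sketch of K-fold target encoding as just described (the helper name and the usage line are illustrative, not part of the competition code; KFold is already imported above):

def kfold_target_encode(df, col, y, n_splits=5, seed=42):
    # each row is encoded with the bad rate computed on the other K-1 folds,
    # so its own label never leaks into its own encoding
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        rates = y.iloc[fit_idx].groupby(df[col].iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(rates).values
    return encoded.fillna(y.mean())   # unseen categories fall back to the global bad rate

# usage sketch (with the label Series `target` separated in the next cell):
# data_train['grade_te'] = kfold_target_encode(data_train, 'grade', target)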

from category_encoders.target_encoder import TargetEncoder

# separate the label; the Series `target` is reused throughout the rest of the notebook
target = data_train.pop(label)
te = TargetEncoder(cols=category_fea)
train = te.fit_transform(data_train, target)
test = te.transform(data_test_a)

Data exploration

Similar to describe(), but toad's detect function is more comprehensive: it produces statistics not only for numeric variables but also for categorical ones.
It shows the data type, size, missing rate, number of unique values, mean, variance and quantiles (or the most frequent categories for categorical variables).

toad.detect(train)

We can see that policyCode is always 1, so this variable contributes nothing to the classification.
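A quick sketch of how to confirm which columns are constant (policyCode should show up here; the IV filter below removes it anyway):

const_cols = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]
print(const_cols)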

Feature screening

  1. Compute IV values
    In scorecardpy: sc.iv(dt, y, x=None, positive='bad|1', order=True)
    In toad: toad.quality(dataframe, target='target', iv_only=False) returns IV (information value), Gini, entropy and the number of unique values for each feature, sorted by IV in descending order; 'target' is the target variable and 'iv_only' specifies whether to compute IV only.
toad.quality(train, target=target, iv_only=True)
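For reference, IV is built from the WOE of each bin; the standard definitions (consistent with the woebin docstring quoted further below) are:

WOE_i = \ln\frac{Distr\_Bad_i}{Distr\_Good_i}
IV = \sum_i (Distr\_Bad_i - Distr\_Good_i) \cdot WOE_i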

Variable selection

Variables can basically be screened from the following angles: missing rate, single (constant) values, coefficient of variation, stability (PSI), information value (IV), RF/XGBoost feature importance, linear correlation, multicollinearity, stepwise regression, and p-value significance tests.

In scorecardpy: screening by IV, missing rate and identical-value rate

var_filter(dt, y, x=None, iv_limit=0.02, missing_limit=0.95,
identical_limit=0.95, var_rm=None, var_kp=None,
return_rm_reason=False, positive='bad|1'):

 Params
    ------
    dt: A data frame with both x (predictor/feature) and y 
      (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is NULL. If x is NULL, then all 
      variables except y are counted as x variables.
    iv_limit: The information value of kept variables should>=iv_limit. 
      The default is 0.02.
    missing_limit: The missing rate of kept variables should<=missing_limit. 
      The default is 0.95.
    identical_limit: The identical value rate (excluding NAs) of kept 
      variables should <= identical_limit. The default is 0.95.
    var_rm: Name of force removed variables, default is NULL.
    var_kp: Name of force kept variables, default is NULL.
    return_rm_reason: Logical, default is FALSE.
    positive: Value of positive class, default is "bad|1".
    
    Returns
    ------
    DataFrame
        A data.table with y and selected x variables
    Dict(if return_rm_reason == TRUE)
        A DataFrame with y and selected x variables and 
          a DataFrame with the reason of removed x variable.

In toad:

toad.selection.select(dataframe, target=’target’, empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None):

Preliminary feature selection based on missing rate, IV and correlation with other features. The parameters are:
empty=0.9: features with a missing rate above 90% are filtered out;
iv=0.02: features with IV below 0.02 are removed;
corr=0.7: if two or more features have a Pearson correlation above 0.7, the ones with the lower IV are removed;
return_drop=False: if set to True, the function also returns a list of the dropped columns;
exclude=None: a list of features to exclude from the algorithm, typically the ID column and the month column.

Here the toad function is used to screen the variables:

train_selected, dropped = toad.selection.select(train, target=target, empty=0.9, iv=0.02, corr=0.9, return_drop=True)

print("keep:",train_selected.shape[1],
      "drop empty:",len(dropped['empty']),
      "drop iv:",len(dropped['iv']),
      "drop corr:",len(dropped['corr']))

Output

keep: 15 drop empty: 0 drop iv: 24 drop corr: 6
{'empty': array([], dtype=float64),
 'iv': array(['employmentLength', 'purpose', 'postCode', 'regionCode',
        'delinquency_2years', 'openAcc', 'pubRec', 'pubRecBankruptcies',
        'revolBal', 'totalAcc', 'initialListStatus', 'applicationType',
        'policyCode', 'n0', 'n1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n10',
        'n11', 'n12', 'n13'], dtype=object),
 'corr': array(['n9', 'grade', 'n2.1', 'installment', 'ficoRangeHigh',
        'interestRate'], dtype=object)}

Binning

In scorecardpy: decision-tree binning by default

woebin(dt, y, x=None, 
           var_skip=None, breaks_list=None, special_values=None, 
           stop_limit=0.1, count_distr_limit=0.05, bin_num_limit=8, 
           # min_perc_fine_bin=0.02, min_perc_coarse_bin=0.05, max_num_bin=8, 
           positive="bad|1", no_cores=None, print_step=0, method="tree",
           ignore_const_cols=True, ignore_datetime_cols=True, 
           check_cate_num=True, replace_blank=True, 
           save_breaks_list=None, **kwargs):
 WOE Binning
    ------
    `woebin` generates optimal binning for numerical, factor and categorical 
    variables using methods including tree-like segmentation or chi-square 
    merge. woebin can also customizing breakpoints if the breaks_list or 
    special_values was provided.
    
    The default woe is defined as ln(Distr_Bad_i/Distr_Good_i). If you 
    prefer ln(Distr_Good_i/Distr_Bad_i), please set the argument `positive` 
    as negative value, such as '0' or 'good'. If there is a zero frequency 
    class when calculating woe, the zero will replaced by 0.99 to make the 
    woe calculable.
    
    Params
    ------
    dt: A data frame with both x (predictor/feature) and y (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is None. If x is None, 
      then all variables except y are counted as x variables.
    var_skip: Name of variables that will skip for binning. Defaults to None.
    breaks_list: List of break points, default is None. 
      If it is not None, variable binning will based on the 
      provided breaks.
    special_values: the values specified in special_values 
      will be in separate bins. Default is None.
    count_distr_limit: The minimum percentage of final binning 
      class number over total. Accepted range: 0.01-0.2; default 
      is 0.05.
    stop_limit: Stop binning segmentation when information value 
      gain ratio less than the stop_limit, or stop binning merge 
      when the minimum of chi-square less than 'qchisq(1-stoplimit, 1)'. 
      Accepted range: 0-0.5; default is 0.1.
    bin_num_limit: Integer. The maximum number of binning.
    positive: Value of positive class, default "bad|1".
    no_cores: Number of CPU cores for parallel computation. 
      Defaults None. If no_cores is None, the no_cores will 
      set as 1 if length of x variables less than 10, and will 
      set as the number of all CPU cores if the length of x variables 
      greater than or equal to 10.
    print_step: A non-negative integer. Default is 1. If print_step>0, 
      print variable names by each print_step-th iteration. 
      If print_step=0 or no_cores>1, no message is print.
    method: Optimal binning method, it should be "tree" or "chimerge". 
      Default is "tree".
    ignore_const_cols: Logical. Ignore constant columns. Defaults to True.
    ignore_datetime_cols: Logical. Ignore datetime columns. Defaults to True.
    check_cate_num: Logical. Check whether the number of unique values in 
      categorical columns larger than 50. It might make the binning process slow 
      if there are too many unique categories. Defaults to True.
    replace_blank: Logical. Replace blank values with None. Defaults to True.
    save_breaks_list: The file name to save breaks_list. Default is None.
    
    Returns
    ------
    dictionary
        Optimal or customized binning dataframe.

In toad: chi-square binning by default.

toad's binning supports both categorical and numeric variables.
toad.transform.Combiner() is used for training:
1. Initialize: c = toad.transform.Combiner()
2. Fit the bins: c.fit(dataframe, y = 'target', method = 'chi', min_samples = None, n_bins = None, empty_separate = False)
§ y: target variable;
§ method: the binning method. Supports 'chi' (chi-square), 'dt' (decision tree), 'kmeans' (k-means), 'quantile' (equal frequency) and 'step' (equal width);
§ min_samples: a count or a proportion; the minimum number / proportion of samples required in each bucket;
§ n_bins: minimum number of buckets. If the number is too large, the algorithm returns the maximum number of buckets it can get;
§ empty_separate: whether to keep missing values in a separate bucket. If False, missing values are put in the bucket with the closest bad rate.
3. Export the binning result: c.export()
4. Adjust the bins: c.set_rules(dict)
5. Apply the bins and convert to discrete values: c.transform(dataframe, labels=False)
§ labels: whether to convert the data to explanatory labels. Returns 0, 1, 2 ... when False (categorical features are sorted in descending order of proportion), and (-inf, 0], (0, 10], (10, inf) when True.

Note: 1. remember to exclude the unwanted columns, especially the ID column and the timestamp column. 2. Columns with a large number of unique values may take a long time to train.
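A minimal usage sketch of the Combiner API described above (this path is not actually taken in this post, since the WOE step below uses scorecardpy; 'dti' is only an example feature):

# chi-square binning on the selected features; assumes the label has been
# joined back as an 'isDefault' column, as is done further below
df_bin = pd.concat([train_selected, target.rename('isDefault')], axis=1)
c = toad.transform.Combiner()
c.fit(df_bin, y='isDefault', method='chi', min_samples=0.05)   # at least 5% of samples per bucket
print(c.export()['dti'])                       # split points for one feature
df_binned = c.transform(df_bin, labels=True)   # human-readable interval labels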

Both packages offer a way to adjust the bins:
scorecardpy: sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)
toad: c.set_rules(dict)

Here scorecardpy is used for the WOE binning:

train_selected = pd.concat([train_selected, target.rename('isDefault')], axis=1) 
bins = sc.woebin(train_selected, y="isDefault")

Visualize the bins and check their monotonicity:

sc.woebin_plot(bins)

A few examples:

(example bin plots omitted here)

At this point the monotonicity of each variable should be inspected and the bins adjusted accordingly (not done in this post).
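A sketch of what such an adjustment could look like with scorecardpy's breaks_list (the variable and split points below are made up purely for illustration):

# manually chosen break points to force monotonic bins for one variable
breaks_adj = {
    'dti': [10, 20, 30],   # hypothetical split points
}
bins_adj = sc.woebin(train_selected, y="isDefault", breaks_list=breaks_adj)
sc.woebin_plot(bins_adj)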

Then convert both the training set and the test set to their WOE encodings:

# test_a_selected: the test set restricted to the same selected feature columns,
# e.g. test_a_selected = test[train_selected.columns.drop('isDefault')]
train_woe = sc.woebin_ply(train_selected, bins)
test_a_woe = sc.woebin_ply(test_a_selected, bins)

Model training

# breaking dt into train and val
train, val = sc.split_df(train_woe, 'isDefault').values()

y_train = train.loc[:,'isDefault']
X_train = train.loc[:,train.columns != 'isDefault']
y_val = val.loc[:,'isDefault']
X_val = val.loc[:,val.columns != 'isDefault']

# logistic regression ------
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr.fit(X_train, y_train)
# lr.coef_
# lr.intercept_

# predicted probability
train_pred = lr.predict_proba(X_train)[:,1]
val_pred = lr.predict_proba(X_val)[:,1]

Check the AUC and KS on the training and validation sets:

train_perf = sc.perf_eva(y_train, train_pred, title = "train")
val_perf = sc.perf_eva(y_val, val_pred, title = "val")

Both KS and AUC are within a reasonable range, and the validation set performs very close to the training set, which suggests the model is quite stable.

Scorecard

Convert the variable bins into scores:

card = sc.scorecard(bins, lr, xcolumns = X_train.columns)
{'basepoints':      variable  bin  points
 0  basepoints  NaN   488.0,
 'n14':    variable         bin  points
 26      n14  [-inf,1.0)     9.0
 27      n14   [1.0,3.0)     3.0
 28      n14   [3.0,5.0)    -6.0
 29      n14   [5.0,inf)   -12.0,
 'employmentTitle':            variable                  bin  points
 30  employmentTitle      [-inf,200000.0)    -0.0
 31  employmentTitle  [200000.0,240000.0)     1.0
 32  employmentTitle  [240000.0,310000.0)     1.0
 33  employmentTitle       [310000.0,inf)     0.0,
 'earliesCreditLine':             variable                                        bin  points
 0  earliesCreditLine                 [-inf,0.17999999999999997)     6.0
 1  earliesCreditLine  [0.17999999999999997,0.19999999999999996)     2.0
 2  earliesCreditLine  [0.19999999999999996,0.20999999999999996)    -1.0
 3  earliesCreditLine  [0.20999999999999996,0.22999999999999995)    -4.0
 4  earliesCreditLine                  [0.22999999999999995,inf)    -8.0,
 'homeOwnership':         variable         bin  points
 5  homeOwnership  [-inf,1.0)    12.0
 6  homeOwnership   [1.0,2.0)   -13.0
 7  homeOwnership   [2.0,inf)    -3.0,
 'verificationStatus':               variable         bin  points
 8   verificationStatus  [-inf,1.0)     7.0
 9   verificationStatus   [1.0,2.0)    -1.0
 10  verificationStatus   [2.0,inf)    -5.0,
 'revolUtil':      variable          bin  points
 34  revolUtil  [-inf,20.0)     1.0
 35  revolUtil  [20.0,35.0)     0.0
 36  revolUtil  [35.0,55.0)     0.0
 37  revolUtil  [55.0,75.0)    -0.0
 38  revolUtil   [75.0,inf)    -0.0,
 'annualIncome':         variable                 bin  points
 39  annualIncome      [-inf,45000.0)   -14.0
 40  annualIncome   [45000.0,65000.0)    -6.0
 41  annualIncome   [65000.0,75000.0)     0.0
 42  annualIncome  [75000.0,105000.0)     8.0
 43  annualIncome      [105000.0,inf)    20.0,
 'title':    variable         bin  points
 11    title  [-inf,4.0)     0.0
 12    title   [4.0,5.0)    -0.0
 13    title   [5.0,6.0)    -0.0
 14    title  [6.0,20.0)     0.0
 15    title  [20.0,inf)    -0.0,
 'loanAmnt':     variable                bin  points
 44  loanAmnt      [-inf,4000.0)    17.0
 45  loanAmnt   [4000.0,10000.0)    11.0
 46  loanAmnt  [10000.0,16000.0)    -2.0
 47  loanAmnt      [16000.0,inf)    -9.0,
 'n2':    variable         bin  points
 52       n2  [-inf,4.0)     7.0
 53       n2   [4.0,6.0)     3.0
 54       n2   [6.0,9.0)    -3.0
 55       n2   [9.0,inf)   -11.0,
 'issueDate':      variable                                        bin  points
 48  issueDate                 [-inf,0.17999999999999994)    24.0
 49  issueDate  [0.17999999999999994,0.19999999999999996)     3.0
 50  issueDate  [0.19999999999999996,0.21999999999999995)    -4.0
 51  issueDate                  [0.21999999999999995,inf)   -18.0,
 'subGrade':     variable                        bin  points
 16  subGrade                 [-inf,0.1)    64.0
 17  subGrade                  [0.1,0.2)    19.0
 18  subGrade  [0.2,0.30000000000000004)   -13.0
 19  subGrade  [0.30000000000000004,0.4)   -34.0
 20  subGrade                  [0.4,inf)   -54.0,
 'dti':    variable          bin  points
 21      dti  [-inf,14.0)     8.0
 22      dti  [14.0,21.0)     2.0
 23      dti  [21.0,25.0)    -3.0
 24      dti  [25.0,30.0)    -8.0
 25      dti   [30.0,inf)   -14.0,
 'term':    variable         bin  points
 56     term  [-inf,5.0)    11.0
 57     term   [5.0,inf)   -26.0,
 'ficoRangeLow':         variable            bin  points
 58  ficoRangeLow   [-inf,685.0)    -7.0
 59  ficoRangeLow  [685.0,710.0)     0.0
 60  ficoRangeLow  [710.0,740.0)     9.0
 61  ficoRangeLow  [740.0,760.0)    18.0
 62  ficoRangeLow    [760.0,inf)    25.0}

Model validation

Compute the score for each sample, and check the stability (PSI) of the scores between the training set and the validation set:
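For reference, PSI compares the score distributions of the two sets over buckets; the standard definition (not specific to scorecardpy) is:

PSI = \sum_i (Actual\%_i - Expected\%_i) \cdot \ln\frac{Actual\%_i}{Expected\%_i}

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.25 deserves attention, and above 0.25 is unstable.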

train_data = train_selected.loc[train.index].drop(columns=['isDefault'])
val_data = train_selected.loc[val.index].drop(columns=['isDefault'])
# credit score
train_score = sc.scorecard_ply(train_data, card, print_step=0)
val_score = sc.scorecard_ply(val_data, card, print_step=0)
# psi
sc.perf_psi(
  score = {'train':train_score, 'test':val_score},
  label = {'train':y_train, 'test':y_val}
)

This result looks a bit off, though...

Predict on the test set

lr2 = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr2.fit(X_train, y_train)

# predicted probability
test_pred = lr2.predict_proba(test_a_woe)[:,1]
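A minimal sketch of writing out a submission file (the column names and file name required by the competition are assumptions here):

# save predictions for test set A; column names are assumed
submission = pd.DataFrame({'id': data_test_a.index, 'isDefault': test_pred})
submission.to_csv('submission.csv', index=False)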

The online (leaderboard) score was 0.7113, fairly close to the training-set 0.7141.
