Preface

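This was my first competition on Alibaba Cloud Tianchi: "Zero-Basics Introduction to Financial Risk Control: Loan Default Prediction". The task is set in the context of personal credit in financial risk control. Participants use a loan applicant's data to predict whether the applicant is likely to default, which in turn decides whether the loan is approved. It is a typical classification problem.
The final score (AUC) was 0.7287.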

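The competition is well worth doing, mainly because the forum has articles by experienced participants that are comprehensive and detailed, and working through them in order is very rewarding. A good starting point is "Datawhale Zero-Basics Introduction to Financial Risk Control, Task 1: Understanding the Competition Task".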

1 Data analysis + visualization

Variable information

I used to do visualization with Excel plus plots drawn in Python. Now there is an incredibly powerful library and visualization tool: pandas_profiling!

import pandas_profiling

def profiling(train,test):
    # Split the training set by label so the two classes can be compared
    train_y0 = train[train['isDefault'] == 0]
    train_y1 = train[train['isDefault'] == 1]
    # One HTML report per class
    pfr_y1 = pandas_profiling.ProfileReport(train_y1)
    pfr_y1.to_file("./train_y1.html")
    pfr_y0 = pandas_profiling.ProfileReport(train_y0)
    pfr_y0.to_file("./train_y0.html")
    # Reports for the full training set and the test set
    pfr = pandas_profiling.ProfileReport(train)
    pfr.to_file("./train.html")
    pfr_y = pandas_profiling.ProfileReport(test)
    pfr_y.to_file("./test.html")

1.1 Overview

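From the Overview we can see that the training set has 800,000 rows, with missing values making up 1.8% of all cells. There are 47 variables (including the target): 36 numeric, 7 categorical, and 4 boolean.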
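
The headline numbers in the Overview can also be recomputed directly with pandas, which is a handy cross-check when generating the full report is slow. A minimal sketch on a tiny made-up frame; the helper name `overview` and the toy values are assumptions for illustration and are not part of the original code:

```python
import numpy as np
import pandas as pd

def overview(df):
    """Return row count, column count, and the overall share of missing cells."""
    n_rows, n_cols = df.shape
    missing_share = df.isna().sum().sum() / (n_rows * n_cols)
    return {"rows": n_rows, "columns": n_cols, "missing_share": missing_share}

# Toy stand-in for the training set (hypothetical values).
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, np.nan, 4000.0],
    "grade": ["A", "B", "B", None],
    "isDefault": [0, 1, 0, 0],
})
summary = overview(toy)
print(summary)
```

On the real data, `overview(train)` would be expected to line up with the row count and missing-cell share that the report shows.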

1.2 Warnings

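Constant means the variable takes only one value; High cardinality means a categorical feature with a large number of distinct levels; High correlation means the feature is highly correlated with another one.

Missing means the variable has missing values.

Skewed means a skewed distribution; Unique means every value is distinct; Zeros means most values of the variable are 0.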
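
To make these warning types concrete, here is a rough sketch that flags a few of them with plain pandas. The function name `flag_columns`, the thresholds, and the toy columns are all assumptions chosen for illustration; they are not the rules pandas_profiling applies internally.

```python
import numpy as np
import pandas as pd

def flag_columns(df, high_card=50, zero_share=0.5):
    """Attach simple warning labels to every column of df."""
    flags = {}
    for col in df.columns:
        s = df[col]
        found = []
        n_unique = s.nunique(dropna=True)
        if n_unique == 1:
            found.append("Constant")          # a single distinct value
        if n_unique == len(s):
            found.append("Unique")            # every row has its own value
        if s.dtype == object and n_unique > high_card:
            found.append("High cardinality")  # many distinct text levels
        if s.isna().any():
            found.append("Missing")
        if pd.api.types.is_numeric_dtype(s) and (s == 0).mean() > zero_share:
            found.append("Zeros")             # mostly zeros
        flags[col] = found
    return flags

# Hypothetical toy data.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "const_col": [1, 1, 1, 1],
    "pubRec": [0, 0, 0, 2],
    "dti": [10.5, np.nan, 7.2, 7.2],
})
flags = flag_columns(toy)
print(flags)
```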

1.3 Variables

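Taking loanAmnt as an example, we can view all kinds of statistics for a single variable, such as the minimum and maximum, mean, median, and standard deviation. We can also look at Common values and Extreme values, i.e. the most frequent values and the smallest and largest ones.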
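
The same per-variable statistics are available without the report. A small sketch with made-up loan amounts (the numbers are hypothetical, only the column name comes from the dataset):

```python
import pandas as pd

loan = pd.Series([1000, 1000, 1000, 5000, 12000, 40000], name="loanAmnt")

stats = loan.describe()                 # count, mean, std, min, quartiles, max
common = loan.value_counts().head(3)    # most frequent values ("Common values")
smallest = loan.nsmallest(2)            # low end of "Extreme values"
largest = loan.nlargest(2)              # high end of "Extreme values"

print(stats)
print(common)
```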

2 Feature engineering

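I split the features into groups by hand, because some numeric-looking features are actually unordered categorical variables (postCode, for example) and are hard to tell apart in code.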

    numerrical = ['loanAmnt','interestRate','installment','annualIncome','dti',
                  'delinquency_2years','ficoRangeHigh','ficoRangeLow','openAcc',
                  'pubRec','pubRecBankruptcies','revolBal','revolUtil','totalAcc']
    nominal = ['term','employmentTitle','homeOwnership','verificationStatus',
               'purpose','postCode','regionCode','initialListStatus','applicationType',
               'title','n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n11','n12',
               'n13','n14','id']
    ordinal = ['grade','subGrade','employmentLength','earliesCreditLine','issueDate']
    y = ['isDefault']

numerrical: numeric features
nominal: unordered categorical features
ordinal: ordered categorical features
y: the target to predict.

2.1 Extracting new features

① Debt-related: new features are derived by combining the financial fields annualIncome, installment, loanAmnt, and dti.

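    # Ratios between income, loan amount, and the periodic installment
    x['Income_installment']=round(x.loc[:,'annualIncome']/x.loc[:,'installment'],2)
    x['loanAmnt_installment']=round(x.loc[:,'loanAmnt']/x.loc[:,'installment'],2)
    # Estimated debt: annual income times the debt-to-income ratio
    x['debt']=round(x.loc[:,'annualIncome']*x.loc[:,'dti'],2)
    # Note: as written, annualIncome / debt reduces to 1 / dti; given the feature
    # name, loanAmnt / debt may have been the intent
    x['loanAmnt_debt']=round(x.loc[:,'annualIncome']/x.loc[:,'debt'],2)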
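
One thing to watch with ratio features like these is a zero denominator: pandas does not raise on float division by zero but returns infinity, which many estimators reject. A minimal sketch of a guarded ratio; the helper `safe_ratio` and the toy values are hypothetical and not part of the original pipeline:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, decimals=2):
    """Divide two columns and turn division-by-zero results into NaN."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).round(decimals)

toy = pd.DataFrame({
    "annualIncome": [60000.0, 48000.0, 30000.0],
    "installment": [500.0, 0.0, 250.0],   # the zero is made up to show the edge case
})
toy["Income_installment"] = safe_ratio(toy["annualIncome"], toy["installment"])
print(toy)
```

Leaving the result as NaN fits the approach taken later in this post, where missing values are handed to the model as they are.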

② fico: the mean of the FICO range.

    x['fico']=(x.loc[:,'ficoRangeHigh']+x.loc[:,'ficoRangeLow'])*0.5

③ employmentLength: extract the number of years of employment (converted to a continuous variable).

    def employmentLength_to_int(s):
        # Keep missing values as they are; otherwise take the leading number of years
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])
    # Assign the result back rather than calling replace(..., inplace=True) on a column selection
    x["employmentLength"] = x["employmentLength"].replace({"10+ years": "10 years", "< 1 year": "0 years"})
    x['employmentLength'] = x.loc[:,"employmentLength"].apply(employmentLength_to_int)
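
The same conversion can be written in a vectorized way with a regular expression. This is only an alternative sketch (the function name `employment_years` is made up), assuming the strings look like "10+ years", "< 1 year", or "3 years":

```python
import pandas as pd

def employment_years(series):
    """Map strings such as '10+ years' or '< 1 year' to numbers; missing stays NaN."""
    cleaned = series.replace({"< 1 year": "0 years"})
    # Pull out the first run of digits and convert it to float
    return cleaned.str.extract(r"(\d+)", expand=False).astype(float)

raw = pd.Series(["10+ years", "< 1 year", "3 years", None, "1 year"])
years = employment_years(raw)
print(years.tolist())
```

The result is a float column, so missing entries can stay as NaN instead of forcing an integer type.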

④ CreditLine: the time from opening the first credit line to this loan, i.e. the age of the credit account.

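    # Keep only the year of each date
    x['issueDate'] = x.loc[:,"issueDate"].apply(lambda s: int(s[:4]))
    x['earliesCreditLine'] = x.loc[:,'earliesCreditLine'].apply(lambda s: int(s[-4:]))
    # Loan year minus the year the first credit line was opened, so the age is positive
    x['CreditLine'] = x.loc[:,'issueDate'] - x.loc[:,'earliesCreditLine']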
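
If finer resolution than whole years is wanted, the two columns can be parsed as dates first. A sketch under the assumption that issueDate looks like "2014-07-01" and earliesCreditLine like "Aug-2001" (consistent with the string slicing used here, but still an assumption); the column `CreditLineMonths` is a hypothetical extra feature:

```python
import pandas as pd

toy = pd.DataFrame({
    "issueDate": ["2014-07-01", "2016-03-01"],
    "earliesCreditLine": ["Aug-2001", "May-1998"],
})

issued = pd.to_datetime(toy["issueDate"], format="%Y-%m-%d")
# %b parses abbreviated month names and depends on an English locale
opened = pd.to_datetime(toy["earliesCreditLine"], format="%b-%Y")

# Age of the credit history in whole calendar years and in months
toy["CreditLine"] = issued.dt.year - opened.dt.year
toy["CreditLineMonths"] = ((issued.dt.year - opened.dt.year) * 12
                           + (issued.dt.month - opened.dt.month))
print(toy)
```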

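Finally, add the new features and, following the earlier Warnings, drop highly correlated features (High correlation), unique-valued features (Unique), constant features (Constant), and the old features that were used to build the new ones.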

    # Parentheses let the expression continue onto the next line
    numerrical=(list(set(numerrical) - {'ficoRangeHigh', 'ficoRangeLow'}) +
                ['Income_installment','loanAmnt_installment','loanAmnt_debt','fico'])
    nominal=list(set(nominal)-{'id','n10', 'n2'})
    ordinal=list(set(ordinal) - {'grade', 'earliesCreditLine', 'issueDate'}) + ['CreditLine']
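
The High correlation pairs can also be listed programmatically instead of being read off the report. A small sketch: the helper `correlated_pairs`, the 0.9 threshold, and the toy values are assumptions for illustration only.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # visit each unordered pair once
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return pairs

# Hypothetical values: the two FICO bounds move together, dti does not.
toy = pd.DataFrame({
    "ficoRangeLow": [660, 700, 720, 680, 750],
    "ficoRangeHigh": [664, 704, 724, 684, 754],
    "dti": [12.0, 30.5, 8.1, 22.4, 15.0],
})
pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns (or a combination such as their mean) is usually enough.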

2.3 Feature encoding

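I had settled on an XGB model before deciding on the encoding, so I looked up encoding methods that suit that model.
I followed two summaries of encoding methods: "Handling categorical features in XGBoost" and "A summary of categorical feature encoding on Kaggle".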

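from category_encoders import WOEEncoder, OneHotEncoder, CatBoostEncoder, TargetEncoder

def Category_Encoders(train_x, train_y, test_x, vel_x):
    for col in nominal:
        distinct = train_x[col].nunique()
        if distinct <= 2:
            # Columns with at most two levels are left unencoded
            continue
        if distinct < 4:
            # Three levels: one-hot encoding. The encoder returns several columns,
            # so they replace the original column instead of being assigned to it.
            enc = OneHotEncoder(handle_missing='indicator').fit(train_x[[col]], train_y)
            train_x = train_x.drop(columns=col).join(enc.transform(train_x[[col]]))
            test_x = test_x.drop(columns=col).join(enc.transform(test_x[[col]]))
            vel_x = vel_x.drop(columns=col).join(enc.transform(vel_x[[col]]))
            continue
        # Four or more levels: target-based encoding, fitted on the training set only
        # enc = WOEEncoder().fit(train_x[col], train_y)
        # enc = TargetEncoder().fit(train_x[col],train_y)
        enc = CatBoostEncoder().fit(train_x[col],train_y)
        train_x[col] = enc.transform(train_x[col])
        test_x[col] = enc.transform(test_x[col])
        vel_x[col] = enc.transform(vel_x[col])

    return train_x, test_x, vel_x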
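
To show what target-based encoders do in principle, here is a hand-written smoothed mean encoding in plain pandas. It is a simplified illustration, not the algorithm CatBoostEncoder or TargetEncoder implements; the function names, the smoothing value, and the toy data are all assumptions.

```python
import pandas as pd

def fit_target_encoding(values, target, smoothing=10.0):
    """Learn a smoothed mean of the target for every category level."""
    prior = target.mean()
    stats = target.groupby(values).agg(["sum", "count"])
    # Levels with few rows are pulled toward the global mean (the prior)
    mapping = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
    return mapping, prior

def apply_target_encoding(values, mapping, prior):
    """Replace levels by their learned value; unseen levels get the prior."""
    return values.map(mapping).fillna(prior)

train = pd.DataFrame({
    "purpose": ["car", "car", "house", "house", "house", "car"],
    "isDefault": [1, 0, 0, 0, 1, 1],
})
test = pd.DataFrame({"purpose": ["car", "boat"]})

mapping, prior = fit_target_encoding(train["purpose"], train["isDefault"], smoothing=2.0)
test["purpose_te"] = apply_target_encoding(test["purpose"], mapping, prior)
print(test)
```

The mapping is learned from the training rows only and then applied to the other sets, which is the same fit-then-transform pattern used with the encoders above.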

2.4 Missing values and binning

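Missing values are left untouched for now: I treat missingness as a kind of information and let the xgb model handle it.
As for binning, I have not come up with a good scheme yet, so that is left to the xgb model as well.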
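
For reference, one common starting point for binning is equal-frequency (quantile) binning with `pd.qcut`. This is only a sketch of that option, not something used in this solution; the income values and the choice of four bins are made up.

```python
import pandas as pd

income = pd.Series(
    [20000, 35000, 50000, 65000, 80000, 120000, 250000, 1000000],
    name="annualIncome",
)

# Four equal-frequency bins labeled 0..3; retbins also returns the bin edges
income_bin, edges = pd.qcut(income, q=4, labels=False, retbins=True)
print(income_bin.tolist())
print(edges)
```

Tree splits depend only on the ordering of a feature's values, so binning mostly coarsens the candidate split points, which is one reason skipping it is a reasonable default with a tree model.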

3 Training the model

For the model I naturally chose XGBClassifier, which has yet to let me down.

3.1 Hyperparameter tuning

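GridSearchCV used to feel great, but on this large dataset of 800,000 rows it was unbearably slow!
In short, when there are many parameters to tune or the dataset is large, Bayesian optimization is the way to go; see "Bayesian optimization tuning example code (xgboost, lgbm)".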

import time

from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def BO_xgb(x,y):
    t1=time.perf_counter()  # time.clock() was removed in Python 3.8

    # Objective for the optimizer: mean 5-fold cross-validated AUC
    def xgb_cv(max_depth,gamma,min_child_weight,max_delta_step,subsample,colsample_bytree):
        paramt={'booster': 'gbtree',
                'max_depth': int(max_depth),
                'gamma': gamma,
                'eta': 0.1,
                'objective': 'binary:logistic',
                'nthread': 4,
                'eval_metric': 'auc',
                'subsample': max(min(subsample, 1), 0),
                'colsample_bytree': max(min(colsample_bytree, 1), 0),
                'min_child_weight': min_child_weight,
                'max_delta_step': int(max_delta_step),
                'seed': 1001}
        model=XGBClassifier(**paramt)
        res = cross_val_score(model,x, y, scoring='roc_auc', cv=5).mean()
        return res
    # Search range for each tuned parameter
    cv_params ={'max_depth': (5, 12),
                'gamma': (0.001, 10.0),
                'min_child_weight': (0, 20),
                'max_delta_step': (0, 10),
                'subsample': (0.4, 1.0),
                'colsample_bytree': (0.4, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv,cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)

    t2=time.perf_counter()
    print('Elapsed time:',(t2-t1))

    return xgb_op.max
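
The optimizer proposes continuous values for every parameter, which is why the objective casts some of them to int and clamps the sampling ratios. The conversion step in isolation, as a small sketch (the helper `to_xgb_params` and the sample proposal are hypothetical):

```python
def to_xgb_params(max_depth, max_delta_step, subsample, colsample_bytree):
    """Convert continuous optimizer proposals into valid XGBoost settings."""
    def clamp(value, low=0.0, high=1.0):
        return max(min(value, high), low)

    return {
        "max_depth": int(max_depth),            # int() truncates: 9.87 becomes 9
        "max_delta_step": int(max_delta_step),
        "subsample": clamp(subsample),          # ratios must stay within [0, 1]
        "colsample_bytree": clamp(colsample_bytree),
    }

proposal = to_xgb_params(max_depth=9.87, max_delta_step=4.2,
                         subsample=1.03, colsample_bytree=0.4354)
print(proposal)
```

Because `int()` truncates, the values reported by the optimizer need the same conversion before they are copied into the final model.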

We tune six parameters: 'max_depth', 'gamma', 'min_child_weight', 'max_delta_step', 'subsample', and 'colsample_bytree', and then set 'n_estimators': 1000 and 'learning_rate': 0.02.
The best parameters found were:

'booster': 'gbtree','eta': 0.1,'nthread': 4,'eval_metric': 'auc','objective': 'binary:logistic',
                    'colsample_bytree': 0.4354, 'gamma': 9.888, 'max_delta_step': 4,'n_estimators':1000,'learning_rate':0.02,
                    'max_depth': 10, 'min_child_weight': 3.268, 'subsample': 0.7157

3.2 ROC visualization

Let's look at the ROC curves for the prediction set and the training set in turn.

import matplotlib.pyplot as plt
from sklearn import metrics

def roc(m,x,y,name):
    # Predict and compute the ROC metrics
    y_pred = m.predict_proba(x)[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y, y_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print(name+' AUC: {}'.format(roc_auc))
    # Plot the ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, 'b', label = name + ' AUC = %0.4f' % roc_auc)
    plt.ylim(0,1)
    plt.xlim(0,1)
    plt.legend(loc='best')
    plt.title(name + ' ROC')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # Draw the diagonal (a random classifier)
    plt.plot([0,1],[0,1],'r--')
    plt.show()
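
As a reminder of what the AUC number means: it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. A brute-force sketch of that definition (the function name and the toy scores are made up, and the double loop is only practical for small inputs):

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, y_score)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the result is 0.75.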

4 Submitting the results

def prediction(m,x):
    submit=pd.read_csv('sample_submit.csv')
    y_pred = m.predict_proba(x)[:,1]
    submit['isDefault'] = y_pred
    submit.to_csv('prediction.csv', index=False)

Final result: ROC AUC = 0.7287, rank 357.

Summary

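As a credit management major, coming to Tianchi to work on loan default prediction tied in nicely with my field.
The biggest difficulty this time was tuning. On a dataset this large, hyperparameter tuning is painfully slow. I was used to re-tuning after every feature improvement, but now even Bayesian optimization takes more than an hour, and that is with n_estimators at its default of 100 (the larger n_estimators is, the longer it takes).
Finally, there is probably still room to improve this model through tuning, and I need to keep learning.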

