
1. Problem to solve

This project applies machine learning to historical credit card transaction data to build an anti-fraud prediction model that detects fraudulent use of a customer's card as early as possible.

2. Modeling approach

[Figure: modeling workflow diagram]

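3. Project background

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.
It contains only numerical input variables that are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise. The description above is taken from the Kaggle page for this dataset.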
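
To make the imbalance concrete, the short sketch below recomputes the fraud share from the two counts quoted above and compares it with a hypothetical "model" that labels every transaction as normal. The variable names are illustrative only; the point is that plain accuracy looks excellent even when no fraud is caught, which is why the rest of the post focuses on recall.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_fraud = 492

# Share of fraudulent transactions in the whole dataset.
fraud_rate = n_fraud / n_transactions

# Accuracy of a hypothetical classifier that always predicts "normal" (Class 0).
baseline_accuracy = (n_transactions - n_fraud) / n_transactions
# Recall of that same classifier: it never flags a fraud.
baseline_recall = 0 / n_fraud

print(f"fraud rate: {fraud_rate:.3%}")
print(f"accuracy of always predicting 'normal': {baseline_accuracy:.3%}")
print(f"recall of always predicting 'normal': {baseline_recall:.0%}")
```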

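4. Scenario analysis (algorithm selection)

The task is to learn from historical records and predict whether a cardholder's transaction is fraudulent. This is a binary-classification, supervised-learning scenario, so the logistic regression algorithm is chosen.
Looking at the data: it is structured, so no feature abstraction is needed. Features V1 to V28 have already been processed with PCA, while Time and Amount are on very different scales from the other features, so they need feature scaling to bring all features onto a comparable scale.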
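
As a rough illustration of the scaling step, the sketch below standardizes two columns by hand with NumPy: subtract each column's mean and divide by its standard deviation. The numbers are made up and merely stand in for Time and Amount. This is the same z-score that scikit-learn's StandardScaler computes, using the population standard deviation.

```python
import numpy as np

# Made-up values: column 0 plays the role of Time (seconds),
# column 1 the role of Amount.
raw = np.array([[0.0, 2.5],
                [3600.0, 120.0],
                [7200.0, 15.0],
                [86400.0, 999.0]])

# Standardize each column: z = (x - mean) / std.
mean = raw.mean(axis=0)
std = raw.std(axis=0)  # population standard deviation (ddof=0)
scaled = (raw - mean) / std

print(scaled.mean(axis=0))  # each column now has mean close to 0
print(scaled.std(axis=0))   # and standard deviation close to 1
```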

5. Imports

# numpy, pandas, matplotlib
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
# algorithms, data processing, model evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
# class-imbalance handling: over-sampling tool from imbalanced-learn
from imblearn.over_sampling import SMOTE
import itertools

6. Loading the data

data_cr=pd.read_csv(r"./creditcard.csv")  # read the data
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # display floats with 3 decimal places
data_cr.head()  # show the first 5 rows for a quick look at the data

7. Checking for missing values

data_cr.info()
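
info() reports the non-null count per column, so a column with missing values shows a count lower than the number of rows. A more direct check is to count nulls explicitly, as in the sketch below. The tiny frame toy is made up and only stands in for data_cr.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for data_cr; one Amount value is missing.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, np.nan, 378.66],
    "Class": [0, 0, 1],
})

missing_per_column = toy.isnull().sum()            # number of NaN values per column
has_missing = bool(toy.isnull().values.any())      # True if any value is missing

print(missing_per_column)
print("any missing values:", has_missing)
```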

8. Feature engineering

# class balance of positive and negative samples
# visualize the target variable
fig,ax=plt.subplots(1,2,figsize=(12,8))
plt.style.use('ggplot')
data_cr["Class"].value_counts().plot(kind="bar",ax=ax[0],fontsize=23)
ax[0].set_title("Frequency of each class in the target variable (bar chart)")
data_cr["Class"].value_counts().plot(kind="pie",ax=ax[1],fontsize=23,autopct='%1.2f%%')  # percentages with 2 decimal places
ax[1].set_title("Proportion of each class in the target variable (pie chart)")

[Figure: bar chart and pie chart of the class distribution]

# feature transformation: convert Time from seconds to whole hours
data_cr["Time"]=data_cr["Time"].apply(lambda x: divmod(x,3600)[0])
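
The conversion above keeps only the quotient of dividing by 3600, which is floor division. The sketch below checks this on a few made-up elapsed times. Since the dataset spans two days, the resulting hour index runs from 0 up to 47 rather than wrapping around at 24.

```python
# Made-up elapsed times in seconds, spanning roughly two days.
seconds = [0, 3599, 3600, 86_399, 172_792]

hours = [divmod(s, 3600)[0] for s in seconds]   # form used in the post
hours_floor = [s // 3600 for s in seconds]      # equivalent floor division

print(hours)
```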

# feature selection
v_feature=data_cr.iloc[:,1:29].columns  # names of features V1 to V28
plt.figure(figsize=(16,28*5))
gs=gridspec.GridSpec(28,1)
# drawing these plots can take about 10 minutes
for i,cn in enumerate(v_feature):
    ax=plt.subplot(gs[i])
    # density=True is the current name of the older normed=True argument
    ax.hist(data_cr[cn][data_cr["Class"]==1],bins=50,density=True)
    ax.hist(data_cr[cn][data_cr["Class"]==0],bins=100,density=True)
    data_cr[cn][data_cr["Class"]==1].plot(kind = 'kde',ax = ax)
    data_cr[cn][data_cr["Class"]==0].plot(kind = 'kde',ax = ax)
    ax.set_title("Histogram of "+str(cn))

# keep the variables whose distributions differ clearly between fraudulent and normal transactions,
# and therefore drop V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28
droplist=["V8","V13","V15","V20","V21","V22","V23","V24","V25","V26","V27","V28"]  # drop the 12 weakly discriminating columns, leaving 19 columns
data_new=data_cr.drop(droplist,axis=1)

# feature scaling
# Amount and Time have value ranges far larger than the other variables, so they are standardized
col=["Amount","Time"]
from sklearn.preprocessing import StandardScaler
data_new[col]=StandardScaler().fit_transform(data_new[col])
data_new.head()

# rank the features by importance to reduce the number of variables further
# build the X and Y variables
x_val=data_new.iloc[:,:-1]
y_val=data_new.iloc[:,-1]
# rank feature importance with a gradient boosting decision tree (GBDT)
from sklearn.ensemble import GradientBoostingClassifier as GBDT
clf=GBDT()
clf.fit(x_val,y_val)
# visualize the ranking
# plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12,6)
importance=clf.feature_importances_
feature_name=data_new.columns[:-1]
indices=np.argsort(importance)[::-1]  # feature indices from most to least important
fig = plt.figure(figsize=(20,6))
plt.title("Feature importances by GBDT classifier")
plt.bar(range(len(importance)),importance[indices],color="blue",align="center")
plt.xticks(range(len(importance)),feature_name[indices],rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])

[Figure: feature importances ranked by the GBDT classifier]

# drop the less important features
droplist1=['V16','Time','V7','V5','V4','V19','V11','V1','Amount']
data_new1=data_new.drop(droplist1,axis=1)

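9. Model training

Logistic regression
SMOTE (Synthetic Minority Oversampling Technique) works as follows: using a nearest-neighbour search, it finds the K nearest neighbours of each minority-class sample, randomly picks N of those neighbours, and creates new minority samples by random linear interpolation between the sample and each picked neighbour. The synthetic samples are then merged with the original data to form a new training set.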
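
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function name smote_sketch, its parameters and the toy points are all hypothetical, and each synthetic point is built from one randomly chosen minority sample and one of its k nearest minority neighbours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))               # pick a minority sample
        j = rng.choice(neighbours[i])              # pick one of its neighbours
        gap = rng.random()                         # interpolation factor in [0, 1)
        # New point lies on the segment between sample i and neighbour j.
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Made-up minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_sketch(X_min, n_new=10)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.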

# SMOTE over-sampling
# rebuild the X and Y variables from the reduced feature set
x_all=data_new1.iloc[:,:-1]
y_all=data_new1.iloc[:,-1]

X_train,X_test,Y_train,Y_test=train_test_split(x_all,y_all,test_size=0.3)
n_samples=len(Y_train)
pos_samples=Y_train[Y_train==1].shape[0]
print("Fraud share before over-sampling: {:.2%}".format(pos_samples/n_samples))
# over-sample only the training set so that the test set keeps the real class distribution
X_train_new,Y_train_new=SMOTE(random_state=12).fit_resample(X_train,Y_train)  # older imbalanced-learn releases call this fit_sample
n_samples_new=len(Y_train_new)
pos_samples_new=Y_train_new[Y_train_new==1].shape[0]
print("Fraud share after over-sampling: {:.2%}".format(pos_samples_new/n_samples_new))

# pie charts of the class percentages
fig,ax=plt.subplots(1,2,figsize=(12,8))
plt.style.use('seaborn-darkgrid')  # named 'seaborn-v0_8-darkgrid' in matplotlib 3.6 and later
data_cr["Class"].value_counts().plot(kind="pie",ax=ax[0],fontsize=23,autopct='%1.2f%%')
ax[0].set_title("Class proportions before SMOTE")
pd.Series(Y_train_new).value_counts().plot(kind="pie",ax=ax[1],fontsize=23,autopct='%1.2f%%')  # percentages with 2 decimal places
ax[1].set_title("Class proportions after SMOTE")
ax[1].set_ylabel("Class")
plt.savefig("./smote.jpg")

[Figure: class proportions before and after SMOTE over-sampling]

# custom plotting helper
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    threshold = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > threshold else "black")  # white text on dark cells (count above the threshold), black otherwise, for readability

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# plain logistic regression: compute the recall
from sklearn.linear_model import LogisticRegression
lg_clf=LogisticRegression()
lg_clf.fit(X_train_new,Y_train_new)
lg_pred=lg_clf.predict(X_test)
cnf_matrix_lg = confusion_matrix(Y_test,lg_pred)  # build the confusion matrix
np.set_printoptions(precision=2)  # print with two decimal places
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure(figsize=(6,4))
plt.subplot(111)
plot_confusion_matrix(cnf_matrix_lg
                      , classes=class_names
                      , title='logit_Confusion matrix,recall is {:.4f}'.format(cnf_matrix_lg[1,1]/(cnf_matrix_lg[1,0]+cnf_matrix_lg[1,1])))
print("Recall of logistic regression on the test set: ", cnf_matrix_lg[1,1]/(cnf_matrix_lg[1,0]+cnf_matrix_lg[1,1]))
plt.savefig(r"./logistic_regression.jpg",dpi=600)
plt.show()

[Figure: confusion matrix of the logistic regression model, with the recall in the title]
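
The recall expression used above reads two cells of the confusion matrix, whose rows are the true labels and whose columns are the predictions. The sketch below rebuilds such a matrix by hand on made-up labels, following the same row and column layout as scikit-learn's confusion_matrix, and derives recall, precision and accuracy from it so the three are not confused with one another.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # made-up true labels
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])   # made-up predictions

# 2x2 confusion matrix: rows are true labels, columns are predicted labels.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)          # same as cm[1, 1] / (cm[1, 0] + cm[1, 1])
precision = tp / (tp + fp)       # share of flagged transactions that are fraud
accuracy = (tp + tn) / cm.sum()  # share of all transactions labelled correctly

print(cm)
print(recall, precision, accuracy)
```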

# cross-validation and automatic hyperparameter tuning with GridSearchCV
# classify with logistic regression
from sklearn.linear_model import LogisticRegression
para_logit= {'C': [100,1,10]}  # candidate values for C
clf=GridSearchCV(LogisticRegression(),param_grid=para_logit,cv=10,n_jobs=-1)  # build the classifier with 10-fold cross-validation
clf.fit(X_train_new,Y_train_new)  # train on the over-sampled training set
print("Best parameters: {}".format(clf.best_params_))
print("Best cross-validation score (mean accuracy): {:.5f}".format(clf.best_score_))

# predict
y_pred = clf.predict(X_test)
print("Accuracy on the test set: {:.5f}".format(accuracy_score(Y_test, y_pred)))

# visualize the results
# Compute confusion matrix
cnf_matrix_lg = confusion_matrix(Y_test,lg_pred)  # confusion matrix of plain logistic regression
cnf_matrix_gd = confusion_matrix(Y_test,y_pred)  # confusion matrix of the GridSearchCV-tuned model
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure(figsize=(16,8))
plt.subplot(121)
plot_confusion_matrix(cnf_matrix_lg
                      , classes=class_names
                      , title='lg_Confusion matrix,recall is {:.4f}'.format(cnf_matrix_lg[1,1]/(cnf_matrix_lg[1,0]+cnf_matrix_lg[1,1])))
plt.subplot(122)
plot_confusion_matrix(cnf_matrix_gd
                      , classes=class_names
                      , title='GridSearchCV_Confusion matrix,recall is {:.4f}'.format(cnf_matrix_gd[1,1]/(cnf_matrix_gd[1,0]+cnf_matrix_gd[1,1])))
print("Recall of logistic regression on the test set: ", cnf_matrix_lg[1,1]/(cnf_matrix_lg[1,0]+cnf_matrix_lg[1,1]))
print("Recall of GridSearchCV on the test set: ", cnf_matrix_gd[1,1]/(cnf_matrix_gd[1,0]+cnf_matrix_gd[1,1]))
plt.show()

[Figure: confusion matrices of plain logistic regression and the GridSearchCV-tuned model]

10. Model evaluation

# vary the probability threshold used to flag fraud, and thereby tune the model's recall
from itertools import cycle
y_pred_proba=clf.predict_proba(X_test)
thresholds=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]  # candidate thresholds; the default threshold is 0.5
plt.figure(figsize=(12,8))
j=1
re=[]
pr=[]
a=[]
for i in thresholds:
    y_test_predictions_high_recall=y_pred_proba[:,1]>i  # flag as fraud when the predicted fraud probability exceeds the threshold, otherwise normal
    plt.subplot(3,3,j)
    j+=1
    # Compute confusion matrix
    cnf_matrix1 = confusion_matrix(Y_test, y_test_predictions_high_recall)
    fpr, tpr, _ = roc_curve(Y_test, y_test_predictions_high_recall)
    area = auc(fpr, tpr)
    recall_rate=cnf_matrix1[1,1]/(cnf_matrix1[1,0]+cnf_matrix1[1,1])  # recall: TP / (TP + FN)
    precision_rate=cnf_matrix1[1,1]/(cnf_matrix1[0,1]+cnf_matrix1[1,1])  # precision: TP / (TP + FP)
    print("When threshold is {0},  Recall rate is {1:0.5f},  AUC is {2:.5f}".format(i, recall_rate,area))
    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix1, classes=class_names,title="threshold>{}".format(i))
    re.append(recall_rate)
    pr.append(precision_rate)
    a.append(area)
plt.savefig('./model_evaluation.jpg')

[Figure: confusion matrices for thresholds 0.1 to 0.9]

# trend plot
plt.figure(figsize=(8,6))
plt.plot(thresholds,re,label="recall_rate")
plt.plot(thresholds,pr,label="precision_rate")
plt.plot(thresholds,a,label="Value of AUC")
plt.legend(fontsize=23)
plt.xlabel("threshold",fontsize=19)
plt.ylabel("value",fontsize=19)
plt.title("Recall, Precision rate and thresholds")
plt.show()

[Figure: recall, precision and AUC as functions of the threshold]

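Finding the best threshold for the model:
Precision and recall pull against each other. The confusion matrices and curves above show that the lower the threshold, the higher the recall: the model catches more fraudulent transactions, but at the cost of more false alarms. As the threshold rises, recall gradually falls, precision gradually rises, and the number of false alarms drops. Adjusting the threshold therefore controls how aggressive the anti-fraud model is: set a lower threshold to catch more fraud, and a higher one otherwise.
In practice, the choice of threshold depends on comparing the marginal profit and marginal cost of the business. A low threshold does find more cardholders whose cards have been used fraudulently, but as false alarms increase it adds to the workload of the post-loan team and also hurts the experience of customers whose normal transactions are misclassified as fraud, which lowers customer satisfaction. The threshold at which marginal profit and marginal cost balance is the optimal one. There are exceptions: during a financial crisis the probability of loan defaults and card fraud tends to rise, and financial institutions are then more willing to set a low threshold and hold the line on risk at any cost.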
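
One simple way to make this trade-off concrete is to assign a cost to each kind of error and pick the threshold with the lowest total cost. The sketch below does that on made-up labels and probabilities. The two cost constants are pure assumptions for illustration, not figures from this project; in a real setting they would come from the business.

```python
import numpy as np

# Made-up labels and predicted fraud probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.35, 0.65, 0.70, 0.90])

COST_MISSED_FRAUD = 100.0   # assumed cost of one undetected fraud (false negative)
COST_FALSE_ALARM = 5.0      # assumed cost of reviewing one false alarm (false positive)

def total_cost(threshold):
    """Total cost of the errors made when flagging proba > threshold as fraud."""
    pred = proba > threshold
    false_neg = np.sum((y_true == 1) & ~pred)
    false_pos = np.sum((y_true == 0) & pred)
    return false_neg * COST_MISSED_FRAUD + false_pos * COST_FALSE_ALARM

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
costs = [total_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]   # threshold with the lowest total cost

for t, c in zip(thresholds, costs):
    print(f"threshold {t}: cost {c}")
print("lowest-cost threshold:", best)
```

When a missed fraud is assumed to be much more expensive than a false alarm, as here, the lowest-cost threshold sits below the default of 0.5, which matches the reasoning in the text.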
