Predicting Whether Bank Customers Subscribe to a Term Deposit with Decision Trees and Random Forests

I recently got acquainted with decision trees (Decision Tree, DT) and applied them to data analysis. The task: given bank customer information, use a DT and an RF to predict whether a customer subscribes to the bank's term deposit product. This post records the processing steps. First version: afternoon of 2017.10.16. Updated: 2017.10.26.

1 Data Source

2 Experimental Environment

3 Data Preprocessing

4 Algorithm Implementation

5 References


1 Data Source

The download link was lost when I cleaned up Chrome; following the data provider's request, the information below is listed for anyone who wants to use the data.

This dataset is public available for research. The details are described in [Moro et al., 2011].

Please include this citation if you plan to use this database:

[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

Available at: [pdf] http://hdl.handle.net/1822/14838

              [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

The data come in two .csv files, a reduced version and a full version. Among the 16 features, missing values of the qualitative features are recorded as unknown; the quantitative features have no missing values. The default label takes the values yes and no.

2 Experimental Environment

The whole task is carried out in Python under Spyder; the external modules used are pandas, matplotlib and sklearn.

3 Data Preprocessing

Opening the .csv file that stores the data, this is what the data look like:

The first row contains the feature names; each following row stores one customer's record in a single cell, with the feature values separated by semicolons ';'. To make processing easier, the data are split so that each column represents one feature.

The code is as follows:

import os

os.chdir('C:/WinPython-64bit-3.6.1.0Qt5/notebooks/DT/')
ori_file = open('bank-full.csv')
tmpdata = ori_file.readlines()
# strip the double quotes, then split each line on the ';' delimiter
data0 = [''.join([term1 for term1 in term if term1 != '"']).split(';') for term in tmpdata]
# turn numeric strings into int; the term1[1:] check also catches negatives such as '-1'
data1 = [[int(term1) if term1[1:].isnumeric() or term1.isnumeric() else term1
          for term1 in term] for term in data0]
# trim the trailing '\n' from the last field of every row
data1 = [[term1[:-1] if isinstance(term1, str) and term1.endswith('\n') else term1
          for term1 in term] for term in data1]

The comprehension that builds data0 splits each row on the semicolon delimiter, so every element of data0 is a list of 17 items: the customer's 16 features plus the label.

If data0 were processed directly, one notices that every quantitative feature value is kept as a str and cannot be used as a number: an age of 58 comes back as '58' rather than 58. The second comprehension therefore converts numeric strings to int. Since every quantitative feature in this table is an integer, int(term1) performs the conversion without introducing error. The result is stored in data1.

Later, when fitting sklearn's decision tree to the data, a problem appeared: some entries in data1 ended with '\n' (a newline), which broke the fit. Inspecting the data shows that the last element of each row's list, i.e. the customer's last feature, ends with '\n' both in the feature-name row and in the value rows, while no other part of the data contains '\n'. The final comprehension therefore trims the trailing character from strings ending in '\n'; the resulting data1 is free of '\n' and safe to use.
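As an aside, the whole manual pipeline above (quote stripping, semicolon splitting, numeric conversion, newline trimming) can likely be reproduced in a single pandas call. A minimal sketch on an inline two-row sample in the same format (the sample values are invented; with the real file one would pass the path instead of the StringIO):

```python
import io
import pandas as pd

# a tiny stand-in for bank-full.csv: semicolon-delimited, quoted strings
sample = '"age";"job";"default"\n58;"management";"no"\n44;"technician";"yes"\n'
df = pd.read_csv(io.StringIO(sample), sep=';', quotechar='"')

print(df['age'].dtype)   # numeric columns are parsed as integers automatically
print(df.loc[0, 'job'])  # quotes and trailing newlines are already stripped
```

read_csv handles the delimiter, the quote characters and the type inference in one pass, so data0/data1 would not be needed at all.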

Next the data are converted to a DataFrame, which I find easier to work with, and saved locally:

import pandas as pd

# the first row holds the feature names, the remaining rows are records
data = pd.DataFrame(data1[1:], columns=data1[0])
writer = pd.ExcelWriter('bank-full.xls')
data.to_excel(writer, 'Sheet1')
writer.save()
del tmpdata, ori_file, data0, data1

For the algorithm, I split the original data into training and test sets at a ratio of 8:2, i.e. 80% for training and the rest for testing. The training set is drawn at random from the full data, so each run produces a different training set. The code:

import random, time

training_ratio = 0.8
random.seed(time.time())                 # seed from the current time
tmp_tt_index = list(range(data.shape[0]))
random.shuffle(tmp_tt_index)             # shuffle the row indices
n_train = int(len(tmp_tt_index) * training_ratio)
training_index = sorted(tmp_tt_index[:n_train])
test_index = sorted(tmp_tt_index[n_train:])
del tmp_tt_index

The seed for the random numbers is taken from the current time at run time. random.shuffle scrambles the index sequence, and training_index is the sorted, randomly drawn set of training indices. The training and test sets are then generated as follows:
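The shuffle-and-slice split above is essentially what scikit-learn's train_test_split does. A sketch on a stand-in index list (random_state is my own addition, for reproducibility; with the real data one would pass list(range(data.shape[0]))):

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # stand-in for the row indices of the full data
train_idx, test_idx = train_test_split(rows, train_size=0.8, random_state=0)

print(len(train_idx), len(test_idx))  # 8 2
```

This removes the need for the manual shuffle, slicing and cleanup.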

features_selected = list(data.columns)
features_selected.remove('default')   # drop the label column from the features
data_training = pd.DataFrame(data.loc[training_index, features_selected], columns=features_selected)
data_test_ori = pd.DataFrame(data.loc[test_index, features_selected], columns=features_selected)

As for the data fed to sklearn's decision tree, I currently do not know how to handle non-numeric feature values, so every qualitative feature value has to be converted to a number, including turning the 'yes' & 'no' values of the default label into digits. The code below does this; it looks clumsy:

# the label vectors (their definition was omitted above) are assumed to be
# the 'default' column at the training and test indices
label_vec_training = data.loc[training_index, 'default']
label_vec_test = data.loc[test_index, 'default']

data_tra_mat = data_training.values
data_test_mat = data_test_ori.values
label_vec_tra_mat = label_vec_training.values
label_vec_test_mat = label_vec_test.values

# map the binary 'yes'/'no' labels to 1/0
for i in range(len(label_vec_tra_mat)):
    if label_vec_tra_mat[i] == 'yes':
        label_vec_tra_mat[i] = 1
    elif label_vec_tra_mat[i] == 'no':
        label_vec_tra_mat[i] = 0
for i in range(len(label_vec_test_mat)):
    if label_vec_test_mat[i] == 'yes':
        label_vec_test_mat[i] = 1
    elif label_vec_test_mat[i] == 'no':
        label_vec_test_mat[i] = 0

# map 'yes'/'no' feature values (housing, loan, ...) to 1/0
for i in range(data_tra_mat.shape[0]):
    for j in range(data_tra_mat.shape[1]):
        if data_tra_mat[i, j] == 'yes':
            data_tra_mat[i, j] = 1
        elif data_tra_mat[i, j] == 'no':
            data_tra_mat[i, j] = 0
for i in range(data_test_mat.shape[0]):
    for j in range(data_test_mat.shape[1]):
        if data_test_mat[i, j] == 'yes':
            data_test_mat[i, j] = 1
        elif data_test_mat[i, j] == 'no':
            data_test_mat[i, j] = 0

# for each remaining string column, replace every value by its index in the
# sorted list of unique training values
for i in range(data_tra_mat.shape[1]):
    if isinstance(data_tra_mat[1, i], str):
        list_space = sorted(set(data_tra_mat[:, i]))
        for j in range(data_tra_mat.shape[0]):
            data_tra_mat[j, i] = list_space.index(data_tra_mat[j, i])
        for k in range(data_test_mat.shape[0]):
            data_test_mat[k, i] = list_space.index(data_test_mat[k, i])

In the final loop, the sorted unique values of each qualitative feature act as a value space, and each value is replaced by its index in that space.
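The index-in-sorted-uniques mapping that list_space implements can be illustrated on a toy column; a sketch (the example values are invented):

```python
import pandas as pd

col = pd.Series(['married', 'single', 'married', 'divorced'])
# sorted unique values play the role of list_space in the loop above
categories = sorted(col.unique())           # ['divorced', 'married', 'single']
encoded = col.map(categories.index)         # each value -> its index

print(list(encoded))  # [1, 2, 1, 0]
```

This is plain ordinal encoding; it works, but note that it imposes an arbitrary order on categories that have none.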

At this point preprocessing is basically done, but calling sklearn on the data now still produces an error message. The following code is needed:

from sklearn import preprocessing

lab_enc = preprocessing.LabelEncoder()
label_vec_tra_mat = lab_enc.fit_transform(label_vec_tra_mat)

The details of this code are unknown to me; it comes from the answer in reference [1].
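For what it is worth, what fit_transform appears to do is map each distinct label to an integer index and return a proper integer array; my guess is that the earlier error comes from the object-dtype array left over from the manual conversion, which sklearn's label-type check rejects. A minimal demonstration:

```python
from sklearn import preprocessing

lab_enc = preprocessing.LabelEncoder()
y = lab_enc.fit_transform(['no', 'yes', 'no', 'yes', 'no'])

print(y)                 # [0 1 0 1 0] -- a clean integer ndarray
print(lab_enc.classes_)  # ['no' 'yes'], the learned label order
```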

Now the data can be handed to sklearn.

4 Algorithm Implementation

With the preprocessing out of the way, CART classification takes only a few lines:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve
from sklearn.externals import joblib   # in recent sklearn versions: import joblib
import matplotlib.pyplot as plt

treefile = 'C:/WinPython-64bit-3.6.1.0Qt5/notebooks/DT/tree.pkl'
tree = DecisionTreeClassifier()
tree.fit(data_tra_mat, label_vec_tra_mat)
joblib.dump(tree, treefile)   # persist the fitted tree

# ROC curve on the test set, based on the predicted probability of class 1
fpr, tpr, thresholds = roc_curve(label_vec_test_mat,
                                 tree.predict_proba(data_test_mat)[:, 1],
                                 pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of CART')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()

The result is evaluated with the ROC curve:

Briefly, the larger the area under the ROC curve, the better the classification; the ideal point is true positive rate = 1.0 and false positive rate = 0.0. The curve obtained here is only marginally better than the diagonal y = x, so the classification is not satisfactory.
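The area under the curve can also be computed directly with roc_auc_score; a sketch with invented scores, not the bank data:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class
auc = roc_auc_score(y_true, y_score)

print(auc)  # 0.75 -- 0.5 matches the diagonal, 1.0 is a perfect ranking
```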

Two possible improvements come to mind: feature selection, and the possibility that a DT alone cannot classify these data well, in which case a random forest could be added.

Or did I get something wrong somewhere?

Update, 2017.10.26:

While working on the data I reread the provider's paper and noticed that the provider does not use the data to predict default, but to predict whether the client subscribes to a term deposit. This study is adjusted accordingly: the label changes from default to y, the subscription label; all other columns serve as features/attributes.

After the DT, the data are further processed with an RF. The predictions reach an accuracy above 89%. However, since the overwhelming majority of the labels are no, i.e. no subscription, accuracy alone does not adequately describe predictive performance, so I added precision, recall and F1: precision = TP/(TP+FP), recall = TP/(TP+FN), f1 = 2/(1/precision + 1/recall), i.e. the harmonic mean of precision and recall. All three are computed from the confusion matrix.
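The metric definitions above can be checked on a toy imbalanced prediction (all numbers invented, not the actual run):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # mostly 'no', like the y label
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / precision + 1 / recall)

print(precision, recall, f1)           # 0.5 0.5 0.5
print(f1 == f1_score(y_true, y_pred))  # True
```

Note that this toy classifier scores 80% accuracy while both precision and recall sit at 0.5, which is exactly why accuracy alone is misleading here.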

The result: precision > 70%, recall ≈ 11%.

Since the data have already been preprocessed to some degree, this study contains no feature engineering. By their nature, DT and RF require no data scaling or centering, and overfitting can be curbed by limiting the max depth and the max number of nodes.
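These pruning knobs map directly onto the sklearn constructor; a sketch on synthetic data with illustrative, untuned parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in: 16 features, like the bank data
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# max_depth and max_leaf_nodes cap tree growth, which limits overfitting;
# no scaling or centering of X is needed for tree-based models
rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                            max_leaf_nodes=20, random_state=0)
rf.fit(X, y)

print(max(t.get_depth() for t in rf.estimators_))  # never exceeds 5
```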

5 References

[1] https://stackoverflow.com/questions/41925157/logisticregression-unknown-label-type-continuous-using-sklearn-in-python (this answer resolved the error raised when feeding the data to sklearn)

[2] Zhang Liangjun et al., Python Data Analysis and Mining in Action, China Machine Press, 2016.


Appendix:

Dataset feature information

1. Title: Bank Marketing

2. Sources

Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012

3. Past Usage:

The full dataset was described and analyzed in:

S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.

In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães,

Portugal, October, 2011. EUROSIS.

4. Relevant Information:

The data is related with direct marketing campaigns of a Portuguese banking institution.

The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required,

in order to assess if the product (bank term deposit) would be (or not) subscribed.

There are two datasets:

1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).

2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.

The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).

The classification goal is to predict if the client will subscribe a term deposit (variable y).

5. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

6. Number of Attributes: 16 + output attribute.

7. Attribute information:

For more information, read [Moro et al., 2011].

Input variables:

# bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",

"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: "yes","no")

8. Missing Attribute Values: None
