Most data-mining enthusiasts will have heard of Kaggle. Compared with China's Tianchi big-data platform, Kaggle hosts more projects and they tend to be more lightweight, making it a better playground for beginners without powerful hardware.
This post explains how to reach the top 10% in the Kaggle starter competition, Titanic survival prediction, even with only a partial understanding of machine-learning algorithms.
Kaggle's Kernels section hosts many shared solutions, in both R and Python, and they are well worth studying. Most of the Python ones are written as Jupyter Notebooks.

1. Data Overview
The Titanic competition provides two files, train.csv and test.csv: the training set and the test set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
train_data = pd.read_csv('I://model/titanic/train.csv')
test_data = pd.read_csv('I://model/titanic/test.csv')
train_data.info()
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Survival proportions:
train_data['Survived'].value_counts().plot.pie(autopct='%1.1f%%')

2. Analyzing Relationships in the Data
(1) Sex and survival
train_data.groupby(['Sex','Survived'])['Survived'].count()
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()

(2) Passenger class and survival
train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar(color=['r','g','b'])

train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()

train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count()
Sex Pclass Survived
female 1 0 3
1 91
2 0 6
1 70
3 0 72
1 72
male 1 0 77
1 45
2 0 91
1 17
3 0 300
1 47
Name: Survived, dtype: int64
The figures and table above show that, while the evacuation broadly followed "women first", outcomes still differed by class, and some first-class men used their social standing to force their way onto the lifeboats. White Star Line chairman J. Bruce Ismay, for example (who had vetoed fitting 48 lifeboats, reasoning that fewer would do), abandoned his passengers, his crew, and his ship, jumping into collapsible lifeboat C (which held 39 occupants) at the last moment.
(3) Age and survival
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot("Pclass","Age", hue="Survived", data=train_data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age", hue="Survived", data=train_data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

(4) Title and survival
The Name field contains each passenger's title, such as Mr, Miss, or Mrs. Titles carry age and sex information, and some, like Dr, Lady, Major, and Master, may also indicate social status.
This feature is hard to chart directly, but we will add it to the feature set during feature engineering.
(5) Port of embarkation and survival
Titanic departed from Southampton, England, calling at Cherbourg, France, and Queenstown, Ireland; passengers who disembarked at Cherbourg or Queenstown escaped the disaster.
sns.countplot('Embarked',hue='Survived',data=train_data)
plt.title('Embarked and Survived')

(6) Relatives aboard and survival
f,ax=plt.subplots(1,2,figsize=(18,8))
train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')

The charts show that passengers traveling alone had a low survival rate, but having too many relatives aboard was also dangerous, as it was hard to keep track of everyone.
(7) Other factors
The remaining fields are fare, cabin number, and ticket number. All three could affect a passenger's location on the ship and hence the evacuation order, but none shows an obvious pattern against survival, so during model ensembling we let the models decide how much they matter.
3. Feature Engineering
First merge train and test so that feature engineering is applied to both at once:
train_data_org = pd.read_csv('train.csv')
test_data_org = pd.read_csv('test.csv')
test_data_org['Survived'] = 0
combined_train_test = pd.concat([train_data_org, test_data_org])  # DataFrame.append is removed in recent pandas
Feature engineering extracts, from the raw fields, the features likely to influence the outcome, which become the model's inputs. It is usually best to start with the fields that contain missing values (NaN).
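A quick way to see which fields need attention is to count the NaNs per column. A minimal sketch, using a small toy frame standing in for combined_train_test:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for combined_train_test
df = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})
# Count missing values per column, keep only the columns that have gaps
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data this immediately flags Age, Cabin, Embarked, and Fare as the columns to handle first.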
(1) Embarked
First fill the missing values, using the mode of Embarked:
if combined_train_test['Embarked'].isnull().sum() != 0:
    combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)
Then one-hot encode the three embarkation ports into three columns, each containing only 0 and 1:
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'],prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
(2) Sex
No missing values; one-hot encode directly:
sex_dummies_df = pd.get_dummies(combined_train_test['Sex'], prefix=combined_train_test[['Sex']].columns[0])
combined_train_test = pd.concat([combined_train_test, sex_dummies_df], axis=1)
(3) Name
Extract the title from the name:
combined_train_test['Title'] = combined_train_test['Name'].str.extract(r'.+,(.+)', expand=False).str.extract(r'^(.+?)\.', expand=False).str.strip()
Map the assorted titles onto a common set:
title_Dict = {}
title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
title_Dict.update(dict.fromkeys(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
title_Dict.update(dict.fromkeys(['Master'], 'Master'))
combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)
One-hot encode:
title_dummies_df = pd.get_dummies(combined_train_test['Title'], prefix=combined_train_test[['Title']].columns[0])
combined_train_test = pd.concat([combined_train_test, title_dummies_df], axis=1)
(4) Fare
Fill the NaN with the mean fare of the corresponding passenger class:
if combined_train_test['Fare'].isnull().sum() != 0:
    combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(combined_train_test.groupby('Pclass').transform('mean'))
Titanic sold family/group tickets (detectable from shared Ticket numbers), so group fares need to be split evenly per person:
combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']
combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)
Bin the fares into categories:
def fare_category(fare):
    if fare <= 4:
        return 0
    elif fare <= 10:
        return 1
    elif fare <= 30:
        return 2
    elif fare <= 45:
        return 3
    else:
        return 4
combined_train_test['Fare_Category'] = combined_train_test['Fare'].map(fare_category)
One-hot encode (this feature works either way, encoded or not):
fare_cat_dummies_df = pd.get_dummies(combined_train_test['Fare_Category'],prefix=combined_train_test[['Fare_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fare_cat_dummies_df], axis=1)
(5) Pclass
Pclass itself needs no cleaning, but to get more out of it we assume that, within each class, the fare paid also relates to how passengers escaped, yielding categories such as high-fare first class and low-fare first class.
Pclass_1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass_2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass_3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]
# Build the Pclass_Fare_Category column
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(Pclass_1_mean_fare, Pclass_2_mean_fare, Pclass_3_mean_fare), axis=1)
p_fare = LabelEncoder()
p_fare.fit(np.array(['Pclass_1_Low_Fare', 'Pclass_1_High_Fare', 'Pclass_2_Low_Fare', 'Pclass_2_High_Fare', 'Pclass_3_Low_Fare', 'Pclass_3_High_Fare']))  # assign a label to each category
combined_train_test['Pclass_Fare_Category'] = p_fare.transform(combined_train_test['Pclass_Fare_Category'])  # convert labels to numeric codes
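The pclass_fare_category helper is not shown in this excerpt. A plausible reconstruction, an assumption rather than the author's exact code, splits each class at its mean fare:

```python
def pclass_fare_category(df, pclass_1_mean_fare, pclass_2_mean_fare, pclass_3_mean_fare):
    """Label a passenger row by class and by low/high fare within that class.

    Hypothetical reconstruction: the split point is assumed to be each
    class's mean fare, matching the three means computed above.
    """
    if df['Pclass'] == 1:
        return 'Pclass_1_Low_Fare' if df['Fare'] <= pclass_1_mean_fare else 'Pclass_1_High_Fare'
    elif df['Pclass'] == 2:
        return 'Pclass_2_Low_Fare' if df['Fare'] <= pclass_2_mean_fare else 'Pclass_2_High_Fare'
    else:
        return 'Pclass_3_Low_Fare' if df['Fare'] <= pclass_3_mean_fare else 'Pclass_3_High_Fare'
```

Any function returning the six labels expected by the LabelEncoder would fit here.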
(6) Parch and SibSp
Both fields clearly affect Survived, but in different ways, so we combine them into a Family_Size feature while keeping the originals.
combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
fam_size_cat_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fam_size_cat_dummies_df], axis=1)
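The family_size_category helper is likewise not shown in this excerpt. A sketch consistent with the three labels fed to the LabelEncoder above; the bucket boundaries are an assumption:

```python
def family_size_category(family_size):
    """Bucket Family_Size into the three labels used by le_family.

    Hypothetical thresholds: alone, up to four people, larger than four.
    """
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'
```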
(7) Age
Age has too many missing values to fill with the mode or mean directly. Two approaches are common: fill with the mean age of each Title group (Mr, Master, Miss, ...), optionally refined across several fields (Sex, Title, Pclass); or treat Age as a target and predict it from the other features with a machine-learning model. This post takes the second approach.
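For reference, the first approach (not used here) comes down to one grouped transform. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: two Titles, one missing Age in each group
df = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],
    'Age':   [30.0, np.nan, 20.0, np.nan, 40.0],
})
# Fill each missing Age with the mean Age of passengers sharing the same Title
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
print(df['Age'].tolist())  # [30.0, 35.0, 20.0, 20.0, 40.0]
```

Grouping on ['Sex', 'Title', 'Pclass'] instead of 'Title' gives the refined variant mentioned above.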
Rows with Age present form the training set; rows with Age missing form the test set.
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Parch', 'Sex', 'SibSp', 'Family_Size', 'Family_Size_Category',
'Title', 'Fare', 'Fare_Category', 'Pclass', 'Embarked']])
missing_age_df = pd.get_dummies(missing_age_df,columns=['Title', 'Family_Size_Category', 'Fare_Category', 'Sex', 'Pclass' ,'Embarked'])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
Build an ensemble of two regressors:
def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # Model 1: gradient boosting
    gbm_reg = ensemble.GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [3], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])
    # Model 2: linear regression
    lrf_reg = LinearRegression()
    lrf_reg_param_grid = {'fit_intercept': [True], 'normalize': [True]}
    lrf_reg_grid = model_selection.GridSearchCV(lrf_reg, lrf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    lrf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best LR Params:' + str(lrf_reg_grid.best_params_))
    print('Age feature Best LR Score:' + str(lrf_reg_grid.best_score_))
    print('LR Train Error for "Age" Feature Regressor' + str(lrf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_LRF'] = lrf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_LRF'][:4])
    # Use the row-wise mean of the two models' predictions as the final estimate
    missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LRF']].mean(axis=1)
    print(missing_age_test['Age'][:4])
    drop_col_not_req(missing_age_test, ['Age_GB', 'Age_LRF'])
    return missing_age_test
Fill in Age:
combined_train_test.loc[combined_train_test.Age.isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age']
(8) Ticket
Split Ticket into its letter prefix and numeric part, Ticket_Letter and Ticket_Number:
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x:np.nan if x.isnumeric() else x)
combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x,errors='coerce'))
combined_train_test['Ticket_Number'].fillna(0,inplace=True)
combined_train_test = pd.get_dummies(combined_train_test,columns=['Ticket','Ticket_Letter'])
(9) Cabin
Cabin is mostly missing, so only coarse information, such as the deck letter (the first character of Cabin), can be used for modeling:
combined_train_test['Cabin_Letter'] = combined_train_test['Cabin'].apply(lambda x:str(x)[0] if pd.notnull(x) else x)
combined_train_test = pd.get_dummies(combined_train_test,columns=['Cabin','Cabin_Letter'])
Once done, split train and test back apart:
train_data = combined_train_test[:891]
test_data = combined_train_test[891:]
titanic_train_data_X = train_data.drop(['Survived'],axis=1)
titanic_train_data_Y = train_data['Survived']
titanic_test_data_X = test_data.drop(['Survived'],axis=1)
4. Model Ensembling
Ensembling proceeds in two stages:
(1) Use several models to screen for the most important features:
def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):
    # Random forest
    rf_est = RandomForestClassifier(random_state=42)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort features by importance
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 25 Features from RF Classifier')
    print(str(features_top_n_rf[:25]))
    # AdaBoost
    ada_est = ensemble.AdaBoostClassifier(random_state=42)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.5, 0.6]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort features by importance
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    # ExtraTrees
    et_est = ensemble.ExtraTreesClassifier(random_state=42)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [15]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort features by importance
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 25 Features from ET Classifier:')
    print(str(features_top_n_et[:25]))
    # Merge the top_n features picked by the three models, dropping duplicates
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et], ignore_index=True).drop_duplicates()
    return features_top_n
(2) Build the training and test sets from the selected features:
feature_to_pick = 250
feature_top_n = get_top_n_features(titanic_train_data_X,titanic_train_data_Y,feature_to_pick)
titanic_train_data_X = titanic_train_data_X[feature_top_n]
del titanic_train_data_X['Ticket_Number']  # dropping Ticket_Number turned out to improve the score
titanic_test_data_X = titanic_test_data_X[feature_top_n]
del titanic_test_data_X['Ticket_Number']
(3) Build the final predictor with a VotingClassifier:
rf_est = ensemble.RandomForestClassifier(n_estimators = 750, criterion = 'gini', max_features = 'sqrt',
max_depth = 3, min_samples_split = 4, min_samples_leaf = 2,
n_jobs = 50, random_state = 42, verbose = 1)
gbm_est = ensemble.GradientBoostingClassifier(n_estimators=900, learning_rate=0.0008, loss='exponential',
min_samples_split=3, min_samples_leaf=2, max_features='sqrt',
max_depth=3, random_state=42, verbose=1)
et_est = ensemble.ExtraTreesClassifier(n_estimators=750, max_features='sqrt', max_depth=35, n_jobs=50,
criterion='entropy', random_state=42, verbose=1)
voting_est = ensemble.VotingClassifier(estimators = [('rf', rf_est),('gbm', gbm_est),('et', et_est)],
voting = 'soft', weights = [3,5,2],
n_jobs = 50)
voting_est.fit(titanic_train_data_X,titanic_train_data_Y)
PS: If you'd rather not use VotingClassifier, you can weight each model's predictions yourself according to its test accuracy and take the weighted average as the final prediction; in my experience the hand-weighted version is no worse than the VotingClassifier.
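The hand-weighted alternative might look like this. A sketch on synthetic data; the weights mirror the VotingClassifier's [3, 5, 2] above and are not tuned:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Synthetic stand-in for the Titanic training matrix
X, y = make_classification(n_samples=200, random_state=42)

weighted_models = [
    (RandomForestClassifier(n_estimators=50, random_state=42), 3),
    (GradientBoostingClassifier(n_estimators=50, random_state=42), 5),
    (ExtraTreesClassifier(n_estimators=50, random_state=42), 2),
]

# Weighted average of each model's positive-class probability,
# then threshold at 0.5 -- the manual equivalent of soft voting
proba = np.zeros(len(X))
total_weight = sum(w for _, w in weighted_models)
for model, weight in weighted_models:
    model.fit(X, y)
    proba += weight * model.predict_proba(X)[:, 1]
pred = (proba / total_weight >= 0.5).astype(int)
```

Replacing the fixed weights with each model's cross-validation accuracy gives the accuracy-weighted scheme described above.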
(4) Predict and generate the submission file:
titanic_test_data_X['Survived'] = voting_est.predict(titanic_test_data_X)
submission = pd.DataFrame({'PassengerId':test_data_org.loc[:,'PassengerId'],
'Survived':titanic_test_data_X.loc[:,'Survived']})
submission.to_csv('submission_result.csv',index=False,sep=',')
That completes the pipeline.
The code above draws in part on shared Kernels; its best score was 80.8%, ranking in the top 8%. At over 400 lines it is fairly elaborate, considers the contributing factors thoroughly, and its functions are written in a clean, conventional style, making it good study material for newcomers.
My own earlier, messier attempt only reached 80.3% (top 12%), so I won't share it here.

Full code: https://github.com/Arctanxy/Titanic_Voting_Classifier/blob/master/VotingClassifier.py