Introduction to Machine Learning (a Kaggle Competition Project)

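Using the Titanic survival prediction project I entered on Kaggle as an example, this article briefly shares my own understanding of machine learning. Machine learning covers a lot of ground, mainly supervised learning and unsupervised learning. Supervised learning generally includes classification and regression problems, while unsupervised learning includes clustering and dimensionality reduction, among others. This article deals with a classification problem in supervised learning.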

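This article is organized around two topics: data preprocessing, and model selection and optimization.
(1) Data preprocessing (feature engineering)
Data preprocessing here means tidying up the data before any algorithm is applied. In larger projects this stage is usually called feature engineering, of which data preprocessing is one part. The preprocessing in this article mainly covers handling missing values and outliers, and normalizing and discretizing the data. The main steps are explained in the code below.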

1. Reading the files
    import pandas as pd

    train = pd.read_csv('desktop/titanic/train.csv')
    test = pd.read_csv('desktop/titanic/test.csv')
    data_full = [train, test]  # process both sets with the same loops
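2. Handling missing values
    for dataset in data_full:
        dataset['Embarked'] = dataset['Embarked'].fillna('S')  # replace missing values with the mode
        dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())  # replace missing fares with the median
        dataset.drop(['Cabin'], axis=1, inplace=True)  # drop the Cabin column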
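The fill values above are easier to justify after counting what is actually missing. The following is a minimal sketch on a made-up toy frame (the values are invented for illustration, not taken from the Titanic files) that counts missing entries per column and derives the mode and median from the data instead of hard-coding them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Fare": [7.25, 71.28, np.nan, 8.05, 13.0],
})

# Count missing values per column before deciding how to fill them.
missing_before = toy.isna().sum()

# Mode for the categorical column, median for the numeric one.
embarked_mode = toy["Embarked"].mode()[0]
fare_median = toy["Fare"].median()

toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
toy["Fare"] = toy["Fare"].fillna(fare_median)

missing_after = toy.isna().sum()
print(missing_before, missing_after, sep="\n")
```

Computing the mode from the column keeps the code correct if the data changes, at the cost of hiding the concrete value that a hard-coded 'S' makes visible.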
3. Discretizing the data: binning by age
    for dataset in data_full:
        # Rows with a missing Age match none of these conditions and stay NaN for now.
        dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
        dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
        dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
        dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
        dataset.loc[(dataset['Age'] > 64), 'Age'] = 4
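The same age bands can also be expressed with `pd.cut`, which assigns every value to a bin in one call. This is only a sketch of an alternative on a small hand-written series, not the code used in the project; the bin edges mirror the thresholds above, and the sample ages are assumptions chosen to hit each boundary.

```python
import numpy as np
import pandas as pd

# Sample ages chosen to land on and between the bin edges; NaN stands for a missing age.
ages = pd.Series([4.0, 16.0, 22.0, 32.0, 40.0, 64.0, 80.0, np.nan])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index instead of an interval label.
age_band = pd.cut(ages, bins=bins, labels=False)
print(age_band.tolist())
```

Because the bins are computed from the original values in a single pass, there is no risk of an already-binned value being matched again by a later condition, and missing ages simply remain NaN.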
4. Converting the data: turning string data into numeric data
    for dataset in data_full:
        # Keep only the title in Name (the word before the first period), e.g. "Mr" or "Miss".
        dataset['Name'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
    for dataset in data_full:
        # Group the uncommon titles under a single 'Rare' label.
        dataset['Name'] = dataset['Name'].replace(['Lady', 'Countess', 'Capt', 'Mlle', 'Ms', 'Mme', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    for dataset in data_full:
        dataset['Sex'] = dataset['Sex'].map({"male": 0, "female": 1})
        dataset['Embarked'] = dataset['Embarked'].map({"S": 0, "C": 1, "Q": 2})
        dataset['Fare'] = dataset['Fare'].astype(int)
    for dataset in data_full:
        # Encode each title as an integer; titles not listed here become NaN.
        dataset['Name'] = dataset['Name'].map({"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5})
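To make the title extraction concrete, here is a small sketch that runs the same regular expression on a few names written in the dataset's `Surname, Title. Given names` style. The `title_codes` dictionary and the fallback code 5 are assumptions that follow the encoding scheme of this step.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# The capture group grabs the run of letters right before the first period.
# expand=False returns a Series rather than a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Hypothetical encoding for illustration: common titles get their own code,
# anything else falls back to 5 ("Rare").
title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}
encoded = titles.map(title_codes).fillna(5).astype(int)

print(titles.tolist())
print(encoded.tolist())
```

Checking `titles.value_counts()` on the real data before choosing the mapping is a quick way to see which titles are frequent enough to deserve their own code.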
5. Filling the remaining empty values
    for dataset in data_full:
        dataset['Name'] = dataset['Name'].fillna(5)  # titles missing from the mapping count as 'Rare'
6. Converting the data type
    for dataset in data_full:
        dataset['Name'] = dataset['Name'].astype(int)
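7. Dropping unused data and filling missing ages
    for dataset in data_full:
        dataset.drop(['Ticket'], axis=1, inplace=True)
    # Age was already binned in step 3, so the median here is a bin index.
    # The training-set median is used for both sets.
    train['Age'] = train['Age'].fillna(value=train['Age'].median())
    test['Age'] = test['Age'].fillna(value=train['Age'].median())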
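8. Combining features
    for dataset in data_full:
        dataset['SibSp'] = dataset['SibSp'] + dataset['Parch']  # siblings/spouses plus parents/children
    for dataset in data_full:
        dataset.drop('Parch', axis=1, inplace=True)  # drop in place so train and test themselves change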
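One detail in this step is easy to get wrong: inside a `for` loop, assigning the result of `drop` back to the loop variable only rebinds that variable and leaves the frames in the list untouched. The sketch below shows the in-place variant on a made-up toy frame; the values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 2, 1]})
frames = [toy]

for df in frames:
    # Fold both family counts into a single column.
    df["SibSp"] = df["SibSp"] + df["Parch"]
    # `df = df.drop("Parch", axis=1)` would only rebind the loop variable;
    # an in-place drop changes the frame that the list refers to.
    df.drop(columns="Parch", inplace=True)

print(toy)
```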

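(2) Model selection and optimization

This article uses stacking to combine several models. After repeatedly adjusting which sub-models and which second-layer model to use, I ended up with a reasonably good model that ranked in the top 8% on Kaggle.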
[Figure: Kaggle score screenshot]
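The article does not show how the candidate sub-models were compared. One common approach is k-fold cross-validation, sketched here on synthetic data from `make_classification` as a stand-in for the Titanic features. The candidate list, the hyperparameters, and the data are all assumptions for illustration, not the choices actually made in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical candidates; swap models in and out here.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Cross-validated scores are usually a steadier guide than the public leaderboard when deciding which sub-models to keep.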
1. Importing the packages
    import numpy as np
    from sklearn.linear_model import LogisticRegression  # logistic regression
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC  # support vector machine
    from sklearn.naive_bayes import MultinomialNB  # naive Bayes
    from sklearn.ensemble import RandomForestClassifier  # random forest
    from sklearn.tree import DecisionTreeClassifier  # decision tree
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import ExtraTreeClassifier
    from sklearn.ensemble import GradientBoostingClassifier  # gradient-boosted decision trees
    from xgboost import XGBClassifier  # XGBoost
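2. Implementing the model-fusion function
    # Initial training set
    # (selected_features is the list of feature columns to use; it is not defined in this article)
    X_train = train[selected_features].values
    y_train = train['Survived'].values.ravel()
    # Initial test set
    X_test = test[selected_features].values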
    # Stacking code
    from sklearn.model_selection import KFold
    ntrain = train.shape[0]
    ntest = test.shape[0]
    kf = KFold(n_splits=5)
    def get_oof(clf, X_train, y_train, X_test):
        oof_train = np.zeros((ntrain,))      # out-of-fold predictions for the training set
        oof_test = np.zeros((ntest,))        # test-set predictions averaged over the folds
        oof_test_skf = np.empty((5, ntest))  # one row of test-set predictions per fold
        for i, (train_index, test_index) in enumerate(kf.split(X_train)):
            kf_X_train = X_train[train_index]
            kf_y_train = y_train[train_index]
            kf_X_test = X_train[test_index]
            clf.fit(kf_X_train, kf_y_train)

            # Predict only the held-out fold, so no row is predicted by a model that saw it.
            oof_train[test_index] = clf.predict(kf_X_test)
            oof_test_skf[i, :] = clf.predict(X_test)
        oof_test[:] = oof_test_skf.mean(axis=0)
        return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
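To see what the out-of-fold function produces without the Titanic files, the sketch below runs a self-contained variant on synthetic data. The function name `out_of_fold`, the `n_splits` parameter, and the data from `make_classification` are assumptions introduced for illustration; the logic follows the same fit-on-four-folds, predict-the-fifth pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def out_of_fold(clf, X_train, y_train, X_test, n_splits=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    kf = KFold(n_splits=n_splits)
    oof_train = np.zeros(X_train.shape[0])
    test_preds = np.empty((n_splits, X_test.shape[0]))
    for i, (fit_idx, holdout_idx) in enumerate(kf.split(X_train)):
        clf.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        oof_train[holdout_idx] = clf.predict(X_train[holdout_idx])
        test_preds[i, :] = clf.predict(X_test)
    # Averaging hard 0/1 labels gives the fraction of folds that voted 1.
    return oof_train.reshape(-1, 1), test_preds.mean(axis=0).reshape(-1, 1)


# Synthetic stand-in: 100 training rows and 20 test rows.
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
X_tr, y_tr, X_te = X[:100], y[:100], X[100:]

oof_tr, oof_te = out_of_fold(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te)
print(oof_tr.shape, oof_te.shape)
```

Each base model therefore contributes one column to the second-layer training set and one column to the second-layer test set. The test column holds vote fractions rather than hard labels.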

    # Instantiating the models
    lsvc = LinearSVC()  # 1 support vector machine
    lgre = LogisticRegression(max_iter=10000)  # logistic regression
    xgbc = XGBClassifier()  # XGBoost
    dtr = ExtraTreeClassifier()  # 2 extremely randomized tree (a decision tree variant)
    ran = RandomForestClassifier()  # 3 random forest
    ada = AdaBoostClassifier()  # 4 AdaBoost
    grad = GradientBoostingClassifier()  # 5 gradient boosting
    # Calling the fusion function
    lsvc_oof_train, lsvc_oof_test = get_oof(lsvc, X_train, y_train, X_test)
    dtr_oof_train, dtr_oof_test = get_oof(dtr, X_train, y_train, X_test)
    ran_oof_train, ran_oof_test = get_oof(ran, X_train, y_train, X_test)
    ada_oof_train, ada_oof_test = get_oof(ada, X_train, y_train, X_test)
    grad_oof_train, grad_oof_test = get_oof(grad, X_train, y_train, X_test)
    # New training and test sets for the second layer: one column per base model
    x_train1 = np.concatenate((lsvc_oof_train, dtr_oof_train, ran_oof_train, ada_oof_train, grad_oof_train), axis=1)
    x_test1 = np.concatenate((lsvc_oof_test, dtr_oof_test, ran_oof_test, ada_oof_test, grad_oof_test), axis=1)

    # Parameter settings for the XGBoost model
    gbm = XGBClassifier(
        # learning_rate=0.02,
        n_estimators=2000,
        max_depth=4,
        min_child_weight=2,
        gamma=0.9,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=-1,
        scale_pos_weight=1)
    # Fitting the model
    gbm.fit(x_train1, y_train)
    pre = gbm.predict(x_test1)
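For comparison, scikit-learn ships a built-in `StackingClassifier` that performs the out-of-fold step internally. The sketch below is an alternative formulation and not the pipeline used in this article. It runs on synthetic data, uses only two base models, and swaps in `LogisticRegression` as the second-layer model instead of XGBoost. By default it also feeds the second layer with class probabilities or decision scores where the base model provides them, rather than hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("ran", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("grad", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-layer model
    cv=5,  # folds used to build the out-of-fold features
)
stack.fit(X_fit, y_fit)

holdout_accuracy = stack.score(X_hold, y_hold)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```

The hand-written version gives full control over what the second layer sees, while the built-in class is shorter and harder to get wrong.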
    # Saving the results
    pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pre}).set_index('PassengerId').to_csv('desktop/titanic/202038a.csv')
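Before uploading, it can help to confirm that the file has exactly the two columns the Titanic competition expects, `PassengerId` and `Survived`. The sketch below writes a tiny made-up submission to an in-memory buffer instead of disk; the three IDs and predictions are placeholders, not real results.

```python
import io

import pandas as pd

# Placeholder IDs and predictions, for illustration only.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file, so only the two columns are written.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Using `index=False` is equivalent in effect to setting `PassengerId` as the index before writing: either way the header line reads `PassengerId,Survived`.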