Task4 建模調(diào)參

本部分為特征工程之后的階段:創(chuàng)建適合的算法模型并對模型進(jìn)行參數(shù)調(diào)節(jié)。

0讀取數(shù)據(jù):

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

sample_feature = pd.read_csv('data_for_tree.csv')
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model','brand']]

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names]
train_y = train['price']

1. 線性回歸模型

import datetime
sample_feature = sample_feature.reset_index(drop=True)

split_point = len(sample_feature) // 5 * 4
train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))

2. 五折交叉驗證

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper
#使用線性回歸模型,對未處理標(biāo)簽的特征數(shù)據(jù)進(jìn)行五折交叉驗證(Error 1.36)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
#使用線性回歸模型,對處理過標(biāo)簽的特征數(shù)據(jù)進(jìn)行五折交叉驗證(Error 0.19)
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

3. 線性模型的嵌入式特征選擇

在過濾式和包裹式特征選擇方法中,特征選擇過程與學(xué)習(xí)器訓(xùn)練過程有明顯的分別。而嵌入式特征選擇在學(xué)習(xí)器訓(xùn)練過程中自動地進(jìn)行特征選擇。嵌入式選擇最常用的是L1正則化與L2正則化。在對線性回歸模型加入兩種正則化方法后,他們分別變成了嶺回歸與Lasso回歸。

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
#對三種方法的效果對比
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

4. 非線性模型

除了線性模型以外,還有許多我們常用的非線性模型如下,在此篇幅有限不再一一講解原理。我們選擇了部分常用模型與線性模型進(jìn)行效果比對。

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')
#效果對比
result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

可以看到隨機(jī)森林模型在每一個fold中均取得了更好的效果。

5. 模型調(diào)參

在此我們介紹了三種常用的調(diào)參方法如下:

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容