0. Downloading the dataset
The source data for this housing-valuation model is Xiamen housing-price data. Download link: https://pan.baidu.com/s/1vOact6MsyZZlTSxjmMqTbw password: 8zg6
The downloaded file looks like this when opened:

The figure shows that the data has already been lightly preprocessed; only minor adjustments are needed before it can be used for model training.
1. Comparing the MLPR and GBR models
df_y = df['unitPrice'] selects the unitPrice column of the DataFrame;
y = df_y.values gives a numpy.ndarray of shape (21935,), i.e. a one-dimensional array of length 21935;
df_x = df.drop(['unitPrice'],axis=1) gives the DataFrame with every column except unitPrice;
x = df_x.values gives a numpy.ndarray of shape (21935,120), i.e. a 21935×120 two-dimensional array.
The data is standardized with sklearn's preprocessing.StandardScaler(): the scaler is fit on the training set, and the same fitted scaler is then used to transform the test set.
Call MLPRegressor() to obtain a multilayer-perceptron regression model, train it on the training set, and score it on the test set.
Call GradientBoostingRegressor() to obtain a gradient-boosting (ensemble) regression model, train it on the training set, and score it on the test set.
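A tiny sketch of the fit-on-train / transform-test pattern described above (toy numbers, not the real dataset): the scaler's mean and scale are learned from the training data only, then reused on the test data.

```python
import numpy as np
from sklearn import preprocessing

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

ss = preprocessing.StandardScaler()
train_s = ss.fit_transform(train)   # fit: learns mean and std from train only
test_s = ss.transform(test)         # reuse the training statistics on test
```

Note that test_s is computed with the training mean and standard deviation, so a test value outside the training range maps to a z-score beyond ±1.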
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
df_y = df['unitPrice']
df_x = df.drop(['unitPrice'],axis=1)
x = df_x.values
y = df_y.values
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_x = preprocessing.StandardScaler()
train_x1 = ss_x.fit_transform(train_x)
test_x1 = ss_x.transform(test_x)
ss_y = preprocessing.StandardScaler()
train_y1 = ss_y.fit_transform(train_y.reshape(-1,1))
test_y1 = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x1,train_y1.ravel())
mlp_score = model_mlp.score(test_x1,test_y1.ravel())
print("sklearn MLP regression model score",mlp_score)
model_gbr = GradientBoostingRegressor()
model_gbr.fit(train_x1,train_y1.ravel())
gbr_score = model_gbr.score(test_x1,test_y1.ravel())
print("sklearn gradient boosting regression model score",gbr_score)
The output is:
sklearn MLP regression model score 0.683941816792
sklearn gradient boosting regression model score 0.762351806857
For a first pass at the model, this result is acceptable.
2. Outlier removal

The figure shows that some homes have unit prices in the hundreds of thousands or even millions; such outliers need to be removed.
sklearn does not appear to provide a ready-made function for removing outliers, so we write our own. The code below defines a cleanOutlier function whose job is to delete outliers. First, the lower and upper quartiles: with 100 values sorted in ascending order, the median is the 50th value, the lower quartile is the 25th value, and the upper quartile is the 75th value.
The interquartile range (IQR) is the upper quartile minus the lower quartile; for example, with an upper quartile of 900 and a lower quartile of 700, the IQR is 200.
Outliers are values that are abnormally large or small. In this removal scheme, values below (lower quartile − 3 × IQR) or above (upper quartile + 3 × IQR) are treated as outliers and deleted. With an upper quartile of 900 and a lower quartile of 700, values below 100 or above 1500 are removed.
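As a sketch of that rule (illustrative values only), the same bounds can be computed with numpy's np.percentile. Note that numpy interpolates between values by default, so its quartiles may differ slightly from the index-based quartiles used in the cleanOutlier function below.

```python
import numpy as np

# Illustrative data: six ordinary prices and one extreme outlier
prices = np.array([650.0, 700.0, 750.0, 800.0, 850.0, 900.0, 5000.0])
q1 = np.percentile(prices, 25)        # lower quartile
q3 = np.percentile(prices, 75)        # upper quartile
iqr = q3 - q1                         # interquartile range
low_bound = q1 - 3 * iqr              # anything below is an outlier
high_bound = q3 + 3 * iqr             # anything above is an outlier
kept = prices[(prices >= low_bound) & (prices <= high_bound)]
print(kept)                           # the 5000.0 entry is dropped
```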
df.values converts a DataFrame to an ndarray; model training generally expects floating-point values, so df.values.astype('float') is used to obtain a float matrix.
Remove the outliers with cleanOutlier, then assign column 0 to the variable y and columns 1 through the last to x.
Because x consists mostly of one-hot encodings, it is not standardized here.
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]  # sort the rows by the given column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {} upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("deleting values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal did
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn MLP regression model score",mlp_score)
model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn gradient boosting regression model score",gbr_score)
The output is:
sklearn MLP regression model score 0.795028773029
sklearn gradient boosting regression model score 0.767157061712
After this second adjustment, the MLP regression model's score improves noticeably, while the gradient boosting regression model's score barely changes. Overall, the outlier removal was a success.
3. Normalization
Normalization here means taking the natural logarithm (base e) of y and assigning the resulting values back to y.
The transform is done with a loop: for i in range(len(y)): y[i] = math.log(y[i])
In principle no further standardization should be needed after the log transform, but experiments show that standardizing both x and y still improves the score.
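The element-wise loop above can also be written as a single vectorized call, np.log, which produces the same values (a small sketch with made-up numbers):

```python
import math
import numpy as np

y = np.array([100.0, 1000.0, 10000.0])

# Loop version, as used in the text
y_loop = y.copy()
for i in range(len(y_loop)):
    y_loop[i] = math.log(y_loop[i])

# Vectorized equivalent
y_vec = np.log(y)
```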
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]  # sort the rows by the given column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {} upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("deleting values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal did
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
random_state=33)
ss_x = preprocessing.StandardScaler()
train_x = ss_x.fit_transform(train_x)
test_x = ss_x.transform(test_x)
ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))
model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn MLP regression model score",mlp_score)
model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn gradient boosting regression model score",gbr_score)
The output is:
sklearn MLP regression model score 0.831448099649
sklearn gradient boosting regression model score 0.780133207248
Compared with the previous attempt the scores improved again, so this adjustment was also a success.
4. Cross-validation
Training and test sets are selected with the KFold class from sklearn.model_selection.
kf = KFold(n_splits=5,shuffle=True) initializes the KFold object.
The line for train_index,test_index in kf.split(x): shows that kf.split(x) yields n_splits (here 5) tuples, each holding the training-set indices and the test-set indices for one fold. (Strictly, kf.split(x) returns a generator rather than a list.)
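A minimal sketch of KFold's behavior on toy data (10 samples, illustrative only): since kf.split returns a generator, it is materialized with list() here so the folds can be inspected.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(10).reshape(10, 1)      # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# kf.split yields one (train_index, test_index) pair per fold
splits = list(kf.split(x_demo))
for train_index, test_index in splits:
    print(len(train_index), len(test_index))   # 8 train, 2 test per fold
```

Every sample appears in exactly one test fold, so the five test folds together cover the whole dataset.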
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
from sklearn.model_selection import KFold
def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()]  # sort the rows by the given column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {} upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column]:
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("deleting values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal did
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data
df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])
kf = KFold(n_splits=5,shuffle=True)
for train_index,test_index in kf.split(x):
    train_x = x[train_index]
    test_x = x[test_index]
    train_y = y[train_index]
    test_y = y[test_index]
    ss_x = preprocessing.StandardScaler()
    train_x = ss_x.fit_transform(train_x)
    test_x = ss_x.transform(test_x)
    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y = ss_y.transform(test_y.reshape(-1,1))
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x,train_y.ravel())
    mlp_score = model_mlp.score(test_x,test_y.ravel())
    print("sklearn MLP regression model score",mlp_score)
    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x,train_y.ravel())
    gbr_score = model_gbr.score(test_x,test_y.ravel())
    print("sklearn gradient boosting regression model score",gbr_score)
The output is:
sklearn MLP regression model score 0.8427725943791746
sklearn gradient boosting regression model score 0.7915684454283963
sklearn MLP regression model score 0.8317854959807023
sklearn gradient boosting regression model score 0.7705608099963528
sklearn MLP regression model score 0.8369280445356948
sklearn gradient boosting regression model score 0.7851823734454625
sklearn MLP regression model score 0.8364897250676866
sklearn gradient boosting regression model score 0.7833199279062474
sklearn MLP regression model score 0.8335782493590231
sklearn gradient boosting regression model score 0.7722233325504181
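The per-fold scores above fluctuate from fold to fold; a common way to summarize a K-fold run is to collect each fold's score in a list inside the loop and report the mean. A sketch using the five MLP scores printed above (the list name is illustrative):

```python
import numpy as np

# The five MLP fold scores printed above
mlp_scores = [0.8427725943791746, 0.8317854959807023, 0.8369280445356948,
              0.8364897250676866, 0.8335782493590231]
mean_score = np.mean(mlp_scores)
print(mean_score)
```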