Training a Property Valuation Model and Its Prediction Results

0. Downloading the dataset

The source data for the valuation model in this article is Xiamen housing-price data. Download link: https://pan.baidu.com/s/1vOact6MsyZZlTSxjmMqTbw  password: 8zg6
After downloading, the file looks like this:

[Figure: screenshot of the opened data file]

As the figure shows, the data has already been lightly processed; only minor adjustments are needed before it can be fed into model training.

1. Comparing the MLPR and GBR models

df_y = df['unitPrice'] selects the unitPrice column of the DataFrame;
y = df_y.values yields a numpy.ndarray of shape (21935,), i.e. a one-dimensional array of length 21935;
df_x = df.drop(['unitPrice'],axis=1) keeps every column of the DataFrame except unitPrice;
x = df_x.values yields a numpy.ndarray of shape (21935,120), i.e. a 21935×120 two-dimensional array.
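As a quick sketch of these conversions (using a hypothetical three-row frame in place of the real 21935-row dataset):

```python
import pandas as pd

# hypothetical toy frame standing in for the real 21935-row dataset
df = pd.DataFrame({'unitPrice': [30000.0, 45000.0, 52000.0],
                   'area': [80.0, 95.0, 120.0],
                   'rooms': [2, 3, 4]})

df_y = df['unitPrice']                 # the target column as a Series
y = df_y.values                        # 1-D ndarray, shape (3,)
df_x = df.drop(['unitPrice'], axis=1)  # every column except the target
x = df_x.values                        # 2-D ndarray, shape (3, 2)

print(y.shape, x.shape)  # (3,) (3, 2)
```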
The data is standardized with sklearn's preprocessing.StandardScaler(): the scaler is fit on the training set, and the same fitted scaler is then used to transform the test set.
MLPRegressor() builds the multilayer-perceptron regression model, which is trained on the training set and scored on the test set (score returns the coefficient of determination R²).
GradientBoostingRegressor() builds the ensemble regression model, which is likewise trained on the training set and scored on the test set.

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd

df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
df_y = df['unitPrice']
df_x = df.drop(['unitPrice'],axis=1)
x = df_x.values
y = df_y.values

train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                  random_state=33)
ss_x = preprocessing.StandardScaler()
train_x1 = ss_x.fit_transform(train_x)
test_x1 = ss_x.transform(test_x)

ss_y = preprocessing.StandardScaler()
train_y1 = ss_y.fit_transform(train_y.reshape(-1,1))
test_y1 = ss_y.transform(test_y.reshape(-1,1))

model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x1,train_y1.ravel())
mlp_score = model_mlp.score(test_x1,test_y1.ravel())
print("sklearn MLP regressor score",mlp_score)

model_gbr = GradientBoostingRegressor()
model_gbr.fit(train_x1,train_y1.ravel())
gbr_score = model_gbr.score(test_x1,test_y1.ravel())
print("sklearn ensemble regressor score",gbr_score)

The printed result is:

sklearn MLP regressor score 0.683941816792
sklearn ensemble regressor score 0.762351806857

For a first pass at the model, this result is acceptable.

2. Handling outliers

[Figure: unit-price distribution]

The figure shows that some listings have unit prices of several hundred thousand or even over a million; such outliers need to be removed.
sklearn does not appear to offer a ready-made function for this, so we write our own. The code below defines a cleanOutlier function that deletes the outliers. First, be clear on the quartile concepts: with 100 sorted values, the median is roughly the 50th value, the lower quartile roughly the 25th, and the upper quartile roughly the 75th.
The interquartile range (IQR) is the upper quartile minus the lower quartile; for example, with an upper quartile of 900 and a lower quartile of 700, the IQR is 200.
Outliers are values that are far too large or too small. In this removal method, values below (lower quartile − 3 × IQR) or above (upper quartile + 3 × IQR) are judged to be outliers and deleted. With an upper quartile of 900 and a lower quartile of 700, values below 100 or above 1500 are removed.
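The same rule can be sketched in vectorized form with np.percentile; this is an illustrative alternative with made-up numbers, not the cleanOutlier implementation used below:

```python
import numpy as np

# toy unit prices, including one absurdly large value
y = np.array([650., 700., 750., 800., 850., 900., 950., 500000.])

q1, q3 = np.percentile(y, [25, 75])               # lower / upper quartiles
iqr = q3 - q1                                     # interquartile range
keep = (y >= q1 - 3 * iqr) & (y <= q3 + 3 * iqr)  # the 3x-IQR rule
print(y[keep])  # the 500000 outlier is dropped
```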
df.values converts the DataFrame to an ndarray; model training generally expects floats, so df.values.astype('float') produces a float-valued matrix.
After cleanOutlier removes the outliers, column 0 is assigned to the variable y and columns 1 onward to x.
Because x is mostly one-hot encoded, it needs no further standardization.

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd

def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()] # sort rows by the target column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column] :
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    recordLow = low     # fallbacks in case the scans below never break
    recordHigh = high
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal kept
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data

df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]

train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                  random_state=33)

ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))

model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn MLP regressor score",mlp_score)

model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn ensemble regressor score",gbr_score)

The printed result is:

sklearn MLP regressor score 0.795028773029
sklearn ensemble regressor score 0.767157061712

After this second adjustment, the MLP regressor's score improves markedly, while the ensemble regressor's barely changes. Overall, the outlier removal was a success.

3. Normalization

Normalization here means taking the natural logarithm of y and using the resulting column as the new y.
It is done with one loop: for i in range(len(y)): y[i] = math.log(y[i])
In principle, no further standardization is needed after the log transform, but experiments show that standardizing both x and y still improves the score.
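The loop can also be written as one vectorized NumPy call; np.exp inverts np.log exactly, which is how a prediction made in log space would be mapped back to a price (illustrative sketch with made-up values):

```python
import numpy as np

y = np.array([30000., 45000., 52000.])  # hypothetical unit prices

y_log = np.log(y)       # same result as the element-wise math.log loop
y_back = np.exp(y_log)  # invert the transform to recover the prices

print(np.allclose(y_back, y))  # True
```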

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math

def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()] # sort rows by the target column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column] :
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    recordLow = low     # fallbacks in case the scans below never break
    recordHigh = high
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal kept
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data

df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])
    
train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                  random_state=33)

ss_x = preprocessing.StandardScaler()
train_x = ss_x.fit_transform(train_x)
test_x = ss_x.transform(test_x)

ss_y = preprocessing.StandardScaler()
train_y = ss_y.fit_transform(train_y.reshape(-1,1))
test_y = ss_y.transform(test_y.reshape(-1,1))

model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
model_mlp.fit(train_x,train_y.ravel())
mlp_score = model_mlp.score(test_x,test_y.ravel())
print("sklearn MLP regressor score",mlp_score)

model_gbr = GradientBoostingRegressor(learning_rate=0.1)
model_gbr.fit(train_x,train_y.ravel())
gbr_score = model_gbr.score(test_x,test_y.ravel())
print("sklearn ensemble regressor score",gbr_score)

The printed result is:

sklearn MLP regressor score 0.831448099649
sklearn ensemble regressor score 0.780133207248

Compared with the previous round, the scores improve again; another successful adjustment.

4. Cross-validation

This section uses the KFold class from sklearn.model_selection to pick the training and test sets.
kf = KFold(n_splits=5,shuffle=True) initializes the KFold object.
As the line for train_index,test_index in kf.split(x): shows, kf.split(x) yields n_splits (here 5) pairs; each pair is a tuple holding the index arrays of the training set and the test set. (Strictly, kf.split(x) returns a generator, not a list.)
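A minimal sketch on ten toy samples makes the yielded index arrays visible (each fold holds out 10/5 = 2 samples):

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_index, test_index in kf.split(x):
    # every sample appears in exactly one test fold
    print(len(train_index), len(test_index))  # 8 2 on every fold
```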

from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pandas as pd
import math
from sklearn.model_selection import KFold

def cleanOutlier(data,column,mul=3):
    data = data[data[:,column].argsort()] # sort rows by the target column
    l = len(data)
    low = int(l/4)
    high = int(l/4*3)
    lowValue = data[low,column]
    highValue = data[high,column]
    print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
    if lowValue - mul * (highValue - lowValue) < data[0,column] :
        delLowValue = data[0,column]
    else:
        delLowValue = lowValue - mul * (highValue - lowValue)
    if highValue + mul * (highValue - lowValue) > data[-1,column]:
        delHighValue = data[-1,column]
    else:
        delHighValue = highValue + mul * (highValue - lowValue)
    print("removing values in column {} below {} or above {}".format(column,\
          delLowValue,delHighValue))
    recordLow = low     # fallbacks in case the scans below never break
    recordHigh = high
    for i in range(low):
        if data[i,column] >= delLowValue:
            recordLow = i
            break
    for i in range(len(data)-1,high,-1):
        if data[i,column] <= delHighValue:
            recordHigh = i
            break
    # report what the outlier removal kept
    print("original matrix has {} rows".format(len(data)),end=',')
    print("keeping rows {} to {}".format(recordLow,recordHigh),end=',')
    data = data[recordLow:recordHigh+1]
    print("after removing outliers in column {}, {} rows remain".format(column,\
          recordHigh+1-recordLow))
    return data

df = pd.read_excel("數(shù)據(jù)處理結(jié)果.xlsx")
data = df.values.astype('float')
data = cleanOutlier(data,0)
x = data[:,1:]
y = data[:,0]
for i in range(len(y)):
    y[i] = math.log(y[i])

kf = KFold(n_splits=5,shuffle=True)

for train_index,test_index in kf.split(x):
    train_x = x[train_index]    
    test_x = x[test_index]
    train_y = y[train_index]
    test_y = y[test_index]
 
    ss_x = preprocessing.StandardScaler()
    train_x = ss_x.fit_transform(train_x)
    test_x = ss_x.transform(test_x)
    
    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y = ss_y.transform(test_y.reshape(-1,1))
    
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x,train_y.ravel())
    mlp_score = model_mlp.score(test_x,test_y.ravel())
    print("sklearn MLP regressor score",mlp_score)
    
    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x,train_y.ravel())
    gbr_score = model_gbr.score(test_x,test_y.ravel())
    print("sklearn ensemble regressor score",gbr_score)

The printed results are:

sklearn MLP regressor score 0.8427725943791746
sklearn ensemble regressor score 0.7915684454283963
sklearn MLP regressor score 0.8317854959807023
sklearn ensemble regressor score 0.7705608099963528
sklearn MLP regressor score 0.8369280445356948
sklearn ensemble regressor score 0.7851823734454625
sklearn MLP regressor score 0.8364897250676866
sklearn ensemble regressor score 0.7833199279062474
sklearn MLP regressor score 0.8335782493590231
sklearn ensemble regressor score 0.7722233325504181
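A fold-by-fold printout like this is usually summarized by averaging the per-fold scores; using the five MLP scores above:

```python
import numpy as np

# the five per-fold MLP scores printed above
mlp_scores = [0.8427725943791746, 0.8317854959807023, 0.8369280445356948,
              0.8364897250676866, 0.8335782493590231]
print(round(float(np.mean(mlp_scores)), 4))  # 0.8363
```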
