Preparing the data

Training and test data can come from many places, such as a database, CSV files, or some other storage. To keep things simple, you can write small scripts that download and parse the data. In this article we start with such a script:

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml/master/'
HOUSING_PATH = 'chapter02/datasets/housing'
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing' + '/housing.tgz'


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the local directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    # Download the archive, then unpack housing.csv next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


fetch_housing_data()

After running the code above, the data is available locally. Next, we load it with pandas:

import pandas as pd


def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


# Load the dataset; the rest of the article works on this DataFrame
housing = load_housing_data()

Previewing the data

pandas parses the data into a DataFrame. Calling head() on it returns the first 5 rows by default:

[Figure missing: output of housing.head()]

There are 10 attributes in total. In these 5 rows every value is present and nothing is missing. info() gives an overview of the whole dataset:

[Figure missing: output of housing.info()]

There are 20,640 rows in total, which is small by ML standards. Only the total_bedrooms attribute has missing values.

Looking at the data, every attribute except ocean_proximity is numeric, and numeric values are easy to feed to ML algorithms. Looking again at the ocean_proximity values in the 5 rows above, we can infer that the attribute takes one of a few categories, a bit like an enum. value_counts() shows how many rows have each value:

[Figure missing: output of housing["ocean_proximity"].value_counts()]
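
Since the screenshot is not available, here is a minimal sketch of what value_counts() returns. The toy Series below is made up for illustration and is not the real ocean_proximity column:

```python
import pandas as pd

# Toy stand-in for the ocean_proximity column (values are made up)
proximity = pd.Series(
    ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR BAY"],
    name="ocean_proximity",
)

# Absolute counts, most frequent category first
counts = proximity.value_counts()
# The same information as fractions of the total
shares = proximity.value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data the call would be housing["ocean_proximity"].value_counts().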

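In addition, describe() shows summary statistics for each numeric column:

[Figure missing: output of housing.describe()]

Glossary:

Name	Meaning
count	Number of non-null values
mean	Mean
min	Minimum
max	Maximum
std	Standard deviation
25%/50%/75%	Percentiles: the value below which that share of the observations falls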
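
To make the glossary concrete, here is a small sketch on made-up numbers. It shows that count ignores missing values and that the 50% row is simply the median:

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])

summary = values.describe()
print(summary)

# count only includes the non-null entries
non_null = summary["count"]
# The 50% percentile is the median of the non-null values
median = summary["50%"]
```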

For a more detailed look at each attribute, hist() plots a histogram per attribute:

%matplotlib inline 
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()

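[Figure missing: histograms of every numeric attribute]

Histograms make the distribution of values easy to see: the x axis is the value and the y axis is the count. For this dataset we noticed the following:

- median_income is not expressed as an actual income. It is a computed value that ranges from about 0.5 to 15. Knowing how a value was computed matters too.
- Some data is capped. housing_median_age and median_house_value have an obvious upper limit: the tall bar at the right edge of their histograms shows that values were capped there. This affects ML algorithms. For example, should predictions be subject to the same cap? If not, the capped data needs further handling:
  - collect the true values for the capped rows, or
  - remove the capped rows.
- The attributes have very different ranges. This is addressed later in the article.
- Many of the histograms are tail-heavy. This is also addressed later.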
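
A capped attribute can also be spotted without a plot. The sketch below uses synthetic values and an assumed cap of 500,000 (neither comes from the real dataset): it measures how many rows sit exactly at the maximum and then applies the second option, removing those rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic prices, clipped at an assumed cap of 500,000
raw = rng.normal(loc=300_000, scale=150_000, size=1_000)
capped = pd.Series(np.clip(raw, 15_000, 500_000), name="median_house_value")

# A large share of rows sitting exactly at the maximum suggests a cap
cap = capped.max()
share_at_cap = (capped == cap).mean()
print("max value:", cap, "share of rows at max:", share_at_cap)

# Option 2 from the list above: drop the capped rows
uncapped = capped[capped < cap]
```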

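Creating a test set

Creating the test set helps us understand the data further, which is useful when choosing a machine learning algorithm. The simplest approach is to randomly pick about 20% of the data as the test set.

The drawback of a plain random function is that every run of the program produces a different split. To deal with this, we give each row a unique identifier, hash the identifier, and put the row in the test set when the last byte of the hash is less than or equal to 51 (about 20% of 256).

The original data has no such identifier, so we call reset_index() to add an index to every row and use it as the identifier.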

import hashlib
import numpy as np


def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio


def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Add an index column to housing
housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
print(len(train_set), 'train +', len(test_set), "test")

# An id can also be built this way
# housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
# train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

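Using the index as the identifier has a drawback: new data must always be appended at the end of the dataset, and rows in the middle must never be deleted. The way around this is to compute the identifier from a combination of other attributes.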
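
The point of hashing is stability. The sketch below is a simplified, standard-library-only variant of the check (the helper name in_test_set is ours, and the identifier is packed into 8 bytes by hand). It shows that rows already assigned to the test set stay there after new rows are appended:

```python
import hashlib


def in_test_set(identifier, test_ratio=0.2):
    # Hash the identifier as an 8-byte integer and look at the last byte (0..255)
    raw = int(identifier).to_bytes(8, "little", signed=True)
    last_byte = hashlib.md5(raw).digest()[-1]
    return last_byte < 256 * test_ratio


# First version of the dataset: ids 0..999
test_v1 = {i for i in range(1000) if in_test_set(i)}
# Later version: 200 new rows appended at the end
test_v2 = {i for i in range(1200) if in_test_set(i)}

# The old rows keep their assignment; only new rows can join the test set
print(len(test_v1), len(test_v2))
print(test_v1 <= test_v2)
```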

sklearn also provides a function for creating a test set:

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

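Random sampling works well when the dataset is large. If it is not large enough, random sampling can introduce a significant sampling bias. For our data we use stratified sampling instead. Suppose you ask an expert which attribute matters most and the answer is median_income: we then consider stratifying on median_income.

[Figure missing: histogram of median_income]

Looking at the figure above, the median_income values are concentrated in a few ranges. With so few well-populated ranges, a purely random sample could easily misrepresent them, which is another hint that random sampling is not a good fit here.

We add a new attribute that marks which stratum each row belongs to. All categories above 5.0 are merged into category 5.0.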

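# Random sampling can be biased in some cases, so consider stratified sampling instead.
# Each stratum needs enough instances, and there should not be too many strata.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Merge every category above 5 into category 5 (assign the result instead of using inplace on a column)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
print(housing.head(10))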

[Figure missing: output of housing.head(10) with the new income_cat column]

Next, we use sklearn to do stratified sampling based on income_cat.

# Use sklearn's StratifiedShuffleSplit class for stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit


split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
    
print(housing["income_cat"].value_counts() / len(housing))

# Drop income_cat once the training and test sets have been created
for s in (strat_train_set, strat_test_set):
    s.drop(["income_cat"], axis=1, inplace=True)
    
print(strat_train_set.head(10))

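After sampling, the code above drops the income_cat attribute. The result is:
[Figure missing: output of strat_train_set.head(10)]

If we compare the test set with the full dataset, we get the table below. The error of stratified sampling is clearly smaller than that of random sampling.

[Figure missing: table comparing the full dataset with the random and stratified test sets]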
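
Because the table itself is not available, here is a sketch of how such a comparison can be computed. It uses synthetic income categories with made-up frequencies, and train_test_split with its stratify argument as a shortcut for the stratified split, so the numbers will not match the original table:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic income categories with unequal, made-up frequencies
data = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=2_000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# Purely random split versus a split stratified on income_cat
_, random_test = train_test_split(data, test_size=0.2, random_state=42)
_, strat_test = train_test_split(data, test_size=0.2, random_state=42,
                                 stratify=data["income_cat"])


def proportions(frame):
    # Share of each income category in the given frame
    return frame["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": proportions(data),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
# Relative error of each test set against the full dataset, in percent
comparison["random_err_%"] = 100 * (comparison["random"] / comparison["overall"] - 1)
comparison["strat_err_%"] = 100 * (comparison["stratified"] / comparison["overall"] - 1)
print(comparison)
```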

Discovering more about the data

Finding the information hidden in data calls for visualization. Our housing data contains latitude and longitude, so geography is a good place to start.

housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(20, 12))

[Figure missing: scatter plot of longitude against latitude]

Plotted like this, it is hard to see any pattern. Let us try adjusting the transparency of the points.

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(20, 12))

[Figure missing: the same scatter plot with alpha=0.1]

Now the high-density areas stand out at a glance. Are they related to price? Going one step further, we let the radius of each point represent the population and the color represent the price, then plot again:

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, 
             s=housing["population"]/100, label="population", 
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, figsize=(20, 12))
plt.legend()

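[Figure missing: scatter plot with point size for population and color for median_house_value]

This plot shows that price is strongly related to location and population density, and also to ocean_proximity. Intuitively, then, a clustering algorithm is worth considering.

Attribute combinations

A single attribute may not be very useful on its own, but recombining attributes in the right way can yield useful information.

In our example, total_rooms and total_bedrooms mean little by themselves, but combining them with population and households produces new, meaningful attributes.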

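# Some attributes are not useful on their own; the total number of bedrooms, for example, is not what we care about,
# so we derive new combined attributes from the existing ones
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
# numeric_only=True skips the text column ocean_proximity (required by recent pandas versions)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)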

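[Figure missing: correlations with median_house_value, sorted in descending order]

bedrooms_per_room is more strongly correlated with median_house_value than either total_rooms or total_bedrooms, which shows that combining attributes paid off.

Working with data is an iterative process.

Data cleaning

Before cleaning the data, we start again from a fresh copy of the training set and set the labels aside.

# Separate the labels from the features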
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

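Earlier in this article we noted that total_bedrooms has some missing values. There are generally three ways to handle this:

- Drop the rows that contain a missing value
- Drop the attribute entirely
- Fill in a replacement value

The third option is the usual choice: fill each missing value with a new one, such as the mean or the median.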
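
In plain pandas the three options look roughly like this. The tiny DataFrame is made up for illustration; on the real data the same calls would be applied to housing:

```python
import numpy as np
import pandas as pd

# Made-up rows; one total_bedrooms value is missing
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows where total_bedrooms is missing
dropped_rows = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
dropped_column = df.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with a replacement value, here the median
median = df["total_bedrooms"].median()
filled = df.assign(total_bedrooms=df["total_bedrooms"].fillna(median))

print(filled)
```

Whatever value is used for option 3, it should be computed on the training set and kept, so that the same value can fill gaps in the test set later.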

sklearn provides SimpleImputer specifically for this:

# ML algorithms cannot work with missing values, so they have to be handled first
# 1. Drop the row 2. Drop the whole attribute 3. Fill in the missing value
from sklearn.impute import SimpleImputer


# Use the median as the strategy
imputer = SimpleImputer(strategy="median")
# Drop the non-numeric attribute
housing_num = housing.drop("ocean_proximity", axis=1)
# fit only computes the statistic (here the median) of each attribute
imputer.fit(housing_num)
print(imputer.statistics_)
# transform fills in the missing values
X = imputer.transform(housing_num)

The imputer's fit() function only computes the median of each attribute and does nothing else. You can think of it as 'training' the imputer; transform() is then called to convert the data.

The medians are:

[Figure missing: output of imputer.statistics_]

Handling text attributes

In our example, ocean_proximity is a text attribute and has to be converted to numbers. sklearn provides LabelEncoder to convert such text values to numbers.

# sklearn also provides a way to convert non-numeric attribute values
from sklearn.preprocessing import LabelEncoder


encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)
print(encoder.classes_)

'''
[3 3 3 ... 1 1 1]
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
'''

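The problem with this approach is that ML algorithms assume that two nearby numbers are more similar than two distant ones. To solve this, sklearn provides OneHotEncoder, which maps each integer to a vector of 0s and 1s in which only the position of that category is 1 and all others are 0:

# The example above has a serious problem: an ML algorithm will treat 0 and 1 as close, although <1H OCEAN (0) and NEAR OCEAN (4) are actually more similar
# One-hot encoding solves this by setting only the category's own position to 1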
from sklearn.preprocessing import OneHotEncoder


encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
print(housing_cat_1hot.toarray())

'''
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 ...
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
 '''

sklearn also provides LabelBinarizer, which combines the two steps above into one:

# The label encoding and one-hot steps can also be merged into one
from sklearn.preprocessing import LabelBinarizer


encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot)
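
As a side note, OneHotEncoder in current sklearn versions also accepts string categories directly, so the LabelEncoder step is not strictly needed. A minimal sketch on a made-up column (not the real data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up category column; OneHotEncoder expects a 2D input
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for printing
one_hot = encoder.fit_transform(cats).toarray()

print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, in the column of its category; the columns follow the sorted order in encoder.categories_.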

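Custom Transformers

sklearn ships many useful transformers, but we still want to be able to write our own, and ideally they should be as easy to use as the built-in ones. The code below implements a very simple data transformation:

Before:

# Some attributes are not useful on their own; the total number of bedrooms, for example, is not what we care about,
# so we derive new combined attributes from the existing ones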
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

Now:

# Custom transformer
from sklearn.base import BaseEstimator, TransformerMixin


rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
    

attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.transform(housing.values)
print(len(housing_extra_attribs[0])) # three values are appended to each row by default (two if add_bedrooms_per_room=False)
print(housing_extra_attribs)

'''
[[-121.89 37.29 38.0 ... 4.625368731563422 2.094395280235988
  0.22385204081632654]
 [-121.93 37.05 14.0 ... 6.008849557522124 2.7079646017699117
  0.15905743740795286]
 [-117.2 32.77 31.0 ... 4.225108225108225 2.0259740259740258
  0.24129098360655737]
 ...
 [-116.4 34.09 9.0 ... 6.34640522875817 2.742483660130719
  0.1796086508753862]
 [-118.01 33.82 31.0 ... 5.50561797752809 3.808988764044944
  0.19387755102040816]
 [-122.45 37.77 52.0 ... 4.843505477308295 1.9859154929577465
  0.22035541195476574]]
  '''

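Another benefit of this transformer is that it can easily be added to a pipeline, as shown further below.

Feature scaling

Scaling matters for machine learning as well: many algorithms give different results when the features are on different scales, and our data has attributes on very different scales. There are two common ways to fix this:

- Min-max scaling, also called normalization, squeezes the values into the range 0 to 1: subtract the minimum, then divide by the maximum minus the minimum.
- Standardization subtracts the mean and then divides by the standard deviation. Unlike normalization, the result is not bounded to the range 0 to 1, but it is much less affected by extreme values in the data.

sklearn provides StandardScaler for feature scaling. We use the second approach, standardization.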
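
Written as formulas, min-max scaling is x' = (x - min) / (max - min) and standardization is x' = (x - mean) / std. The sketch below applies both by hand to a made-up column with one extreme value and checks them against sklearn's MinMaxScaler and StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min)
min_max_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, with the population std (ddof=0) that StandardScaler uses
standard_manual = (X - X.mean(axis=0)) / X.std(axis=0)

min_max_sklearn = MinMaxScaler().fit_transform(X)
standard_sklearn = StandardScaler().fit_transform(X)

print(min_max_manual.ravel())
print(standard_manual.ravel())
```

With min-max scaling the extreme value pushes the other four values close to 0, which is the sensitivity to extreme values mentioned above.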

Transformation Pipelines

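All the steps above (data cleaning, attribute combination, feature scaling, and text conversion) can be chained into a single process with sklearn's Pipeline. FeatureUnion additionally applies several pipelines side by side to the same input and concatenates their outputs.

# Combine the numeric and text pipelines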
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.attribute_names].values
    

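class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    # Thin wrapper so that LabelBinarizer can be used as a pipeline step, which calls fit and transform with (X, y).
    # BaseEstimator subclasses must not take *args/**kwargs in __init__, so the constructor has no parameters.
    def __init__(self):
        self.encoder = LabelBinarizer()
        
    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self
    
    def transform(self, x, y=None):
        return self.encoder.transform(x)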
        

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]


num_pipeline = Pipeline([("selector", DataFrameSelector(num_attribs)), 
                         ("imputer", SimpleImputer(strategy="median")), 
                         ("attribs_adder", CombinedAttributesAdder()), 
                          ("std_scaler", StandardScaler())])

cat_pipeline = Pipeline([("selector", DataFrameSelector(cat_attribs)), 
                        ("label_binarizer", CustomLabelBinarizer())])


full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline), 
                                               ("cat_pipeline", cat_pipeline)])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared[0])

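The code above covers the whole process from data cleaning to feature scaling.

Selecting and training a model

With the data prepared, we should have a clear understanding of it. The next step is to choose a model to train, which is itself a process of repeated trial and selection.

We start with a linear regression model:

# Try a linear regression model first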
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Take a few training rows to try the model on
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print(some_data_prepared)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))

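[Figure missing: predictions and labels for the five sample rows]

Writing a model with sklearn is simple. The printout shows that the predictions still differ from the observed values, so we need an error measure to track how far off the model is.

mean_squared_error computes the mean squared error (MSE), defined as $\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$, where $m$ is the number of samples, $\hat{y}_i$ is the prediction, and $y_i$ is the label.

For evaluation we normally use the root mean squared error (RMSE), the most common metric for regression models: $\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2}$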

In code:

# Measure the error with RMSE
from sklearn.metrics import mean_squared_error


housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse # This error is already large: either the features do not carry enough information or the model is not powerful enough

'''
68628.19819848923
'''
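
To see what these two numbers mean, here is a small worked example on made-up labels and predictions. MSE is the mean of the squared differences and RMSE is its square root, and the manual result matches sklearn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Squared errors are 1, 0, 4, 1, so the mean is 6 / 4 = 1.5
mse_manual = np.mean((y_pred - y_true) ** 2)
rmse_manual = np.sqrt(mse_manual)

mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, rmse_manual, mse_sklearn)
```

RMSE is in the same unit as the label, which is why an RMSE of about 68,628 on house prices is easy to interpret.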

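Given the distributions we saw earlier in this article, it is no surprise that linear regression has a large error. As a next step, we try training a decision tree model.

# Train a decision tree on the data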
from sklearn.tree import DecisionTreeRegressor


tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

tree_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

'''
0.0
'''

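An error of 0 on the training data means the model has overfit. Overfitting is not a good thing. To get a more honest estimate of the error, we can cross-validate on the training data. In essence, cross-validation splits the data into n parts and produces n error values.

Here we use K-fold cross-validation. Unlike leave-one-out cross-validation (LOOCV), each validation set contains several samples rather than a single one, and how many depends on the choice of K. With K=5, for example, the steps of 5-fold cross-validation are:

- Split the whole dataset into 5 parts
- Take each part in turn, without repetition, as the validation set, train the model on the other four parts, and compute that model's MSE_i on the validation set
- Average the 5 values of MSE_i to get the final MSE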
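
The three steps can be written out by hand with KFold. The sketch below uses synthetic data and a linear model purely for illustration; cross_val_score does the same loop internally:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus noise with standard deviation 1
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Step 1: split the data into 5 parts
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: train on four parts, compute MSE_i on the held-out part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], predictions))

# Step 3: average the five MSE_i values
cv_mse = np.mean(fold_mse)
print(fold_mse, cv_mse)
```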

[Figure missing: illustration of K-fold cross-validation]

# The error of 0 above indicates overfitting, so use sklearn's cross-validation
# Split the training data into folds and validate each fold against the others
from sklearn.model_selection import cross_val_score


scores = cross_val_score(tree_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
    
display_scores(tree_rmse_scores)

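[Figure missing: cross-validation scores of the decision tree]

The decision tree's error turns out to be high as well. Next, we cross-validate the linear regression model:

# Cross-validate the linear regression model to see its error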
line_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
line_rmse_scores = np.sqrt(-line_scores)


display_scores(line_rmse_scores)

[Figure missing: cross-validation scores of the linear regression model]

Finally, we train a random forest model:

# Random forest
from sklearn.ensemble import RandomForestRegressor


random_forest = RandomForestRegressor()
random_forest.fit(housing_prepared, housing_labels)

forest_predictions = random_forest.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

'''
22100.915917968654
'''

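The error looks much smaller this time, so this model is the most promising one so far. Keep in mind that this RMSE is measured on the training set itself, so it is an optimistic estimate.

After model selection we usually end up with a shortlist of models and simply pick the best one.

Fine-tuning the model

Machine learning algorithms generally have hyperparameters that influence the result, so optimizing a model includes finding the best values for them.

sklearn's GridSearchCV makes it easy to try combinations of parameters, for example:

# Once we have a shortlist of usable models, the chosen model needs fine-tuning
# Grid search: sklearn trains the model with every parameter combination and reports the best one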
from sklearn.model_selection import GridSearchCV


param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_

'''
{'max_features': 8, 'n_estimators': 30}
'''

The code above tries 3×4 + 2×3 = 18 combinations in total.
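
The count can be checked with sklearn's ParameterGrid, which expands the same kind of grid definition that GridSearchCV receives. With cv=5, each combination is trained five times:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV call above
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# 3 * 4 combinations from the first dict plus 1 * 2 * 3 from the second
n_combinations = len(ParameterGrid(param_grid))
# Each combination is fitted once per fold
n_fits = n_combinations * 5

print(n_combinations, n_fits)
```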

# Get the best estimator
grid_search.best_estimator_

'''
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)
'''

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

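[Figure missing: RMSE for each parameter combination]

This shows the error for every parameter combination at a glance.

Validating on the test set

Finally, once we have a usable model, we can evaluate it on the test set. The test set first has to be transformed with the pipeline from above:

# Evaluate the final model on the test data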
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

'''
47732.7520382174
'''

Summary

This article is only a small machine learning project, but it covers a complete analysis workflow. As it shows, understanding and processing the data makes up most of the work. Doing that well requires some knowledge of statistics, which will be summarized in later articles.

The complete process of a machine learning project
