Study notes on the Chapter 2 example from 機器學(xué)習(xí)實戰(zhàn) (Hands-On Machine Learning)
The book's source code is here; the linked material also covers a lot of extra background and updated methods, well worth studying.
This post lives here.
Notes:
1. A CustomLabelBinarizer transformer is added to work around an argument-passing problem: Pipeline calls fit_transform(X, y), while LabelBinarizer.fit_transform accepts only X, so using it directly raises a wrong-number-of-arguments error.
2. Evaluating on the subset some_data raised an error. When the object-typed column is one-hot encoded on a small slice, the slice may be missing some categories, so the encoded width shrinks: with 5 rows selected, a value that should be encoded as [1,0,0,0,0] can come out as [1,0,0] because only 3 of the 5 categories appear in the slice. The prepared feature matrices then disagree in width, e.g. **(5, 14) (1000, 15) (10000, 16)**, and prediction fails.
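A minimal sketch of note 2, using the five ocean_proximity categories from this dataset: refitting LabelBinarizer on a slice that is missing some categories produces fewer one-hot columns than fitting on the full column.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

full = np.array(["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"])
slice_ = full[:3]  # a small slice may not contain every category

print(LabelBinarizer().fit_transform(full).shape)    # (5, 5): all 5 categories seen
print(LabelBinarizer().fit_transform(slice_).shape)  # (3, 3): only 3 categories seen

The fix used further below is to call transform (not fit_transform) on the subset, so the encoder fitted on the full training set is reused.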
The walkthrough has several parts: getting the data, exploring and visualizing it, preprocessing, selecting and training a model, and fine-tuning/analyzing the model.
Getting the data (downloading the dataset, creating a test set)
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    print("download from url : ", housing_url)
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

data = load_housing_data()
Note: six.moves keeps the code compatible with both Python 2 and Python 3. If the environment is known to be Python 3 only, the standard library works directly:
import urllib.request
urllib.request.urlopen("http://")
Useful pandas calls for a first look at the raw data:
- info(): a brief description of the data (row count, column dtypes, non-null counts)
- describe(): a summary of the numeric attributes (count, max, min, mean, std, i.e. the standard deviation, measuring how spread out the values are); null values are ignored here
- value_counts(): the breakdown of a categorical column
data.ocean_proximity.value_counts()
data.describe()
%matplotlib inline
import matplotlib.pyplot as plt
data.longitude.hist(bins = 50, figsize = (4, 3))
plt.show()
Creating a test set
The test set should be created once and then stay stable across runs. Four approaches:
1. Run the split once and save the resulting test set.
2. Fix the random seed, e.g. call np.random.seed(1) before permutation, so every run reproduces the same random numbers as the first (the seed must be set again on every run, and like option 1 this breaks once the dataset is refreshed; a minimal example follows the split code below).
sklearn provides a function much like the second method:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
3. Hash each instance's identifier and put the instance in the test set when the hash's last byte is 51 or less (51 ≈ 20% of 256).
4. Stratified sampling, for data with few features or a strong dependence on one attribute.
For example, in the book's house-price problem, prices correlate strongly with median income. Suppose incomes cluster between 20k and 50k: 10 people at 20k, 20 at 30k, 30 at 40k, 30 at 50k, and 10 above 50k, i.e. proportions of 0.1, 0.2, 0.3, 0.3, 0.1. Pure random sampling could deviate badly from these proportions, whereas stratified sampling yields a distribution close to the full sample's.
# Random split (methods 1/2)
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(data, test_ratio=0.2)
print("train len : ", len(train_set), " test len : ", len(test_set))
# Hash-based split (method 3)
import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Use the row index as the identifier
housing_with_id = data.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
# Or build a more stable id from longitude/latitude
housing_with_id["id"] = data["longitude"] * 1000 + data["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
from sklearn.model_selection import train_test_split
# random_state fixes the shuffle so repeated runs give the same split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(train_set.shape, test_set.shape)
# Stratified sampling (method 4)
# Preprocess: divide by 1.5 to reduce the number of income strata;
# where() keeps the original value when the condition holds and assigns 5.0 otherwise
data["income_cat"] = np.ceil(data["median_income"] / 1.5)
data["income_cat"].where(data["income_cat"] < 5, 5.0, inplace=True)

from sklearn.model_selection import StratifiedShuffleSplit
# n_splits is the number of train/test pairs to generate; one is enough here.
# Generating several shows that the strata proportions stay essentially identical.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

print("origin data percentage : ", data["income_cat"].value_counts() / len(data))
print("test sample data percentage : ", strat_test_set["income_cat"].value_counts() / len(strat_test_set))

# Once the split is done, drop the temporary income_cat column
for item in (strat_train_set, strat_test_set):
    item.drop("income_cat", axis=1, inplace=True)
Note: for identifier 1, hash(np.int64(identifier)).digest() yields 3\xcd\xec\xcc\xce\xbe\x802\x9f\x1f\xdb\xee\x7fXt\xcb.
The last byte is 0xcb, i.e. 203; since 203 is not below 256 * 0.2 = 51.2, this row does not go into the test set.
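Checking the note in code:

import hashlib
import numpy as np

digest = hashlib.md5(np.int64(1)).digest()  # md5 over the 8 raw bytes of int64 1
print(digest[-1])                # 203 (0xcb), matching the note above
print(digest[-1] < 256 * 0.2)    # False -> identifier 1 stays in the training set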
Gaining insights through data exploration and visualization
Visualizing geographical data
# Simple longitude/latitude scatter plot:
# data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
# s: marker size encodes population; c: marker color encodes median_house_value
# (larger values sit nearer the top of the colorbar); cmap picks the color map
data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.3, s=data.population/100,
          label="population", c="median_house_value", cmap=plt.cm.jet, colorbar=True)
plt.legend()
Looking for correlations
1. When the dataset is not too large, call corr(), which computes the standard correlation coefficient (Pearson's r); the formula is given below.
2. Alternatively, use pandas' scatter_matrix.
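The formula behind corr() is the Pearson correlation coefficient:

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$

It ranges from -1 to 1 and only captures linear relationships.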
corr_matrix = data.corr()
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(data[attributes], figsize=(12, 8))
Data preprocessing
- Separate out the labels
- Clean the data; for missing values there are several options:
1. Drop the rows (districts) with missing values
2. Drop the whole attribute
3. Fill in the missing values, e.g. with the median
4. Use scikit-learn's SimpleImputer (the Imputer class in older versions)
data = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# Option 1: data.dropna(subset=["total_bedrooms"])
# Option 2: data.drop("total_bedrooms", axis=1)
# Option 3: fill missing values with the median
median = data["total_bedrooms"].median()
data.total_bedrooms = data.total_bedrooms.fillna(median)
data.loc[data["total_bedrooms"].isnull()]  # check that no nulls remain

# Option 4: SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# The median only exists for numeric attributes; ocean_proximity is an object column
housing_num = data.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
# X is a plain NumPy array
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
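The medians the imputer learned are exposed through its statistics_ attribute and should match pandas' own computation:

print(imputer.statistics_)
print(housing_num.median().values)  # element-wise identical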
Handling text and categorical attributes
- LabelEncoder maps categories to integers, then OneHotEncoder maps the integers to one-hot vectors
- LabelBinarizer does both steps at once (dense array by default; see the sparse_output note below)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = data["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories='auto')
# reshape to an n x 1 column vector; the output is a SciPy sparse matrix
# (call .toarray() if a dense NumPy array is needed)
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
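LabelBinarizer returns a dense NumPy array by default; passing sparse_output=True yields a SciPy sparse matrix instead:

sparse_encoder = LabelBinarizer(sparse_output=True)
housing_cat_sparse = sparse_encoder.fit_transform(housing_cat)
print(type(housing_cat_1hot), type(housing_cat_sparse))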
Custom transformers
To define your own transformer, implement:
- fit
- transform
- fit_transform, equivalent to fit followed by transform (inheriting from TransformerMixin provides it for free)
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Adds two derived columns, plus an optional third controlled by add_bedrooms_per_room."""
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]  # rooms per household
        population_per_household = X[:, population_ix] / X[:, household_ix]  # average household size
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # np.r_ concatenates along rows (stacks vertically; column counts must match);
            # np.c_ concatenates along columns (side by side; row counts must match)
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(data.values)
print(housing_extra_attribs.shape, data.shape)
Feature scaling
- Min-max scaling (normalization): subtract the minimum and divide by the range (max minus min); easily distorted by extreme values, e.g. when the max or min is a bad data point
- Standardization: subtract the mean and divide by the standard deviation, giving zero mean and unit variance (a quick comparison sketch follows)
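A minimal sketch contrasting the two, using scikit-learn's MinMaxScaler and StandardScaler (not part of the book's code at this point):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

x = np.array([[1.0], [2.0], [3.0], [100.0]])       # 100.0 plays the role of an outlier
print(MinMaxScaler().fit_transform(x).ravel())     # squashed into [0, 1]; the outlier dominates the range
print(StandardScaler().fit_transform(x).ravel())   # zero mean, unit variance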
Transformation pipelines
- Pipeline
Runs its transformers in the listed order, feeding each step's output into the next.
Once the numeric pipeline exists, add a second pipeline for the object-typed column and combine the two with FeatureUnion.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# impute medians -> add derived columns -> standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
housing_num_str = num_pipeline.fit_transform(housing_num)
Add a DataFrameSelector transformer to pick out the columns each branch should process
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    """Wraps LabelBinarizer so it accepts the (X, y) signature Pipeline expects."""
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        return self.enc.transform(X)

# The numeric pipeline handles everything except the object column;
# DataFrameSelector picks out the columns each branch should process
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])
# FeatureUnion runs both pipelines and concatenates their outputs
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(data)
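Side note: on scikit-learn >= 0.20 the same numeric/categorical split can be written without DataFrameSelector or the CustomLabelBinarizer workaround, using ColumnTransformer plus OneHotEncoder. A hedged sketch (not the book's original code):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_core = Pipeline([                        # numeric branch; no selector step needed
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
full_pipeline_alt = ColumnTransformer([
    ('num', num_core, num_attribs),          # ColumnTransformer selects the columns itself
    ('cat', OneHotEncoder(), cat_attribs)
])
housing_prepared_alt = full_pipeline_alt.fit_transform(data)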
Selecting and training a model
Training and evaluating on the training set
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
print(housing_prepared.shape, data.shape, data.columns)

some_data = data.iloc[0:7000]
some_labels = housing_labels.iloc[0:7000]
# Use transform, not fit_transform: refitting the encoders on a subset can
# produce fewer one-hot columns (see note 2 at the top) and break prediction
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions : ", lin_reg.predict(some_data_prepared))
# Measure the RMSE with sklearn's mean_squared_error
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Trying a decision tree model
The held-out test set could of course be used to check accuracy, but while developing a model it is better to validate first and only run the test set once the model looks basically sound.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Evaluating with cross-validation
(tree_rmse above typically comes out as 0.0, a telltale sign of severe overfitting rather than a perfect model, which is exactly why a cross-validated estimate is needed.)
from sklearn.model_selection import cross_val_score
# cv=10: split the training set into 10 folds, train on 9 and evaluate on the held-out 1, ten times
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # scoring is negated MSE, so flip the sign before the square root

def display_scores(scores):
    print("Scores : ", scores)
    print("Means : ", scores.mean())
    print("Standard deviation : ", scores.std())

display_scores(rmse_scores)
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv = 10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
# forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv = 10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Saving the model
import joblib  # on older scikit-learn: from sklearn.externals import joblib (removed in 0.23)
joblib.dump(forest_reg, "my_forest_model.pkl")
my_forest_model = joblib.load("my_forest_model.pkl")
Fine-tuning the model
- Tweaking hyperparameters by hand works, but it is tedious
- scikit-learn's GridSearchCV takes preset candidate values and searches out the best combination
n_estimators: the number of trees in the forest. More trees generally help, at the cost of training time; past a reasonable count the gains level off, so bigger is not always better.
max_features: the size of the random feature subset used to split each node. The smaller the subset, the faster the variance shrinks, but the faster the bias grows.
grid_search.best_params_: the best parameter combination
grid_search.best_estimator_: the best estimator
grid_search.cv_results_: the evaluation scores of every combination (see the snippet after the search below)
- GridSearchCV suits a small number of combinations; when a hyperparameter ranges over a wide interval, prefer RandomizedSearchCV, which samples parameter values at random
- Ensemble methods: combine the best-performing models
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}
]
forest_reg = RandomForestRegressor()
# 12 + 6 = 18 combinations, each cross-validated 5 times -> 90 training runs
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
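Inspecting the results named above:

print(grid_search.best_params_)        # best hyperparameter combination
print(grid_search.best_estimator_)     # the refit best model
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)  # RMSE of each combination tried

And a sketch of the RandomizedSearchCV alternative mentioned above (the sampling ranges here are illustrative, not tuned):

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),  # sampled uniformly at random per trial
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)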
Analyzing the best model
- Retrieve the feature-importance scores; analyzing them shows which attributes matter most to the model
- Then evaluate the final model on the test set split off earlier
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
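As a follow-up, a sketch of how to turn this point estimate into a 95% confidence interval for the generalization RMSE, using scipy's t-distribution (an extension beyond the notes above):

from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)  # range the true test RMSE likely falls in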