
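Part 0. Introduction

The first version of this article is my summary after working through two articles from the Titanic project in the Kaggle competition community. The two articles are:

Titanic Data Science Solutions

The first article uses the Titanic project to walk through the entire data mining workflow, from understanding the data to training models and making the final submission. Implementing it step by step gives a clear picture of the whole process and helps build intuition for hands-on work.

Introduction to Ensembling/Stacking in Python

The second article also walks through the full workflow using the Titanic project, but its focus is on introducing and practicing an ensemble method for data mining: the Stacking algorithm.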

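Part 1. Hands-On Workflow and Common Methods

0. Understanding the Problem and Observing the Data

- Understanding the problem

When taking on any competition or project, the first step is to read the problem statement carefully and fully understand its background. Projects follow roughly the same overall workflow, but each step is implemented differently from one project to the next. Take feature extraction and selection: besides using mathematical tools to reduce dimensionality or extract the principal features, it is just as important to understand the business scenario behind the problem and to reason about it from inside that context. Doing so strengthens our understanding of the features in advance, makes it easier to judge whether a feature engineering step is reasonable, and in some scenarios lets us use characteristics specific to the data to adjust the model and get better results.

This kind of thinking about the business scenario should run through the entire project.

- Observing the data

After a first reading of the problem, the next step is to observe and think about the data. In Python, pandas makes loading and processing data very convenient. Some pandas methods for inspecting data are: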

import pandas as pd
train_df = pd.read_csv('./data/train.csv')
train_df.head()
train_df.info()
# describe() summarizes continuous numeric features
train_df.describe()
# describe(include=['O']) summarizes string features and discrete categorical features
train_df.describe(include=['O'])

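Of these, describe() is a particularly effective way to summarize data. Passing a quantile argument such as percentiles=[.1, .2, .3, .4, .5, .6, .7] makes it report the values of a continuous numeric feature at those quantiles. For categorical features it returns the count, the number of distinct values, the most frequent value, and that value's frequency.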
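As a small illustration of the two describe() modes, here is a sketch on toy data. The column names and values are made up for the example and are not the Titanic data set.

```python
import pandas as pd

# Toy data, purely illustrative (not the Titanic data set).
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    # Force object dtype so that include=["O"] selects this column.
    "Sex": pd.Series(["male", "female", "female", "female", "male",
                      "male", "female", "female", "female", "male"],
                     dtype=object),
})

# Numeric column: request custom percentiles in addition to the median.
numeric_summary = df["Age"].describe(percentiles=[.1, .5, .9])
print(numeric_summary)

# Object column: count, number of unique values, top value, and its frequency.
categorical_summary = df.describe(include=["O"])
print(categorical_summary)
```

The numeric summary gains one row per requested percentile, while the categorical summary reports `count`, `unique`, `top`, and `freq`.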

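Observing and thinking about data usually goes hand in hand with the business scenario: we need to understand what kind of situation would produce such data, which features have an obvious relationship with the outcome, and so on. Before taking any action, we should form our own hypotheses about the data and then look for evidence of them in the data.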

1. Analyzing the Data and Feature Engineering

We need arrive at following assumptions based on data analysis done so far. We may validate these assumptions further before taking appropriate actions.

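The sentence above is borrowed from the first article, and it also captures the thinking behind this section: data analysis is mostly about using the data to validate our hypotheses and only then taking the appropriate actions.

- The feature engineering workflow needs to address seven main goals: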

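Classifying:

We want to classify or categorize our samples, and to understand what the resulting classes mean and how they relate to our goal.

Correlating:

One approach is to use the features available in the training data set. Which features contribute significantly to our solution goal? Statistically speaking, which features correlate strongly with the target? When a feature value changes, does the target change with it, and vice versa? This can be tested for both the numerical and the categorical features in the data set. We also want to determine the correlation among the features themselves; correlating certain features can help us create, complete, or correct other features.

Converting:

For the modeling stage, we need to prepare data suitable for training. Depending on the choice of model algorithm, all features may need to be converted to equivalent numerical values, for example converting text categorical values to numeric values.

Completing:

Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

Correcting:

We can also analyze the training data set for errors or possibly unusable feature values, and try to correct those values or exclude the samples containing them. One way to do this is to detect outliers among the samples or features. We may also discard a feature entirely if it does not contribute to the analysis or may significantly skew the results.

Creating:

We can create new features based on an existing feature or a set of features, such that the new feature follows the correlating, converting, and completing goals.

Charting:

Choose the right visualization plots and charts depending on the nature of the data and the solution goals.

These seven directions are not only what we need to think about during feature engineering; they are also seven angles from which to approach data analysis and processing.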

The corresponding data cleaning operations can therefore include:

Correcting by dropping features
Creating new feature extracting from existing || Create new feature combining existing features
Converting a categorical feature || Converting categorical feature to numeric
Completing numerical continuous feature || Completing a categorical feature
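The four kinds of cleaning operation listed above can be sketched on a toy DataFrame. The column names mimic the Titanic data set, but the rows, the choice of feature to drop, and the fill strategies (median and mode) are assumptions made for illustration only.

```python
import pandas as pd

# Toy data, purely illustrative; only the column names resemble the Titanic set.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John",
             "Heikkinen, Miss. Laina", "Allen, Mr. William"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "373450"],
})

# Correcting: drop a feature that we assume does not help the analysis.
df = df.drop(columns=["Ticket"])

# Creating: extract a new feature (Title) from an existing one (Name).
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Converting: map a categorical feature to numeric codes.
df["Sex"] = df["Sex"].map({"female": 1, "male": 0})

# Completing a numerical continuous feature: fill gaps with the median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Completing a categorical feature: fill gaps with the most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Median and mode are only one reasonable default; other fill strategies, such as estimating a value from correlated features, fit the same slot.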

- Common data analysis methods

Analyze by pivoting feature

# Group by feature1, then compare the mean of feature2 (for example the target) across groups
train_df[['feature1', 'feature2']].groupby(['feature1'], as_index=False).mean().sort_values(by='feature2', ascending=False)
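To make the pivoting pattern concrete, here is a sketch on made-up data: group by one feature and compare the mean of the target across the groups. The values below are illustrative and are not taken from the Titanic data set.

```python
import pandas as pd

# Toy data, purely illustrative.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of the target per group, sorted from highest to lowest.
pivot = (
    df[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(pivot)
```

Because `Survived` is 0/1, its per-group mean is simply the survival rate of that group, which makes this a quick check of whether a feature separates the target.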

Analyze by visualizing data

# Plotting with seaborn
# sns.barplot | plt.hist | sns.pointplot
import matplotlib.pyplot as plt
import seaborn as sns

# Recent seaborn versions use `height` where older ones used `size`
grid = sns.FacetGrid(train_df, col='feature1', row='feature2', height=2.2, aspect=1.6)
grid.map(plt.hist, 'feature3', alpha=.5, bins=20)
grid.add_legend()

grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
plt.show()

# Heatmap of feature correlations
# (`train` is the preprocessed DataFrame in which every column is numeric)
colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)

# Recent seaborn versions use `fill` where older ones used `shade`
g = sns.pairplot(train[['Survived', 'Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked',
                        'FamilySize', 'Title']], hue='Survived', palette='seismic', height=1.2,
                 diag_kind='kde', diag_kws=dict(fill=True), plot_kws=dict(s=10))
g.set(xticklabels=[])

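# Plotting with plotly
# go.Scatter | go.Bar
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)  # lets iplot render inside a Jupyter notebook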
trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Heatmap with plotly
data = [
    go.Heatmap(
        z=base_predictions_train.astype(float).corr().values,
        x=base_predictions_train.columns.values,
        y=base_predictions_train.columns.values,
        colorscale='Viridis',
        showscale=True,
        reversescale=True
    )
]
py.iplot(data, filename='labelled-heatmap')

2. Model Training and Prediction

This step is about choosing a data mining algorithm. Algorithms commonly used for classification include:

Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
XGBoost
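One quick way to compare several of the classifiers listed above is 5-fold cross-validated accuracy. The sketch below uses a synthetic data set from scikit-learn rather than the Titanic data, leaves every model at (mostly) default settings, and omits XGBoost to stay within scikit-learn; all of these are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, purely illustrative.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Support Vector Machines": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the ranking on a real problem; the point is only the comparison pattern.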

- The Stacking algorithm

One ensemble method deserves separate discussion here: Stacking. Take a two-layer Stacking setup as an example.

For the first layer, pick 4 or 5 classification algorithms, denoted model_a, model_b, model_c, model_d, and model_e.

Train them on the training data. Note that in Stacking, the first-layer models are trained with K-fold cross-validation. Taking 5-fold cross-validation as an example:

Suppose we have an m × n training set train_set and a k × n test set test_set with the same n features. Split train_set into 5 parts. Take 4 of them as a new (4/5)m × n training set, tr_set, and the remaining part as a temporary (1/5)m × n held-out set, te_set. Train model_a on tr_set, then use the trained model to predict the held-out te_set. The result has shape (1/5)m × 1 and is stored in the corresponding slice of an m × 1 array, model_list_a. Then use this same model_a to predict the entire test set, giving model_a_tmp_1.

Because this is 5-fold cross-validation, the process repeats five times: model_a is trained five times, each time on a different (4/5)m × n subset. In the end, model_list_a holds model_a's predictions for every training sample, each made by a model that did not see that sample during training. Each repetition also produces another test-set prediction, model_a_tmp_2 through model_a_tmp_5. Averaging all five model_a_tmp results gives model_a_test.

Since we chose five base models, the other four (model_b, model_c, model_d, and model_e) are each trained five times in the same way. This yields model_list_b, model_list_c, model_list_d, and model_list_e, which store each model's predictions for the whole training set, together with each model's averaged prediction for test_set: model_b_test, model_c_test, model_d_test, and model_e_test.

Then concatenate the results, as in the following code:

x_train = np.concatenate(( model_list_a, model_list_b, model_list_c, model_list_d, model_list_e), axis=1)
x_test = np.concatenate(( model_a_test, model_b_test, model_c_test, model_d_test, model_e_test), axis=1)

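At this point x_train has m rows and one column per base model, holding that model's out-of-fold predictions.

[Figure: screenshot of an example x_train]

We use this x_train and x_test for the second layer of the stacking algorithm, for example training with xgboost as follows: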

import xgboost as xgb

# y_train holds the true labels of the training set
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,  # the scikit-learn wrapper's name for the older nthread option
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

The resulting predictions are the final output of the Stacking algorithm, which combines many base algorithms.

The tricky part of this process is the k rounds of training each model goes through under the k-fold cross-validation scheme. The implementation is as follows:

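import numpy as np
from sklearn.model_selection import KFold

# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
# Current scikit-learn requires shuffle=True whenever random_state is set
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

def get_oof(clf, x_train, y_train, x_test):
    """Return out-of-fold predictions for x_train and fold-averaged predictions for x_test.

    x_train, y_train and x_test are expected to be NumPy arrays.
    """
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        # Retrain the model on the four training folds
        clf.fit(x_tr, y_tr)

        # Predict the held-out fold and the full test set
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    # Average the test-set predictions over the folds
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)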
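The two layers described above can be put together end to end. The following is a minimal sketch under several assumptions: it runs on synthetic data rather than the Titanic set, uses three scikit-learn base models instead of five, and swaps in LogisticRegression for the second layer in place of xgboost so that it depends only on scikit-learn. The helper name `out_of_fold` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, purely illustrative: 400 training rows and 100 test rows.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

def out_of_fold(model, X_tr_all, y_tr_all, X_te_all):
    """Hypothetical helper: one column of first-layer outputs for one model."""
    oof_train = np.zeros(len(X_tr_all))
    test_preds = np.zeros((kf.get_n_splits(), len(X_te_all)))
    for i, (tr_idx, val_idx) in enumerate(kf.split(X_tr_all)):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X_tr_all[tr_idx], y_tr_all[tr_idx])
        # Predictions for samples this fold's model never saw.
        oof_train[val_idx] = fold_model.predict(X_tr_all[val_idx])
        # Each fold's model also predicts the whole test set.
        test_preds[i] = fold_model.predict(X_te_all)
    # Average the per-fold test predictions.
    return oof_train.reshape(-1, 1), test_preds.mean(axis=0).reshape(-1, 1)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GaussianNB(),
    KNeighborsClassifier(),
]

# First layer: one column per base model.
columns = [out_of_fold(m, X_train, y_train, X_test) for m in base_models]
meta_train = np.concatenate([c[0] for c in columns], axis=1)
meta_test = np.concatenate([c[1] for c in columns], axis=1)

# Second layer: train on the first-layer predictions.
meta_model = LogisticRegression().fit(meta_train, y_train)
predictions = meta_model.predict(meta_test)

print(meta_train.shape, meta_test.shape)
print("stacked accuracy:", (predictions == y_test).mean())
```

Training the second layer on out-of-fold predictions, rather than on predictions the base models made for data they were fitted on, is what keeps the second layer from simply learning to trust overfitted first-layer outputs.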

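For further reading I recommend the article "The Killer Weapon for Data Competitions: Model Ensembling (stacking & blending)". I also plan to summarize some other important algorithms later; GBDT and xgboost are on the to-do list.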
