Advice for Applying Machine Learning (Part 2)

Diagnosing Bias and Variance

High bias and high variance correspond, respectively, to underfitting and overfitting of the learning model.

To tell which of the two problems we are facing, we typically plot the training error Jtrain(Θ) and the cross-validation error JCV(Θ), as in the chart below, and apply the following rules:

High bias (underfitting)

  • Jtrain(Θ) is large
  • JCV(Θ) ≈ Jtrain(Θ)

High variance (overfitting)

  • Jtrain(Θ) is small
  • JCV(Θ) >> Jtrain(Θ)
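
Expressed as a minimal Python sketch (an illustration, not course material; the thresholds high_err and gap_tol are assumptions you would normally read off the plotted curves):

```python
def diagnose(j_train, j_cv, high_err=0.3, gap_tol=0.1):
    """Classify a fitted model from its training and cross-validation errors.

    high_err and gap_tol are illustrative thresholds (assumptions); in
    practice you judge "large" and ">>" from the plotted error curves.
    """
    if j_train > high_err and j_cv - j_train <= gap_tol:
        return "high bias (underfitting): J_train large, J_cv ~ J_train"
    if j_train <= high_err and j_cv - j_train > gap_tol:
        return "high variance (overfitting): J_train small, J_cv >> J_train"
    return "no clear bias/variance problem from these two numbers alone"

print(diagnose(j_train=0.40, j_cv=0.43))  # -> high bias
print(diagnose(j_train=0.02, j_cv=0.45))  # -> high variance
```
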
Supplementary Notes
Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

  • We need to distinguish whether bias or variance is the problem contributing to bad predictions.
  • High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ)≈Jtrain(Θ).

High variance (overfitting): Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).

This is summarized in the figure below.
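
To see these curves concretely, here is a minimal numpy sketch (the cubic target, noise level, and train/CV split are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: a noisy cubic, split into training and CV sets.
x = rng.uniform(-2, 2, 60)
y = x**3 - x + rng.normal(0, 0.5, x.size)
x_tr, y_tr, x_cv, y_cv = x[:40], y[:40], x[40:], y[40:]

def j(coef, x, y):
    """Squared-error cost of a fitted polynomial (no regularization)."""
    return np.mean((np.polyval(coef, x) - y) ** 2) / 2

for d in range(1, 9):
    coef = np.polyfit(x_tr, y_tr, deg=d)   # least-squares fit of degree d
    print(d, j(coef, x_tr, y_tr), j(coef, x_cv, y_cv))
# J_train keeps shrinking as d grows, while J_cv typically bottoms out
# near the true degree (3 here) and then rises again: the convex curve.
```
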

Regularization and Bias/Variance

When training a model, we usually apply regularization to avoid overfitting. The regularization parameter λ, however, needs to be chosen carefully.

Previously, when choosing λ, we considered only the single-variable case. Now we consider how to choose λ for a polynomial model.

For example, suppose we regularize some polynomial model with candidate values λ = 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10, and want to find the best value of λ.

First, we split the dataset into three parts: a training set, a cross-validation set, and a test set.

Then, for each of these candidate values of λ, we learn the parameters θ and compute Jtrain(θ) and JCV(θ).

Finally, we take the value of λ at which JCV(θ) is smallest and evaluate the corresponding model on the test set to obtain Jtest(θ).

In the figure, suppose JCV(θ) is smallest at λ = 0.08.

To make this easier to follow, and to help locate the best λ, we can plot Jtrain(θ) and JCV(θ) against λ as shown below:
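
Here is a minimal numpy sketch of that procedure (the toy data, the degree-8 feature map, and the ridge-style normal equation are all assumptions for illustration; note that, as the supplementary note below also stresses, Jtrain and JCV are evaluated without the regularization term):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 60)
y = x**3 - x + rng.normal(0, 0.5, x.size)   # assumed toy data
X = np.vander(x, 9)                          # degree-8 polynomial features
X_tr, y_tr, X_cv, y_cv = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    # Regularized normal equation (for simplicity the bias term is
    # penalized too, which the course itself does not do).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def j(theta, X, y):
    """Cost without the regularization term (i.e. evaluated at lambda = 0)."""
    return np.mean((X @ theta - y) ** 2) / 2

best = None
for lam in [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10]:
    theta = ridge_fit(X_tr, y_tr, lam)
    j_cv = j(theta, X_cv, y_cv)
    print(lam, j(theta, X_tr, y_tr), j_cv)   # J_train rises with lambda, J_cv is U-shaped
    if best is None or j_cv < best[1]:
        best = (lam, j_cv)
print("best lambda by J_cv:", best[0])
```
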

Supplementary Notes
Regularization and Bias/Variance

In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? In order to choose the model and the regularization term λ, we need to:

  1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
  2. Create a set of models with different degrees or any other variants.
  3. Iterate through the λs and for each λ go through all the models to learn some Θ.
  4. Compute the cross-validation error JCV(Θ) using the learned Θ (which was trained with regularization), but evaluate it without the regularization term (i.e. with λ = 0).
  5. Select the best combo that produces the lowest error on the cross validation set.
  6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.
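
Taken together, these six steps are a small grid search over (model, λ) pairs. A minimal sketch under the same toy-data assumptions as above, touching the test set only once at the very end:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 90)
y = x**3 - x + rng.normal(0, 0.5, x.size)    # assumed toy data
tr, cv, te = slice(0, 50), slice(50, 70), slice(70, 90)

def fit(x, y, d, lam):
    X = np.vander(x, d + 1)                   # degree-d polynomial features
    return np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)

def j(theta, x, y, d):
    return np.mean((np.vander(x, d + 1) @ theta - y) ** 2) / 2

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
# Step 5: pick the (degree, lambda) combo with the lowest CV error.
d, lam = min(((d, lam) for d in range(1, 9) for lam in lambdas),
             key=lambda c: j(fit(x[tr], y[tr], *c), x[cv], y[cv], c[0]))
# Step 6: report generalization of the chosen combo on the test set.
theta = fit(x[tr], y[tr], d, lam)
print("degree:", d, "lambda:", lam, "J_test:", j(theta, x[te], y[te], d))
```
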
Learning Curves

Plotting learning curves helps us check whether a learning algorithm is working properly. A learning curve plots the training error and the cross-validation error as functions of the number of training examples m.

In the figure above, the hypothesis is hθ(x) = θ0 + θ1x + θ2x², and regularization is not considered here. When m = 1, the hypothesis fits the training set perfectly and Jtrain(θ) = 0, but it generalizes poorly to the cross-validation set, so JCV(θ) is large. When m = 2, the hypothesis still fits the training set well; Jtrain(θ) grows slightly, while JCV(θ) shrinks slightly but remains poor; and so on. Once m is large enough, Jtrain(θ) rises to some level and flattens out, JCV(θ) falls to some level and flattens out, and the two values end up very close to each other.

Therefore, when a learning algorithm suffers from high bias, adding more training examples is of little use on its own.

In the figure above, the hypothesis is hθ(x) = θ0 + θ1x + θ2x² + ... + θ100x¹??, this time with regularization and a very small λ. When m = 5, the hypothesis fits the training set well, so Jtrain(θ) is small, but it generalizes poorly, so JCV(θ) is large. When m = 12, the hypothesis still fits the training set well, with Jtrain(θ) slightly larger and JCV(θ) slightly smaller; and so on. When m is large enough, Jtrain(θ) keeps increasing gradually while JCV(θ) keeps decreasing, with a clear gap remaining between them.

Therefore, when a learning algorithm suffers from high variance, adding more training examples is likely to help.

Note: the video does not make clear whether, as m keeps growing, the gradually increasing Jtrain(θ) and the gradually decreasing JCV(θ) will eventually meet.
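
A minimal sketch of how such a learning curve is computed (toy data assumed): for each size m we retrain on only the first m training examples, but JCV is always measured on the full cross-validation set:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 120)
y = x**3 - x + rng.normal(0, 0.5, x.size)    # assumed toy data
x_tr, y_tr, x_cv, y_cv = x[:80], y[:80], x[80:], y[80:]

def j(coef, x, y):
    return np.mean((np.polyval(coef, x) - y) ** 2) / 2

degree = 2                                    # quadratic hypothesis, as in the text
for m in range(3, 81, 7):
    coef = np.polyfit(x_tr[:m], y_tr[:m], degree)   # train on the first m examples
    print(m, j(coef, x_tr[:m], y_tr[:m]),           # J_train on those same m examples
          j(coef, x_cv, y_cv))                       # J_cv always on the full CV set
# High bias: both curves plateau high and close together as m grows.
# High variance: J_train stays low and a sizeable gap to J_cv persists.
```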

Supplementary Notes
Learning Curves

Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily yield 0 error, because we can always find a quadratic curve that passes exactly through those points. Hence:

  • As the training set gets larger, the error for a quadratic function increases.
  • The error value will plateau out after a certain m, or training set size.

Experiencing high bias:

Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.

Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high with Jtrain(Θ)≈JCV(Θ).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Experiencing high variance:

Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.

Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ) < JCV(Θ) but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help.

Deciding What to Do Next

At the beginning of Advice for Applying Machine Learning (Part 1), we proposed the following remedies for predictions that suffer from large errors:

  • Get more training examples
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ

Having examined each of these remedies, we reached the following conclusions:

  • Get more training examples: fixes high variance (overfitting)
  • Try smaller sets of features: fixes high variance (overfitting)
  • Try getting additional features: fixes high bias (underfitting)
  • Try adding polynomial features: fixes high bias (underfitting)
  • Try decreasing the regularization parameter λ: fixes high bias (underfitting)
  • Try increasing the regularization parameter λ: fixes high variance (overfitting)

For neural networks, a 'small' model is prone to high bias (underfitting), but has the advantage of being computationally cheap; a 'large' model (with more activation units per hidden layer, or more hidden layers) is prone to high variance (overfitting) and is computationally expensive. In general, though, the larger a regularized neural network is, the better it performs.

We usually default to a network with a single hidden layer, but that is not always the optimal model. We can therefore split the data into training, cross-validation, and test sets, train networks with different numbers of hidden layers, and keep the one with the smallest JCV(Θ), as in the sketch below.
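
A minimal sketch of that selection loop, using scikit-learn's MLPRegressor as a stand-in (an assumption; the course builds its own network in Octave). The candidate architectures and all hyperparameters here are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, (300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 300)   # assumed toy data
X_tr, y_tr, X_cv, y_cv = X[:200], y[:200], X[200:], y[200:]

best = None
# Candidate architectures: 1, 2, or 3 hidden layers of 25 units each (assumed).
for layers in [(25,), (25, 25), (25, 25, 25)]:
    net = MLPRegressor(hidden_layer_sizes=layers, alpha=0.01,   # alpha = L2 penalty
                       max_iter=2000, random_state=0).fit(X_tr, y_tr)
    j_cv = np.mean((net.predict(X_cv) - y_cv) ** 2) / 2         # J_cv(Theta)
    print(layers, j_cv)
    if best is None or j_cv < best[1]:
        best = (layers, j_cv)
print("selected architecture:", best[0])
```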

Supplementary Notes
Deciding What to Do Next Revisited

Our decision process can be broken down as follows:

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance

Diagnosing Neural Networks

  • A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
  • A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train your neural network on a number of hidden layers using your cross validation set. You can then select the one that performs best.

Model Complexity Effects:

  • Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
  • Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
  • In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.