All You Need to Know About Loss Functions

Loss Functions

name meaning
cost function The cost function measures the error between the model's predictions and the actual values; it is the average loss over the entire training set.
loss function Cost function and loss function are synonyms, except that the loss function refers to a single training example. It is sometimes also called the error function.
error function A synonym for loss function.
objective function A more general term: a specific cost function is designated as the objective function to be optimized.

The optimization objective is usually to minimize the cost function, written J(\theta), and this is typically done with gradient descent:

\text{Repeat until convergence:} \\ \theta_{j} \leftarrow \theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)

When optimizing with gradient descent, each loss function is handled with the same five steps:

  • Determine the predict function ((f(x))) and its parameters
  • Determine the loss function, the loss on a single training example ((L(\theta)))
  • Determine the cost function, the average loss over all training examples ((J(\theta)))
  • Determine the gradients of the cost function with respect to each unknown parameter ((\frac{\partial}{\partial\theta_j}J(\theta)))
  • Choose a learning rate and number of epochs, then update the parameters

Each loss function below is worked through with these five steps.
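The five steps can be sketched as one generic gradient descent loop. This is an illustrative sketch, not part of any library: `train` and `mse_grad` are made-up names, and the data in the usage example are fabricated from the line y = 2x + 1.

```python
def train(theta, X, Y, grad_fn, learning_rate, epochs):
    """Generic gradient descent: theta_j <- theta_j - alpha * dJ/dtheta_j."""
    for _ in range(epochs):
        grads = grad_fn(theta, X, Y)  # dJ/dtheta_j for every parameter
        theta = [t - learning_rate * g for t, g in zip(theta, grads)]
    return theta

def mse_grad(theta, X, Y):
    """Gradient of the mean squared error cost for f(x) = m*x + b."""
    m, b = theta
    N = len(X)
    dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(X, Y)) / N
    db = sum(-2 * (y - (m * x + b)) for x, y in zip(X, Y)) / N
    return [dm, db]
```

Plugging a different loss into the same loop only requires swapping `grad_fn`; that is exactly the structure of the per-loss update functions shown below.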

Regression Loss Functions

Squared Error Loss

Also known as L2 loss: the square of the difference between the actual value and the predicted value.

Advantages:

  • It is an upward-opening quadratic (of the form ax^2 + bx + c with a > 0), which has only a global minimum and no local minima
  • This ensures that gradient descent, if it converges at all, converges to the global minimum

Disadvantages:

  • Because of the squaring, it has low robustness to outliers, i.e. it is very sensitive to them, so it should not be used when the data are prone to outliers.
  • predict function

f(x_i) = mx_i +b \\ \theta = \{m,b\}

  • loss function

L(\theta)=(y_i-f(x_i))^{2}

  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}(y_i-f(x_i))^{2}

  • gradient of cost function

\begin{aligned} \text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m} &=-2(y_i-f(x_i))*x_i \\ \frac{\partial L(\theta)}{\partial b} &=-2(y_i-f(x_i))*1 \\ \text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \\ \end{aligned}

  • update parameters
def update_weights_MSE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
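The outlier sensitivity of the squared error can be seen numerically: with a single large residual, the share of the total loss contributed by that one point is far larger under squaring than under the absolute value. The residuals below are made-up illustrative numbers.

```python
residuals = [1.0, 2.0, 1.5, 30.0]  # the last residual is an outlier (made-up data)

# Share of the total loss contributed by the outlier under each loss
outlier_share_squared = residuals[-1] ** 2 / sum(r ** 2 for r in residuals)
outlier_share_absolute = abs(residuals[-1]) / sum(abs(r) for r in residuals)
```

Here the outlier accounts for roughly 99% of the squared-error total but only about 87% of the absolute-error total, which is why squared error lets outliers dominate the fit.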
Absolute Error Loss

Also known as L1 loss: the distance between the predicted value and the actual value, regardless of sign.

Advantages:

  • More robust to outliers than MSE, i.e. less sensitive to them

Disadvantages:

  • The absolute error curve is V-shaped: it is continuous but not differentiable at y - f(x) = 0, which makes the derivative awkward to compute numerically
  • predict function

f(x_i) = mx_i +b \\ \theta = \{m,b\}

  • loss function

L(\theta)=|y_i-f(x_i)|

  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}|y_i-f(x_i)|

  • gradient of cost function

\begin{aligned} \text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m} &=-\frac{(y_i-f(x_i))*x_i}{|y_i-f(x_i)|} \\ \frac{\partial L(\theta)}{\partial b} &=-\frac{(y_i-f(x_i))*1}{|y_i-f(x_i)|} \\ \text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \\ \end{aligned}

  • update parameters
def update_weights_MAE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -x(y - (mx + b)) / |y - (mx + b)|
        m_deriv += - X[i] * (Y[i] - (m*X[i] + b)) / abs(Y[i] - (m*X[i] + b))

        # -(y - (mx + b)) / |y - (mx + b)|
        b_deriv += -(Y[i] - (m*X[i] + b)) / abs(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
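One caveat with the update above: when a prediction is exact, `abs(Y[i] - (m*X[i] + b))` is zero and the division raises `ZeroDivisionError`. A common workaround (a sketch, not part of the original code) is to use the sign of the residual, which equals the same ratio wherever it is defined and picks the subgradient 0 at the kink:

```python
def l1_subgradient(residual):
    """Subgradient of |r|: +1 for r > 0, -1 for r < 0, and 0 at the kink r = 0."""
    if residual > 0:
        return 1.0
    if residual < 0:
        return -1.0
    return 0.0
```

Replacing `(Y[i] - (m*X[i] + b)) / abs(...)` with `l1_subgradient(Y[i] - (m*X[i] + b))` makes the update safe for exact fits.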
Huber Loss

Huber loss combines the two, with a hyperparameter δ that controls how much it leans toward MSE or MAE: when |y - f(x)| ≤ δ it behaves like MSE, and when |y - f(x)| > δ it behaves more like MAE.

Advantages:

  • Reduces sensitivity to outliers
  • Is differentiable everywhere
  • predict function

f(x_i) = mx_i +b \\ \theta = \{m,b\}

  • loss function

\begin{aligned} &L_{\delta}(\theta)=\left\{\begin{array}{l} \frac{1}{2}(y_i-{f(x_i)})^{2}, \text { if }|y_i-f(x_i)| \leq \delta. \\ \delta|y_i-f(x_i)|-\frac{1}{2} \delta^{2}, \quad \text { otherwise } \end{array}\right.\\ \end{aligned}

  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L_{\delta}(\theta)

  • gradient of cost function

\begin{aligned} \text{for a single training example:} \\ \text{if } |y_i-f(x_i)| \leq \delta: \\ \frac{\partial L_{\delta}(\theta)}{\partial m} &=-(y_i-f(x_i))*x_i \\ \frac{\partial L_{\delta}(\theta)}{\partial b} &=-(y_i-f(x_i))*1 \\ \text{otherwise}: \\ \frac{\partial L_{\delta}(\theta)}{\partial m} &=-\frac{\delta*(y_i-f(x_i))*x_i}{|y_i-f(x_i)|} \\ \frac{\partial L_{\delta}(\theta)}{\partial b} &=-\frac{\delta*(y_i-f(x_i))*1}{|y_i-f(x_i)|} \\ \text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L_{\delta}(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L_{\delta}(\theta)}{\partial b} \\ \end{aligned}

  • update parameters
def update_weights_Huber(m, b, X, Y, delta, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # derivative of quadratic for small values and of linear for large values
        if abs(Y[i] - m*X[i] - b) <= delta:
          m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
          b_deriv += - (Y[i] - (m*X[i] + b))
        else:
          m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
          b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
    
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
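A standalone sketch of the loss itself makes the piecewise definition easy to check: the two branches agree at |r| = δ, small residuals are penalized quadratically, and large residuals grow only linearly.

```python
def huber(residual, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    if abs(residual) <= delta:
        return 0.5 * residual ** 2
    return delta * abs(residual) - 0.5 * delta ** 2
```

The `-0.5 * delta**2` term in the linear branch is what makes the function continuous (and continuously differentiable) at the threshold.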

Classification Loss Functions

In classification, the difference between the true class and the predicted class can be measured with entropy, a measure of disorder or uncertainty: the larger the entropy of a probability distribution, the more uncertain that distribution is, and smaller values indicate a more certain distribution. For classification, then, smaller entropy means a more accurate prediction. Depending on the number of classes involved, the task is either binary or multi-class.
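This intuition is easy to check in a few lines (a standalone sketch; the distributions are arbitrary examples): a uniform distribution, which is maximally uncertain, has higher entropy than a sharply peaked, near-certain one.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log(p_i)), treating 0*log(0) as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain over 4 classes
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly certain
```

For K classes the entropy is maximized at log K by the uniform distribution and reaches 0 only when all the mass sits on a single class.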

Binary Classification Loss Functions
Binary Cross Entropy Loss

Probability of belonging to class 1 (the positive class) = p

Probability of belonging to class 0 (the negative class) = 1 - p

  • predict function

Here f has been replaced with p to emphasize that the output is a probability. Also, every (x_i) in the regression sections above referred to a single feature; the function below uses two features, and the superscript i denotes the i-th example in the dataset.
z(x_1^{(i)},x_2^{(i)}) = m_1*x_1^{(i)} + m_2 * x_2^{(i)} +b \\ p(z) = \frac{1}{1+e^{-z}} \\ \theta = \{m_1,m_2,b\}

  • loss function

L(\theta)=-y * \log (p)-(1-y) * \log (1-p)=\left\{\begin{array}{ll} -\log (1-p), & \text { if } y=0 \\ -\log (p), & \text { if } y=1 \end{array}\right.

  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

  • gradient of cost function

\begin{aligned} \text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m_1} &=\frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial m_1} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]\left[x_1^{(i)}\right] \\ &= (p-y^{(i)})*x_1^{(i)} \\ \frac{\partial L(\theta)}{\partial m_2} &= \frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial m_2} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]\left[x_2^{(i)}\right] \\ &= (p-y^{(i)})*x_2^{(i)} \\ \frac{\partial L(\theta)}{\partial b} &= \frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial b} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]\left[ 1\right] \\ &= (p-y^{(i)}) \\ \text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m_1} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_1} \\ \frac{\partial J(\theta)}{\partial m_2} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_2} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \\ \end{aligned}

  • update parameters
import math

def update_weights_BCE(m1, m2, b, X1, X2, Y, learning_rate):
    m1_deriv = 0
    m2_deriv = 0
    b_deriv = 0
    N = len(X1)
    for i in range(N):
        # sigmoid: p = 1 / (1 + e^-z)
        s = 1 / (1 + math.exp(-m1*X1[i] - m2*X2[i] - b))

        # Calculate partial derivatives: dL/dtheta = (p - y) * x
        m1_deriv += X1[i] * (s - Y[i])
        m2_deriv += X2[i] * (s - Y[i])
        b_deriv += (s - Y[i])

    # We subtract because the derivatives point in direction of steepest ascent
    m1 -= (m1_deriv / float(N)) * learning_rate
    m2 -= (m2_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m1, m2, b
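One practical note: `math.exp(-z)` overflows for very negative scores z. A numerically safer sigmoid (a sketch of the standard trick, not part of the original code) branches on the sign of z so the exponential argument is never positive:

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) < 1 and cannot overflow
    return ez / (1.0 + ez)
```

Both branches compute the same function; the second simply rewrites 1/(1 + e^-z) as e^z/(1 + e^z).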
Hinge Loss

Mainly used for support vector machine (SVM) classifiers, which expect class labels of -1 and 1, so make sure the 0 labels in your dataset are changed to -1.

  • predict function

f(x_1^{(i)},x_2^{(i)}) = m_1*x_1^{(i)} + m_2 * x_2^{(i)} +b \\ \theta = \{m_1, m_2, b\}

  • loss function

L(\theta)=\max (0,\quad1-y^{(i)} * f(x_1^{(i)},x_2^{(i)}))

  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

  • gradient of cost function

\begin{aligned} \text{for a single training example:} \\ \text{if } y^{(i)} * f(x_1^{(i)},x_2^{(i)}) \leq 1: \\ \frac{\partial L(\theta)}{\partial m_1} &=-y^{(i)}*x_1^{(i)}\\ \frac{\partial L(\theta)}{\partial m_2} &=-y^{(i)}*x_2^{(i)} \\ \frac{\partial L(\theta)}{\partial b} &=-y^{(i)} \\ \text{otherwise}: \\ \frac{\partial L(\theta)}{\partial m_1} &=\frac{\partial L(\theta)}{\partial m_2} = \frac{\partial L(\theta)}{\partial b} = 0 \\ \text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m_1} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_1} \\ \frac{\partial J(\theta)}{\partial m_2} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_2} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \\ \end{aligned}

  • update parameters
def update_weights_Hinge(m1, m2, b, X1, X2, Y, learning_rate):
    m1_deriv = 0
    m2_deriv = 0
    b_deriv = 0
    N = len(X1)
    for i in range(N):
        # Calculate partial derivatives
        if Y[i]*(m1*X1[i] + m2*X2[i] + b) <= 1:
          m1_deriv += -X1[i] * Y[i]
          m2_deriv += -X2[i] * Y[i]
          b_deriv += -Y[i]
        # else derivatives are zero

    # We subtract because the derivatives point in direction of steepest ascent
    m1 -= (m1_deriv / float(N)) * learning_rate
    m2 -= (m2_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m1, m2, b
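A standalone sketch of the loss itself clarifies the margin behavior: points classified correctly with margin at least 1 incur zero loss, while points inside the margin or on the wrong side are penalized linearly.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y*score) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)
```

This is why the gradient in the update above is zero for examples with y*f > 1: those examples contribute nothing to the cost.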
Multi-Class Loss Functions

In an email classification task, an email is not just classified as spam or not spam; it can also be assigned one of several further roles, such as work, family, social, or promotions. That is multi-class classification.

Multi-Class Cross Entropy Loss

The loss for an input vector X_i and its corresponding one-hot encoded target vector Y_i is:

  • predict function

z = \text{some function} \\ p(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}} \quad \text { for } i=1, \ldots, K \text { and } \mathbf{z}=\left(z_{1}, \ldots, z_{K}\right) \in \mathbb{R}^{K}

  • loss function
L(\theta)=-\sum_{k=1}^{K} y_{k} \log \left(p_{k}\right)
  • cost function

J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

  • gradient of cost function

To keep things simple, the loss here is only differentiated down to z; in a neural network this corresponds to the error gradient at the output layer.

For the partial derivatives of the softmax itself, see a detailed derivation of the softmax gradient.
\begin{aligned} \text{for a single training example (true class } k\text{):} \\ \text{if } j = k: \\ \frac{\partial L(\theta)}{\partial z_j} &=\frac{\partial L}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_j} \\ &= \left[-\frac{1}{p_k}\right]*\left[p_k(1-p_k)\right] \\ &= p_j-1 \\ \text{if } j \ne k: \\ \frac{\partial L(\theta)}{\partial z_j} &=\frac{\partial L}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_j} \\ &= \left[-\frac{1}{p_k}\right]*\left[-p_kp_j\right] \\ &= p_j \end{aligned}

  • update parameters
# importing requirements
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

# alpha = 0.001 as given in the lr parameter in Adam() optimizer

# build the model
model_alpha1 = Sequential()
model_alpha1.add(Dense(50, input_dim=2, activation='relu'))
model_alpha1.add(Dense(3, activation='softmax'))

# compile the model
opt_alpha1 = Adam(lr=0.001)
model_alpha1.compile(loss='categorical_crossentropy', optimizer=opt_alpha1, metrics=['accuracy'])

# fit the model
# dummy_Y is the one-hot encoded target
# history_alpha1 is used to score the validation and accuracy scores for plotting 
history_alpha1 = model_alpha1.fit(dataX, dummy_Y, validation_data=(dataX, dummy_Y), epochs=200, verbose=0)
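The derivation above (the gradient of the cross entropy with respect to each logit is p_j - y_j) can be verified numerically with a small standalone softmax; subtracting max(z) before exponentiating is the usual stability trick. The logits and one-hot target below are arbitrary illustrative values.

```python
import math

def softmax(z):
    """Stable softmax: shift by max(z) so the exponentials cannot overflow."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(z, y):
    """Gradient of L = -sum(y_k * log(p_k)) with respect to the logits z: p - y."""
    p = softmax(z)
    return [pi - yi for pi, yi in zip(p, y)]
```

A centered finite difference on L agrees with `cross_entropy_grad` to high precision, which is a handy sanity check for any hand-derived gradient.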