Stanford cs231n Assignment #1 (d): Implementing a Two-Layer Neural Network

This chapter covers how to implement the simplest possible two-layer neural network, but first a few words about what backpropagation actually means. Andrew Ng's machine learning course on Coursera, which I took last year, never explained error backpropagation in much depth, whereas the cs231n lectures treat it very thoroughly.

Back Propagation

[Figure: an example computational graph ("circuit") with + and * gates]

Error backpropagation is really nothing more than the chain rule applied to the derivative of the loss function with respect to each layer's inputs. The familiar notion of the "residual" (delta) at a layer is just the derivative of the loss, propagated back to that layer, with respect to the layer's output score (the value before the activation function). A post reposted on my CSDN blog explains this especially clearly: http://blog.csdn.net/u014585022/article/details/53503930

During backpropagation, every gate, e.g. a + or * unit in the figure above, knows exactly the derivative of the loss function with respect to its own output, i.e. dL/dO (this is the "residual", though the concept hardly needs a name of its own). What we actually want, however, is dL/dW. Applying the chain rule once more gives dL/dW = dL/dO × dO/dW, and dO/dW is easy to obtain, since the local derivative of a typical gate or activation function has a simple closed form.
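Before the figures, here is a minimal numeric sketch of this chain-rule step (a made-up one-weight gate with a squared-error loss; all names and numbers are illustrative only):

# forward:  O = w * x,  L = (O - t)**2
# backward: dL/dW = dL/dO * dO/dW = 2*(O - t) * x
w, x, t = 2.0, 3.0, 5.0
O = w * x                   # forward pass: the "score"
dL_dO = 2.0 * (O - t)       # gradient of the loss w.r.t. the gate output
dO_dW = x                   # local gradient of the gate
dL_dW = dL_dO * dO_dW       # chain rule: 2*(6-5)*3 = 6.0

# numerical check with a centered difference
h = 1e-6
num = (((w + h) * x - t)**2 - ((w - h) * x - t)**2) / (2 * h)
print(dL_dW, num)           # both ~6.0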
Here is an example:

[Figures: the worked circuit example from the cs231n slides, with forward values and backpropagated gradients]

(The 0.39 in the figure should be 0.4.)

Chained together, the true form of this circuit is the following:

[Figure: the full circuit written out as a single expression]

Now that backpropagation has given us dL/dW for w0, w1, and w2, we can run SGD!

The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass. This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradients on the output because it will be multiplied by x1.0 (and remain unchanged). In the example circuit above, note that the + gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.
The max gate routes the gradient. Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass). This is because the local gradient for a max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.
The multiply gate is a little less easy to interpret. Its local gradients are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on x is -8.00, which is -4.00 x 2.00.
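These three behaviors are easy to replay in a few lines. A minimal sketch, assuming the example circuit is f = 2 * (x*y + max(z, w)) with x = 3, y = -4, z = 2, w = -1 (values reconstructed from the gradients quoted above, so treat them as an assumption):

x, y, z, w = 3.0, -4.0, 2.0, -1.0

# forward pass
q = x * y                    # multiply gate: -12
m = max(z, w)                # max gate: 2
f = 2.0 * (q + m)            # add gate, then a *2 gate: -20

# backward pass, starting from df/df = 1
dsum = 2.0 * 1.0             # the *2 gate puts a gradient of 2.00 on the add gate
dq = dsum                    # the add gate distributes the gradient unchanged
dm = dsum
dz = dm if z > w else 0.0    # the max gate routes the gradient to the larger input
dw = dm if w > z else 0.0
dx = y * dq                  # multiply gate swaps its inputs: -4.00 * 2.00 = -8.00
dy = x * dq                  #                                  3.00 * 2.00 =  6.00
print(dx, dy, dz, dw)        # -8.0 6.0 2.0 0.0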

Attached below is the backpropagation computation worked out by hand:

[Photo: handwritten derivation of the backpropagation computation]

For the output layer (screenshots from UFLDL):

[Figures: the output-layer residual formulas from the UFLDL tutorial]
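For reference, the output-layer residual in UFLDL (derived there for a squared-error loss, if memory serves) is

delta^(nl) = -(y - a^(nl)) ⊙ f'(z^(nl))

where a is the activation, z the pre-activation score, and f the activation function.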

Two-Layer Network

The network in this exercise has two layers: the hidden layer uses a ReLU activation, and the output layer is a softmax. That means the loss function is no longer the one in the UFLDL figure above.
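Concretely, with scores s = a1·W2 + b2 and softmax probabilities p_j = exp(s_j) / Σ_k exp(s_k), the data loss for one sample is L_i = -log(p_{y_i}), and its gradient with respect to the scores takes the well-known form

∂L_i/∂s_j = p_j - 1(j = y_i)

i.e. subtract 1 from the correct class's probability. This is exactly the dscores = p - binary line in the code below.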

The network structure is shown in the figure below:

[Photo: sketch of the two-layer network architecture]

The code is as follows:

import numpy as np
import matplotlib.pyplot as plt

class TwoLayerNet(object):
  """
  A two-layer fully-connected neural network. The net has an input dimension of
  N, a hidden layer dimension of H, and performs classification over C classes.
  We train the network with a softmax loss function and L2 regularization on the
  weight matrices. The network uses a ReLU nonlinearity after the first fully
  connected layer.

  In other words, the network has the following architecture:

  input - fully connected layer - ReLU - fully connected layer - softmax

  The outputs of the second fully-connected layer are the scores for each class.
  """

  def __init__(self, input_size, hidden_size, output_size, std=1e-4):
    """
    Initialize the model. Weights are initialized to small random values and
    biases are initialized to zero. Weights and biases are stored in the
    variable self.params, which is a dictionary with the following keys:

    W1: First layer weights; has shape (D, H)
    b1: First layer biases; has shape (H,)
    W2: Second layer weights; has shape (H, C)
    b2: Second layer biases; has shape (C,)

    Inputs:
    - input_size: The dimension D of the input data.
    - hidden_size: The number of neurons H in the hidden layer.
    - output_size: The number of classes C.
    """
    self.params = {}
    self.params['W1'] = std * np.random.randn(input_size, hidden_size)
    self.params['b1'] = np.zeros(hidden_size)
    self.params['W2'] = std * np.random.randn(hidden_size, output_size)
    self.params['b2'] = np.zeros(output_size)

  def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None
    #############################################################################
    # TODO: Perform the forward pass, computing the class scores for the input. #
    # Store the result in the scores variable, which should be an array of      #
    # shape (N, C).                                                             #
    #############################################################################
    z1 = X.dot(W1) + b1
    a1 = np.maximum(0, z1)
    scores = a1.dot(W2) + b2
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    
    # If the targets are not given then jump out, we're done
    if y is None:
      return scores

    # Compute the loss
    loss = None
    #############################################################################
    # TODO: Finish the forward pass, and compute the loss. This should include  #
    # both the data loss and L2 regularization for W1 and W2. Store the result  #
    # in the variable loss, which should be a scalar. Use the Softmax           #
    # classifier loss. So that your results match ours, multiply the            #
    # regularization loss by 0.5                                                #
    #############################################################################
    # shift scores row-wise for numerical stability, then exponentiate
    scores -= np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    # class probabilities, shape (N, C)
    p = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # average cross-entropy data loss plus L2 regularization
    loss = np.mean(-np.log(p[np.arange(N), y]))
    loss += 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))
    
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    # Backward pass: compute gradients
    grads = {}
    #############################################################################
    # TODO: Compute the backward pass, computing the derivatives of the weights #
    # and biases. Store the results in the grads dictionary. For example,       #
    # grads['W1'] should store the gradient on W1, and be a matrix of same size #
    #############################################################################

    # gradient of the softmax loss w.r.t. the scores: p - one_hot(y)
    num_classes = W2.shape[1]
    binary = np.zeros((N, num_classes))
    binary[np.arange(N), y] = 1
    dscores = p - binary

    # W2 and b2
    grads['W2'] = np.dot(a1.T, dscores)
    grads['b2'] = np.sum(dscores, axis=0)

    # backprop into hidden layer, aka da1
    dhidden = np.dot(dscores, W2.T)
    
    # backprop through the ReLU: da1/dz1 is 1 where z1 > 0, else 0
    dz1 = dhidden * (z1 > 0)

    # compute dW1, db1
    # where dL/dW1 = dL/da1 * da1/dz1 * dz1/dW1
    grads['W1'] = np.dot(X.T, dz1)
    grads['b1'] = np.sum(dz1, axis=0)

    # average over the batch and add the regularization gradient
    grads['W1'] /= N
    grads['W1'] += reg*W1
    grads['W2'] /= N
    grads['W2'] += reg*W2
    grads['b1'] /= N
    grads['b2'] /= N
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return loss, grads

  def train(self, X, y, X_val, y_val,
            learning_rate=1e-3, learning_rate_decay=0.95,
            reg=1e-5, num_iters=100,
            batch_size=200, verbose=False):
    """
    Train this neural network using stochastic gradient descent.

    Inputs:
    - X: A numpy array of shape (N, D) giving training data.
    - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
      X[i] has label c, where 0 <= c < C.
    - X_val: A numpy array of shape (N_val, D) giving validation data.
    - y_val: A numpy array of shape (N_val,) giving validation labels.
    - learning_rate: Scalar giving learning rate for optimization.
    - learning_rate_decay: Scalar giving factor used to decay the learning rate
      after each epoch.
    - reg: Scalar giving regularization strength.
    - num_iters: Number of steps to take when optimizing.
    - batch_size: Number of training examples to use per step.
    - verbose: boolean; if true print progress during optimization.
    """
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    # Use SGD to optimize the parameters in self.model
    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for it in range(num_iters):
      X_batch = None
      y_batch = None

      #########################################################################
      # TODO: Create a random minibatch of training data and labels, storing  #
      # them in X_batch and y_batch respectively.                             #
      #########################################################################
      batch_idx = np.random.choice(num_train, batch_size)
      X_batch = X[batch_idx, :]
      y_batch = y[batch_idx]
      #########################################################################
      #                             END OF YOUR CODE                          #
      #########################################################################

      # Compute loss and gradients using the current minibatch
      loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
      loss_history.append(loss)

      #########################################################################
      # TODO: Use the gradients in the grads dictionary to update the         #
      # parameters of the network (stored in the dictionary self.params)      #
      # using stochastic gradient descent. You'll need to use the gradients   #
      # stored in the grads dictionary defined above.                         #
      #########################################################################
      self.params['W1'] -= learning_rate * grads['W1']
      self.params['W2'] -= learning_rate * grads['W2']
      self.params['b1'] -= learning_rate * grads['b1']
      self.params['b2'] -= learning_rate * grads['b2']
      #########################################################################
      #                             END OF YOUR CODE                          #
      #########################################################################

      if verbose and it % 100 == 0:
        print('iteration %d / %d: loss %f' % (it, num_iters, loss))

      # Every epoch, check train and val accuracy and decay learning rate.
      if it % iterations_per_epoch == 0:
        # Check accuracy
        train_acc = (self.predict(X_batch) == y_batch).mean()
        val_acc = (self.predict(X_val) == y_val).mean()
        train_acc_history.append(train_acc)
        val_acc_history.append(val_acc)

        # Decay learning rate
        learning_rate *= learning_rate_decay

    return {
      'loss_history': loss_history,
      'train_acc_history': train_acc_history,
      'val_acc_history': val_acc_history,
    }

  def predict(self, X):
    """
    Use the trained weights of this two-layer network to predict labels for
    data points. For each data point we predict scores for each of the C
    classes, and assign each data point to the class with the highest score.

    Inputs:
    - X: A numpy array of shape (N, D) giving N D-dimensional data points to
      classify.

    Returns:
    - y_pred: A numpy array of shape (N,) giving predicted labels for each of
      the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
      to have class c, where 0 <= c < C.
    """
    y_pred = None

    ###########################################################################
    # TODO: Implement this function; it should be VERY simple!                #
    ###########################################################################
    hidden = np.maximum(X.dot(self.params['W1']) + self.params['b1'], 0)
    scores = hidden.dot(self.params['W2']) + self.params['b2']
    y_pred = np.argmax(scores, axis=1)
    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################

    return y_pred
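
To sanity-check the class, here is a minimal usage sketch on random toy data (the sizes and hyperparameters below are arbitrary, chosen only so it runs in a second or two):

import numpy as np

# toy problem: N=100 samples, D=20 features, H=10 hidden units, C=3 classes
np.random.seed(0)
X_train = np.random.randn(100, 20)
y_train = np.random.randint(3, size=100)
X_val = np.random.randn(20, 20)
y_val = np.random.randint(3, size=20)

net = TwoLayerNet(input_size=20, hidden_size=10, output_size=3)
stats = net.train(X_train, y_train, X_val, y_val,
                  learning_rate=1e-1, reg=1e-4,
                  num_iters=200, batch_size=50, verbose=True)
print('final training loss: %f' % stats['loss_history'][-1])
print('validation accuracy: %f' % (net.predict(X_val) == y_val).mean())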

References:

  1. https://zhuanlan.zhihu.com/p/22473137
  2. UFLDL