This chapter is about implementing the simplest possible two-layer neural network, but first a few words on what backpropagation actually means. Andrew Ng's machine learning course on Coursera, which I took last year, never really explains the meaning of error backpropagation, whereas the cs231n course walks through it very thoroughly.
Backpropagation

Error backpropagation is really nothing more than the chain rule applied to the derivative of the loss function with respect to each layer's inputs. The "residual" (delta) we often hear about is simply the derivative of the loss, once it has been propagated back to a given layer, with respect to that layer's scores (the outputs before the activation function). The article I reposted on my CSDN blog explains this especially clearly: http://blog.csdn.net/u014585022/article/details/53503930
During backpropagation, every gate (for example a + or * unit in the figure above) knows exactly the derivative of the loss function with respect to its own output, i.e. dL/dO (this is the residual, although the concept hardly needs its own name). What we actually want, however, is dL/dW, so we apply the chain rule once more: dL/dW = dL/dO × dO/dW. The local term dO/dW is easy to obtain, since the derivatives of the common gates and activation functions all have simple expressions.
Here is an example:



The 0.39 in the figure should really be 0.40.
Chained together, this circuit is actually the following function:

Now that backpropagation has given us dL/dw for w0, w1, and w2, we can run SGD!
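To make this concrete, below is a minimal sketch (not part of the assignment code) of that circuit, f(w, x) = 1/(1 + exp(-(w0*x0 + w1*x1 + w2))), written out gate by gate; the input values are assumed from the cs231n example figure, so treat the exact numbers as illustrative:

import numpy as np

# example inputs assumed from the cs231n figure
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass, one gate at a time
dot = w0*x0 + w1*x1 + w2            # 1.0
f = 1.0 / (1.0 + np.exp(-dot))      # ~0.73

# backward pass: start from dL/df = 1.0, then apply the chain rule gate by gate
ddot = (1.0 - f) * f                         # sigmoid local gradient, ~0.2
dw0, dw1, dw2 = x0*ddot, x1*ddot, 1.0*ddot   # ~-0.2, ~-0.4, ~0.2
dx0, dx1 = w0*ddot, w1*ddot                  # ~0.39 (the 0.39 vs 0.4 above), ~-0.59

# with dL/dw in hand, one SGD step is simply:
step_size = 0.1
w0, w1, w2 = w0 - step_size*dw0, w1 - step_size*dw1, w2 - step_size*dw2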
The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass. This follows from the fact that the local gradient for the add operation is simply +1.0, so the gradients on all inputs will exactly equal the gradients on the output because it will be multiplied by x1.0 (and remain unchanged). In the example circuit above, note that the + gate routed the gradient of 2.00 to both of its inputs, equally and unchanged.
The max gate routes the gradient. Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass). This is because the local gradient for a max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the z variable, which had a higher value than w, and the gradient on w remains zero.
The multiply gate is a little less easy to interpret. Its local gradients are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on x is -8.00, which is -4.00 x 2.00.
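The three rules quoted above can be checked in a few lines; the circuit and its input values below are assumed from the example figure in the notes, i.e. f = 2*(x*y + max(z, w)) with x=3, y=-4, z=2, w=-1:

x, y, z, w = 3.0, -4.0, 2.0, -1.0

# forward pass
q = x * y            # multiply gate: -12
m = max(z, w)        # max gate: 2
s = q + m            # add gate: -10
f = 2.0 * s          # -20

# backward pass, starting from df/df = 1.0
ds = 2.0                       # local gradient of the final *2 stage
dq, dm = ds, ds                # add gate: distributes the gradient unchanged to both inputs
dz = dm if z >= w else 0.0     # max gate: routes the gradient to the larger input (z here)
dw = dm if w > z else 0.0      # ... while w gets zero
dx = y * dq                    # multiply gate: local gradient is the other input, so -4 * 2 = -8
dy = x * dq                    #                                                    3 * 2 = 6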
The backpropagation computation is written out below:

For the output layer (screenshot from UFLDL):


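The screenshot is not reproduced here; for reference, the UFLDL backpropagation equations it shows are roughly the following (UFLDL notation, squared-error loss with activation f), so treat this as a paraphrase rather than the exact figure:

$$\delta^{(n_l)} = -\left(y - a^{(n_l)}\right) \odot f'\!\left(z^{(n_l)}\right)$$
$$\delta^{(l)} = \left(\left(W^{(l)}\right)^{T} \delta^{(l+1)}\right) \odot f'\!\left(z^{(l)}\right)$$
$$\nabla_{W^{(l)}} J = \delta^{(l+1)} \left(a^{(l)}\right)^{T}, \qquad \nabla_{b^{(l)}} J = \delta^{(l+1)}$$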
Two Layer Network
The network used in this exercise has two layers: the hidden layer uses a ReLU activation and the output layer is a softmax. This means the loss function is actually different from the UFLDL figure above.
The network structure is shown in the figure below:

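Concretely, the loss implemented in the code below is the averaged softmax cross-entropy plus L2 regularization on the two weight matrices (with the 0.5 factor the assignment asks for):

$$s_i = \mathrm{ReLU}(x_i W_1 + b_1)\, W_2 + b_2$$
$$L = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{s_{i, y_i}}}{\sum_{j} e^{s_{i, j}}} + \frac{reg}{2}\left(\lVert W_1 \rVert^2 + \lVert W_2 \rVert^2\right)$$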
The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network. The net has an input dimension of
D, a hidden layer dimension of H, and performs classification over C classes.
We train the network with a softmax loss function and L2 regularization on the
weight matrices. The network uses a ReLU nonlinearity after the first fully
connected layer.
In other words, the network has the following architecture:
input - fully connected layer - ReLU - fully connected layer - softmax
The outputs of the second fully-connected layer are the scores for each class.
"""
def __init__(self, input_size, hidden_size, output_size, std=1e-4):
"""
Initialize the model. Weights are initialized to small random values and
biases are initialized to zero. Weights and biases are stored in the
variable self.params, which is a dictionary with the following keys:
W1: First layer weights; has shape (D, H)
b1: First layer biases; has shape (H,)
W2: Second layer weights; has shape (H, C)
b2: Second layer biases; has shape (C,)
Inputs:
- input_size: The dimension D of the input data.
- hidden_size: The number of neurons H in the hidden layer.
- output_size: The number of classes C.
"""
self.params = {}
self.params['W1'] = std * np.random.randn(input_size, hidden_size)
self.params['b1'] = np.zeros(hidden_size)
self.params['W2'] = std * np.random.randn(hidden_size, output_size)
self.params['b2'] = np.zeros(output_size)
def loss(self, X, y=None, reg=0.0):
"""
Compute the loss and gradients for a two layer fully connected neural
network.
Inputs:
- X: Input data of shape (N, D). Each X[i] is a training sample.
- y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
an integer in the range 0 <= y[i] < C. This parameter is optional; if it
is not passed then we only return scores, and if it is passed then we
instead return the loss and gradients.
- reg: Regularization strength.
Returns:
If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
the score for class c on input X[i].
If y is not None, instead return a tuple of:
- loss: Loss (data loss and regularization loss) for this batch of training
samples.
- grads: Dictionary mapping parameter names to gradients of those parameters
with respect to the loss function; has the same keys as self.params.
"""
# Unpack variables from the params dictionary
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
N, D = X.shape
# Compute the forward pass
scores = None
#############################################################################
# TODO: Perform the forward pass, computing the class scores for the input. #
# Store the result in the scores variable, which should be an array of #
# shape (N, C). #
#############################################################################
z1 = X.dot(W1) + b1
a1 = np.maximum(0, z1)
scores = a1.dot(W2) + b2
#############################################################################
# END OF YOUR CODE #
#############################################################################
# If the targets are not given then jump out, we're done
if y is None:
return scores
# Compute the loss
loss = None
#############################################################################
# TODO: Finish the forward pass, and compute the loss. This should include #
# both the data loss and L2 regularization for W1 and W2. Store the result #
# in the variable loss, which should be a scalar. Use the Softmax #
# classifier loss. So that your results match ours, multiply the #
# regularization loss by 0.5 #
#############################################################################
scores -= np.max(scores, axis=1, keepdims=True)  # shift scores for numerical stability
exp_scores = np.exp(scores)
sum_p = np.sum(exp_scores, axis=1, keepdims=True)
p = exp_scores / sum_p  # class probabilities, shape (N, C)
loss = np.mean(-np.log(p[np.arange(N), y]))  # average cross-entropy data loss
loss += 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))  # L2 regularization loss
#############################################################################
# END OF YOUR CODE #
#############################################################################
# Backward pass: compute gradients
grads = {}
#############################################################################
# TODO: Compute the backward pass, computing the derivatives of the weights #
# and biases. Store the results in the grads dictionary. For example, #
# grads['W1'] should store the gradient on W1, and be a matrix of same size #
#############################################################################
# compute gradient on scores
num_classes = W2.shape[1]
binary = np.zeros((N, num_classes))
binary[range(binary.shape[0]), y] = 1
dscores = p - binary
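# note: d(loss_i)/d(score[i, j]) = p[i, j] - 1{j == y[i]}; the 1/N averaging happens further below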
# W2 and b2
grads['W2'] = np.dot(a1.T, dscores)
grads['b2'] = np.sum(dscores, axis=0)
# backprop into the hidden layer activations, i.e. da1
dhidden = np.dot(dscores, W2.T)
# gradient of the ReLU: 1 where the pre-activation z1 was positive, 0 elsewhere
da1z1 = (z1 > 0).astype(float)
# compute dW1, db1, where dL/dW1 = dz1/dW1 * (dL/da1 * da1/dz1)
dz1 = dhidden * da1z1  # shape (N, H)
grads['W1'] = np.dot(X.T, dz1)
grads['b1'] = np.sum(dz1, axis=0)
# average over the batch and add the regularization gradient
grads['W1'] /= N
grads['W1'] += reg*W1
grads['W2'] /= N
grads['W2'] += reg*W2
grads['b1'] /= N
grads['b2'] /= N
#############################################################################
# END OF YOUR CODE #
#############################################################################
return loss, grads
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, learning_rate_decay=0.95,
reg=1e-5, num_iters=100,
batch_size=200, verbose=False):
"""
Train this neural network using stochastic gradient descent.
Inputs:
- X: A numpy array of shape (N, D) giving training data.
- y: A numpy array of shape (N,) giving training labels; y[i] = c means that
X[i] has label c, where 0 <= c < C.
- X_val: A numpy array of shape (N_val, D) giving validation data.
- y_val: A numpy array of shape (N_val,) giving validation labels.
- learning_rate: Scalar giving learning rate for optimization.
- learning_rate_decay: Scalar giving factor used to decay the learning rate
after each epoch.
- reg: Scalar giving regularization strength.
- num_iters: Number of steps to take when optimizing.
- batch_size: Number of training examples to use per step.
- verbose: boolean; if true print progress during optimization.
"""
num_train = X.shape[0]
iterations_per_epoch = max(num_train / batch_size, 1)
# Use SGD to optimize the parameters in self.model
loss_history = []
train_acc_history = []
val_acc_history = []
for it in xrange(num_iters):
X_batch = None
y_batch = None
#########################################################################
# TODO: Create a random minibatch of training data and labels, storing #
# them in X_batch and y_batch respectively. #
#########################################################################
num_random = np.random.choice(np.arange(num_train), batch_size)
X_batch = X[num_random, :]
y_batch = y[num_random]
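# note: np.random.choice samples indices with replacement by default, which is acceptable for SGD minibatches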
#########################################################################
# END OF YOUR CODE #
#########################################################################
# Compute loss and gradients using the current minibatch
loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
loss_history.append(loss)
#########################################################################
# TODO: Use the gradients in the grads dictionary to update the #
# parameters of the network (stored in the dictionary self.params) #
# using stochastic gradient descent. You'll need to use the gradients #
# stored in the grads dictionary defined above. #
#########################################################################
self.params['W1'] -= learning_rate * grads['W1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b1'] -= learning_rate * grads['b1']
self.params['b2'] -= learning_rate * grads['b2']
#########################################################################
# END OF YOUR CODE #
#########################################################################
if verbose and it % 100 == 0:
print 'iteration %d / %d: loss %f' % (it, num_iters, loss)
# Every epoch, check train and val accuracy and decay learning rate.
if it % iterations_per_epoch == 0:
# Check accuracy
train_acc = (self.predict(X_batch) == y_batch).mean()
val_acc = (self.predict(X_val) == y_val).mean()
train_acc_history.append(train_acc)
val_acc_history.append(val_acc)
# Decay learning rate
learning_rate *= learning_rate_decay
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history,
}
def predict(self, X):
"""
Use the trained weights of this two-layer network to predict labels for
data points. For each data point we predict scores for each of the C
classes, and assign each data point to the class with the highest score.
Inputs:
- X: A numpy array of shape (N, D) giving N D-dimensional data points to
classify.
Returns:
- y_pred: A numpy array of shape (N,) giving predicted labels for each of
the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
to have class c, where 0 <= c < C.
"""
y_pred = None
###########################################################################
# TODO: Implement this function; it should be VERY simple! #
###########################################################################
hidden = np.maximum(X.dot(self.params['W1']) + self.params['b1'], 0)
scores = hidden.dot(self.params['W2']) + self.params['b2']
y_pred = np.argmax(scores, axis=1)
###########################################################################
# END OF YOUR CODE #
###########################################################################
return y_pred
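As a quick sanity check, the class above can be exercised on random toy data along these lines (the sizes and hyperparameters here are made up purely for illustration):

import numpy as np

np.random.seed(0)
N, D, H, C = 100, 4, 10, 3          # toy sizes, chosen arbitrarily
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

net = TwoLayerNet(input_size=D, hidden_size=H, output_size=C, std=1e-1)
stats = net.train(X, y, X, y,        # the toy set doubles as "validation" data here
                  learning_rate=1e-1, reg=1e-5,
                  num_iters=200, batch_size=50, verbose=False)

print('final training loss: %f' % stats['loss_history'][-1])
print('training accuracy:   %f' % (net.predict(X) == y).mean())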
References: