Model Representation I
To build a neural network model, we take inspiration from the nervous system of the brain. Each neuron can be viewed as a processing unit, or nucleus, with many dendrites that receive input and an axon that sends output; neurons communicate by passing electrical impulses along these connections.

A neural network model is made up of individual "neurons", each of which is itself a small learning model; we call these "neurons" activation units.

The parameters θ are also referred to as weights in a neural network; the hypothesis is h_θ(x) = g(z), where z = θᵀx, and the newly added input x_0 is called the bias unit.
In a neural network model (taking a three-layer network as an example), the first layer is the input layer, the last layer is the output layer, and the layer in between is called the hidden layer.

We introduce the following notation to describe a neural network model:
- a_i^(j): the i-th activation unit in layer j;
- Θ^(j): the matrix of weights mapping from layer j to layer j+1.
Note: if layer j has s_j activation units and layer j+1 has s_(j+1) activation units, then the weight matrix Θ^(j) has dimension s_(j+1) × (s_j + 1). The weight matrix Θ^(1) in the figure above therefore has dimension 3×4.
For the neural network model shown in the figure above (three input features, three hidden units), we can write:
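$$
\begin{aligned}
a_1^{(2)} &= g\left(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3\right) \\
a_2^{(2)} &= g\left(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3\right) \\
a_3^{(2)} &= g\left(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3\right) \\
h_\Theta(x) &= a_1^{(3)} = g\left(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}\right)
\end{aligned}
$$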

In logistic regression we are restricted to the raw features x in the dataset; although we can combine these features into polynomial terms, we are still limited by the original features x.
In a neural network, the raw features x make up only the input layer; the prediction produced by the output layer is based on the hidden layer's features. We can therefore regard the hidden-layer features as new features learned by the network, and it is these learned features, rather than the raw features x, that are used to produce the prediction.
Supplementary Notes
Model Representation I

Visually, a simplistic representation looks like:
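With, say, three input features:

$$
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}
\rightarrow
\left[\ \right]
\rightarrow
h_\Theta(x)
$$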

Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".
We can have intermediate layers of nodes between the input and output layers called the "hidden layers."
In this example, we label these intermediate or "hidden" layer nodes a_0^(2) ⋯ a_n^(2) and call them "activation units."

If we had one hidden layer, it would look like:
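$$
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
\rightarrow
\begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}
\rightarrow
h_\Theta(x)
$$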

The values for each of the "activation" nodes are obtained as follows:
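$$
\begin{aligned}
a_1^{(2)} &= g\left(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3\right) \\
a_2^{(2)} &= g\left(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3\right) \\
a_3^{(2)} &= g\left(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3\right) \\
h_\Theta(x) &= a_1^{(3)} = g\left(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}\right)
\end{aligned}
$$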

This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix Θ(2) containing the weights for our second layer of nodes.
Each layer gets its own matrix of weights, Θ(j).
The dimensions of these matrices of weights are determined as follows:
If a network has s_j units in layer j and s_(j+1) units in layer j+1, then Θ^(j) will be of dimension s_(j+1) × (s_j + 1).
The +1 comes from the addition in Θ^(j) of the "bias nodes," x_0 and Θ_0^(j). In other words, the output nodes will not include the bias nodes, while the inputs will. The following image summarizes our model representation:

Model Representation II

Taking this figure as an example, we gave its mathematical expressions earlier. To make coding and computation more convenient, we now vectorize them.

where:
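$$
x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad
z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}
$$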

We can therefore rewrite the earlier expressions as:
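$$
z^{(2)} = \Theta^{(1)}x, \qquad a^{(2)} = g\left(z^{(2)}\right)
$$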

Writing the vector x as a^(1), we have z^(2) = Θ^(1)a^(1), and therefore a^(2) = g(z^(2)).
The hypothesis h_θ(x) can then be rewritten as:
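$$
h_\Theta(x) = a^{(3)} = g\left(z^{(3)}\right)
$$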

where:
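$$
z^{(3)} = \Theta^{(2)}a^{(2)}
$$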

Supplementary Notes
Model Representation II
To re-iterate, the following is an example of a neural network:

In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable z_k^(j) that encompasses the parameters inside our g function. In our previous example, if we replace the parameter expressions with the variable z, we get:
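$$
\begin{aligned}
a_1^{(2)} &= g\left(z_1^{(2)}\right) \\
a_2^{(2)} &= g\left(z_2^{(2)}\right) \\
a_3^{(2)} &= g\left(z_3^{(2)}\right)
\end{aligned}
$$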

In other words, for layer j=2 and node k, the variable z will be:
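$$
z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n
$$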

The vector representation of x and z^(j) is:
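$$
x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad
z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)} \end{bmatrix}
$$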

Setting x = a^(1), we can rewrite the equation as:
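$$
z^{(j)} = \Theta^{(j-1)}a^{(j-1)}
$$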

We are multiplying our matrix Θ^(j−1) with dimensions s_j × (n+1) (where s_j is the number of our activation nodes) by our vector a^(j−1) with height (n+1). This gives us our vector z^(j) with height s_j. Now we can get a vector of our activation nodes for layer j as follows:
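$$
a^{(j)} = g\left(z^{(j)}\right)
$$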

Where our function g can be applied element-wise to our vector z(j).
We can then add a bias unit a_0^(j) (equal to 1) to layer j after we have computed a^(j). To compute our final hypothesis, let's first compute another z vector:
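$$
z^{(j+1)} = \Theta^{(j)}a^{(j)}
$$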

We get this final z vector by multiplying the next theta matrix after Θ^(j−1) with the values of all the activation nodes we just got. This last theta matrix Θ^(j) will have only one row, which is multiplied by one column a^(j), so that our result is a single number. We then get our final result with:
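$$
h_\Theta(x) = a^{(j+1)} = g\left(z^{(j+1)}\right)
$$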

Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
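To make the vectorized forward pass concrete, here is a minimal NumPy sketch. It assumes a 3-input, 3-hidden-unit, single-output network with the logistic activation; the helper names (`sigmoid`, `forward`) and the weight values are illustrative only:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Vectorized forward pass: prepend the bias, then z^(j+1) = Theta^(j) a^(j)."""
    a = x                               # a^(1) = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)        # add the bias unit a_0 = 1
        z = theta @ a                   # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                  # a^(j+1) = g(z^(j+1))
    return a

# Illustrative weights: Theta^(1) is 3x4 (layer 1 -> 2), Theta^(2) is 1x4 (layer 2 -> 3).
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
x = np.array([0.5, -1.2, 2.0])          # raw features x_1..x_3 (x_0 is added inside forward)
print(forward(x, [Theta1, Theta2]))     # h_Theta(x): a single activation value
```

Each loop iteration performs exactly the z^(j+1) = Θ^(j)a^(j), a^(j+1) = g(z^(j+1)) step described above, with the bias unit prepended before each matrix-vector multiplication.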