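Batch Normalization Paper Translation: Chinese-English Parallel Text

Author: Tyan
Blog: noahsnail.com | CSDN | Jianshu

Disclaimer: The author translated this paper solely for study purposes. If it infringes any rights, please contact the author to have the post removed. Thank you!

Collection of translated papers: https://github.com/SnailTyan/deep-learning-papers-translation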

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1. Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters $\Theta$ of the network, so as to minimize the loss

$$\Theta = \arg \min_\Theta \frac{1}{N}\sum_{i=1}^N \ell(x_i, \Theta)$$

where $x_{1\ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $x_{1\ldots m}$ of size $m$. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing $\frac {1} {m} \sum _{i=1} ^m \frac {\partial \ell(x_i, \Theta)} {\partial \Theta}$. Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than $m$ computations for individual examples, due to the parallelism afforded by the modern computing platforms.
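
As a rough illustration of the mini-batch gradient estimate, the sketch below uses a hypothetical one-parameter squared loss $\ell(x_i,\Theta)=(\Theta-x_i)^2$ and a synthetic data set. Neither comes from the paper; they are assumptions chosen only to keep the example small. It computes $\frac{1}{m}\sum_i \partial\ell(x_i,\Theta)/\partial\Theta$ for two batch sizes alongside the full-data gradient.

```python
import random

def grad_loss(x_i, theta):
    """Gradient of the toy loss (theta - x_i)**2 with respect to theta."""
    return 2.0 * (theta - x_i)

def minibatch_gradient(batch, theta):
    """(1/m) * sum_i d loss(x_i, theta) / d theta over a mini-batch."""
    return sum(grad_loss(x, theta) for x in batch) / len(batch)

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(10_000)]  # synthetic training set
theta = 0.0

full_grad = minibatch_gradient(data, theta)                 # all N examples
small = minibatch_gradient(random.sample(data, 8), theta)   # m = 8
large = minibatch_gradient(random.sample(data, 512), theta) # m = 512
print(full_grad, small, large)
```

Running it with different seeds gives a feel for how much the estimate fluctuates at each batch size.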

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing $$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$ where $F_1$ and $F_2$ are arbitrary transformations, and the parameters $\Theta_1, \Theta_2$ are to be learned so as to minimize the loss $\ell$. Learning $\Theta_2$ can be viewed as if the inputs $x=F_1(u,\Theta_1)$ are fed into the sub-network $$\ell = F_2(x, \Theta_2).$$

For example, a gradient descent step $$\Theta_2\leftarrow \Theta_2 - \frac {\alpha} {m} \sum_{i=1}^m \frac {\partial F_2(x_i,\Theta_2)} {\partial \Theta_2}$$ (for batch size $m$ and learning rate $\alpha$) is exactly equivalent to that for a stand-alone network $F_2$ with input $x$. Therefore, the input distribution properties that make training more efficient, such as having the same distribution between the training and test data, apply to training the sub-network as well. As such it is advantageous for the distribution of $x$ to remain fixed over time. Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of $x$.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $z = g(Wu+b)$ where $u$ is the layer input, the weight matrix $W$ and bias vector $b$ are the layer parameters to be learned, and $g(x) = \frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $x=Wu+b$ except those with small absolute values, the gradient flowing down to $u$ will vanish and the model will train slowly. However, since $x$ is affected by $W, b$ and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of $x$ into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010) $ReLU(x)=\max(x,0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
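
The saturation effect can be seen with a few lines of arithmetic. The sketch below evaluates the sigmoid derivative $g'(x) = g(x)(1-g(x))$ at a handful of arbitrarily chosen inputs (the sample points are an assumption for illustration).

```python
import math

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """g'(x) = g(x) * (1 - g(x)); largest at x = 0, shrinking as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
```

The derivative peaks at 0.25 for $x=0$ and is already below $10^{-4}$ at $|x|=10$, which is why inputs that drift into this regime pass almost no gradient down to $u$.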

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.

2. Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs $x$ as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input $u$ that adds the learned bias $b$, and normalizes the result by subtracting the mean of the activation computed over the training data: $\hat x=x - E[x]$ where $x = u+b$, $\cal X=\lbrace x_{1\ldots N} \rbrace$ is the set of values of $x$ over the training set, and $E[x] = \frac{1}{N}\sum_{i=1}^N x_i$. If a gradient descent step ignores the dependence of $E[x]$ on $b$, then it will update $b\leftarrow b+\Delta b$, where $\Delta b\propto -\partial{\ell}/\partial{\hat x}$. Then $u+(b+\Delta b) -E[u+(b+\Delta b)] = u+b-E[u+b]$. Thus, the combination of the update to $b$ and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, $b$ will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
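
The cancellation in the bias example is easy to reproduce. The following sketch, with hypothetical input values and a hypothetical update size, centers $x = u + b$ over the whole set before and after a large change to $b$ and shows that the layer output is unchanged.

```python
def centered(u_values, b):
    """x_hat = x - E[x] with x = u + b, mean taken over the whole set."""
    x = [u + b for u in u_values]
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]

u_values = [0.5, -1.0, 2.0, 3.5]       # hypothetical layer inputs
b = 0.1
before = centered(u_values, b)
after = centered(u_values, b + 10.0)   # a large update to b

# The outputs agree up to floating-point rounding, so the loss cannot change.
print(max(abs(p - q) for p, q in zip(before, after)))
```

Because the output does not move, an optimizer that ignores the dependence of $E[x]$ on $b$ keeps receiving the same gradient signal and keeps pushing $b$ in the same direction.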

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters $\Theta$. Let again $x$ be a layer input, treated as a vector, and $\cal X$ be the set of these inputs over the training data set. The normalization can then be written as a transformation $$\hat x=Norm(x, \cal X)$$ which depends not only on the given training example $x$ but on all examples $\cal X$ -- each of which depends on $\Theta$ if $x$ is generated by another layer. For backpropagation, we would need to compute the Jacobians $\frac {\partial Norm(x,\cal X)} {\partial x}$ and $\frac {\partial Norm(x,\cal X)} {\partial \cal X}$; ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix $Cov[x]=E_{x\in \cal X}[x x^T]- E[x]E[x]^T$ and its inverse square root, to produce the whitened activations $Cov[x]^{-1/2}(x-E[x])$, as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3. Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and unit variance. For a layer with $d$-dimensional input $x = (x^{(1)}\ldots x^{(d)})$, we will normalize each dimension $$\hat x^{(k)} = \frac{x^{(k)} - E[x^{(k)}]} {\sqrt {Var[x^{(k)}]}}$$ where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}, \beta^{(k)}$, which scale and shift the normalized value: $$y^{(k)} = \gamma^{(k)}\hat x^{(k)} + \beta^{(k)}.$$ These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.
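
To make the identity-recovery claim concrete, the sketch below normalizes a hypothetical list of activations for one feature and then applies $y = \gamma \hat x + \beta$ with $\gamma = \sqrt{Var[x]}$ and $\beta = E[x]$. For simplicity it omits $\epsilon$ and uses the biased ($1/m$) variance; both are assumptions of this example.

```python
import math

def normalize(values):
    """Return (x_hat, mean, variance) using the biased (1/m) variance."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    x_hat = [(v - mean) / math.sqrt(var) for v in values]
    return x_hat, mean, var

x = [1.0, 4.0, 2.5, 7.0]                # hypothetical activations of one feature
x_hat, mean, var = normalize(x)
gamma, beta = math.sqrt(var), mean      # the setting named in the text
y = [gamma * v + beta for v in x_hat]   # y = gamma * x_hat + beta
print(y)
```

Up to rounding, `y` matches `x`, so the scale and shift parameters let the network undo the normalization if that turns out to be optimal.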

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

Consider a mini-batch $\cal B$ of size $m$. Since the normalization is applied to each activation independently, let us focus on a particular activation $x^{(k)}$ and omit $k$ for clarity. We have $m$ values of this activation in the mini-batch, $$\cal B=\lbrace x_{1\ldots m} \rbrace.$$ Let the normalized values be $\hat x_{1\ldots m}$, and their linear transformations be $y_{1\ldots m}$. We refer to the transform $$BN_{\gamma,\beta}: x_{1\ldots m}\rightarrow y_{1\ldots m}$$ as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, $\epsilon$ is a constant added to the mini-batch variance for numerical stability.

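Algorithm 1: Batch Normalizing Transform, applied to activation $x$ over a mini-batch.

Input: values of $x$ over a mini-batch, $\cal B=\lbrace x_{1\ldots m} \rbrace$; parameters to be learned, $\gamma$ and $\beta$.

Output: $\lbrace y_i = BN_{\gamma,\beta}(x_i) \rbrace$.

$$
\begin{align}
&\mu_{\cal B} \leftarrow \frac {1} {m} \sum_{i=1}^m x_i\\
&\sigma_{\cal B}^2 \leftarrow \frac {1} {m} \sum_{i=1}^m (x_i - \mu_{\cal B})^2\\
&\hat x_i \leftarrow \frac {x_i - \mu_{\cal B}} {\sqrt {\sigma_{\cal B}^2 + \epsilon}}\\
&y_i \leftarrow \gamma \hat x_i + \beta \equiv BN_{\gamma,\beta}(x_i)
\end{align}
$$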
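
A minimal Python sketch of the BN transform for a single scalar activation follows. It computes the mini-batch mean and variance, normalizes, then scales and shifts. The function name, the returned tuple, and the default `eps=1e-5` are assumptions made for illustration, not values fixed by the text.

```python
import math

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for one scalar activation over a mini-batch.

    x: list of m values of the activation; gamma, beta: learned scale and shift.
    Returns (y, x_hat, mu, var) so the intermediate values can be inspected.
    """
    m = len(x)
    mu = sum(x) / m                                # mini-batch mean
    var = sum((xi - mu) ** 2 for xi in x) / m      # mini-batch variance (1/m)
    inv_std = 1.0 / math.sqrt(var + eps)           # eps guards against var == 0
    x_hat = [(xi - mu) * inv_std for xi in x]      # normalize
    y = [gamma * xh + beta for xh in x_hat]        # scale and shift
    return y, x_hat, mu, var

y, x_hat, mu, var = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(mu, var)
print(sum(x_hat) / 4, sum(v * v for v in x_hat) / 4)  # close to 0 and 1
```

The second printed line checks the property used in the surrounding discussion: neglecting `eps`, the normalized values have mean 0 and mean square 1 within each mini-batch.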

The BN transform can be added to a network to manipulate any activation. In the notation $y = BN_{\gamma,\beta}(x)$, we indicate that the parameters $\gamma$ and $\beta$ are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, $BN_{\gamma,\beta}(x)$ depends both on the training example and the other examples in the mini-batch. The scaled and shifted values $y$ are passed to other network layers. The normalized activations $\hat x$ are internal to our transformation, but their presence is crucial. The distribution of values of any $\hat x$ has the expected value of $0$ and the variance of $1$, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect $\epsilon$. This can be seen by observing that $\sum_{i=1}^m \hat x_i = 0$ and $\frac {1} {m} \sum_{i=1}^m \hat x_i^2 = 1$, and taking expectations. Each normalized activation $\hat x^{(k)}$ can be viewed as an input to a sub-network composed of the linear transform $y^{(k)}=\gamma^{(k)}\hat x^{(k)}+\beta^{(k)}$, followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized $\hat x^{(k)}$ can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, the network as a whole.

During training we need to backpropagate the gradient of loss $\ell$ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use chain rule, as follows (before simplification):

$$
\begin{align}
&\frac {\partial \ell}{\partial \hat x_i} = \frac {\partial \ell} {\partial y_i} \cdot \gamma\\
&\frac {\partial \ell}{\partial \sigma_{\cal B}^2} = \sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot(x_i-\mu_{\cal B})\cdot \frac {-1}{2}(\sigma_{\cal B}^2+\epsilon)^{-3/2}\\
&\frac {\partial \ell}{\partial \mu_{\cal B}} = \left(\sum_{i=1}^m \frac {\partial \ell}{\partial \hat x_i}\cdot \frac {-1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}}\right) + \frac {\partial \ell}{\partial \sigma_{\cal B}^2} \cdot \frac {\sum_{i=1}^m -2(x_i-\mu_{\cal B})} {m}\\
&\frac {\partial \ell}{\partial x_i} = \frac {\partial \ell}{\partial \hat x_i} \cdot \frac {1} {\sqrt {\sigma_{\cal B}^2 + \epsilon}} + \frac {\partial \ell}{\partial \sigma_{\cal B}^2} \cdot \frac {2(x_i - \mu_{\cal B})} {m} + \frac {\partial \ell} {\partial \mu_{\cal B}} \cdot \frac {1} {m}\\
&\frac {\partial \ell}{\partial \gamma} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i} \cdot \hat x_i \\
&\frac {\partial \ell}{\partial \beta} = \sum_{i=1}^m \frac {\partial \ell}{\partial y_i}
\end{align}
$$

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.
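
As a sanity check on differentiability, the sketch below implements the chain-rule gradients of the BN transform for one scalar activation and compares $\partial\ell/\partial x_i$ with a central finite-difference estimate. The toy loss $\ell = \sum_i w_i y_i$, the weights `w`, and all numeric values are hypothetical choices for this example.

```python
import math

def bn_forward(x, gamma, beta, eps):
    """Return (y, x_hat, mu, var) for one activation over a mini-batch."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    x_hat = [(xi - mu) / math.sqrt(var + eps) for xi in x]
    return [gamma * xh + beta for xh in x_hat], x_hat, mu, var

def bn_backward(dy, x, gamma, eps):
    """Chain-rule gradients; dy[i] is d loss / d y_i."""
    m = len(x)
    _, x_hat, mu, var = bn_forward(x, gamma, 0.0, eps)
    std = math.sqrt(var + eps)
    dx_hat = [g * gamma for g in dy]
    dvar = sum(d * (xi - mu) for d, xi in zip(dx_hat, x)) * -0.5 * (var + eps) ** -1.5
    dmu = -sum(dx_hat) / std + dvar * sum(-2.0 * (xi - mu) for xi in x) / m
    dx = [d / std + dvar * 2.0 * (xi - mu) / m + dmu / m
          for d, xi in zip(dx_hat, x)]
    dgamma = sum(g * xh for g, xh in zip(dy, x_hat))
    dbeta = sum(dy)
    return dx, dgamma, dbeta

x, w = [0.3, -1.2, 2.0, 0.7], [1.0, -2.0, 0.5, 3.0]
gamma, beta, eps = 1.5, 0.2, 1e-5

def loss(xs):
    """Toy loss: weighted sum of the BN outputs."""
    return sum(wi * yi for wi, yi in zip(w, bn_forward(xs, gamma, beta, eps)[0]))

dx, dgamma, dbeta = bn_backward(w, x, gamma, eps)  # d loss / d y_i = w_i
h = 1e-6
numeric = [(loss(x[:i] + [x[i] + h] + x[i + 1:])
            - loss(x[:i] + [x[i] - h] + x[i + 1:])) / (2 * h)
           for i in range(len(x))]
print(max(abs(a - n) for a, n in zip(dx, numeric)))  # should be very small
```

Agreement between the analytic and numeric gradients suggests that the statistics $\mu_{\cal B}$ and $\sigma_{\cal B}^2$ are correctly treated as functions of the inputs rather than as constants.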

3.1. Training and Inference with Batch-Normalized Networks

To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg.1. Any layer that previously received $x$ as the input, now receives $BN(x)$. A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size $m>1$, or with any of its variants such as Adagrad (Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization $$\hat x=\frac {x - E[x]} {\sqrt{Var[x] + \epsilon}}$$ using the population, rather than mini-batch, statistics. Neglecting $\epsilon$, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $Var[x] = \frac {m} {m-1} \cdot E_\cal B[\sigma_\cal B^2]$, where the expectation is over training mini-batches of size $m$ and $\sigma_\cal B^2$ are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by $\gamma$ and shift by $\beta$, to yield a single linear transform that replaces $BN(x)$. Algorithm 2 summarizes the procedure for training batch-normalized networks.

Algorithm 2: Training a Batch-Normalized Network.
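
The inference-time computation described above can be sketched as follows. The helper names, the toy mini-batches, and the default `eps=1e-5` are assumptions for illustration. The code averages per-batch means and variances, applies the $\frac{m}{m-1}$ correction, and folds normalization, scale, and shift into a single transform $y = a x + c$.

```python
import math

def population_stats(batches):
    """Estimate E[x] and Var[x] from mini-batches of equal size m."""
    m = len(batches[0])
    means, variances = [], []
    for batch in batches:
        mu = sum(batch) / m
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in batch) / m)
    mean = sum(means) / len(means)                        # E[x] = E_B[mu_B]
    var = m / (m - 1) * sum(variances) / len(variances)   # unbiased estimate
    return mean, var

def inference_affine(mean, var, gamma, beta, eps=1e-5):
    """Fold normalization, scale and shift into one transform y = a * x + c."""
    a = gamma / math.sqrt(var + eps)
    c = beta - a * mean
    return a, c

batches = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]  # hypothetical activations
mean, var = population_stats(batches)
a, c = inference_affine(mean, var, gamma=2.0, beta=0.5)
print(mean, var, a * 3.0 + c)
```

Since `a` and `c` are constants once training ends, the output for a given input no longer depends on which other examples happen to share its batch.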

3.2. Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity: $$z = g(Wu+b)$$ where $W$ and $b$ are learned parameters of the model, and $g(\cdot)$ is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing $x=Wu+b$. We could have also normalized the layer inputs $u$, but since $u$ is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, $Wu+b$ is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

Note that, since we normalize $Wu+b$, the bias $b$ can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by $\beta$ in Alg.1). Thus, $z = g(Wu+b)$ is replaced with $$z = g(BN(Wu))$$ where the BN transform is applied independently to each dimension of $x=Wu$, with a separate pair of learned parameters $\gamma^{(k)}$, $\beta^{(k)}$ per dimension.
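
The redundancy of the bias can be checked directly. In the sketch below, the values standing in for one dimension of $Wu$ and the bias are hypothetical, and the helper `batch_norm` with its default `eps` is an assumption. Applying BN with and without the bias gives the same result up to rounding.

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """BN transform for one scalar activation over a mini-batch."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    return [gamma * (xi - mu) / math.sqrt(var + eps) + beta for xi in x]

wu = [0.4, -1.3, 2.2, 0.9]     # hypothetical values of one dimension of Wu
b = 5.0                        # bias that would be added before BN
with_bias = batch_norm([v + b for v in wu])
without_bias = batch_norm(wu)
print(max(abs(p - q) for p, q in zip(with_bias, without_bias)))
```

Adding a constant shifts the mini-batch mean by the same constant and leaves the variance untouched, so the mean subtraction removes it. Any offset the layer needs is supplied by $\beta$ instead.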

注意,由于我們對$Wu+b$進行標準化,偏置$b$可以忽略,因為它的效應將會被后面的中心化取消(偏置的作用會歸入到算法1的$\beta$)。因此,$z = g(Wu+b)$被$$z = g(BN(Wu))$$替代,其中BN變換獨立地應用到$x=Wu$的每一維,每一維具有單獨的成對學習參數$\gamma^{(k)}$,$\beta^{(k)}$。
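The cancellation of the bias can be checked numerically. The sketch below is an illustration for a single scalar activation; the helper `bn` and the values of `w`, `b`, and `u` are hypothetical.

```python
import math

def bn(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one activation over a mini-batch (training-mode statistics)."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]

w, b = 0.7, 3.0                 # hypothetical weight and bias
u = [0.5, -1.2, 2.0, 0.3]       # hypothetical layer inputs in one mini-batch

with_bias = bn([w * x + b for x in u])
without_bias = bn([w * x for x in u])

# The mean subtraction removes b, so both results agree up to rounding.
max_diff = max(abs(p - q) for p, q in zip(with_bias, without_bias))
```

Any constant offset that is still wanted after normalization is supplied by the learned shift `beta`.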

For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg.1, we let $\mathcal B$ be the set of all values in a feature map across both the elements of a mini-batch and spatial locations, so for a mini-batch of size $m$ and feature maps of size $p\times q$, we use the effective mini-batch of size $m'=|\mathcal B| = m\cdot p\, q$. We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation. Alg.2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

另外,對于卷積層我們希望標準化遵循卷積特性,為的是同一特征映射的不同元素,在不同的位置,以相同的方式進行標準化。為了實現這個,我們在所有位置聯合標準化了小批量數據中的所有激活。在算法1中,我們讓$\mathcal B$是跨越小批量數據的所有元素和空間位置的特征圖中所有值的集合,因此對于大小為$m$的小批量數據和大小為$p\times q$的特征映射,我們使用有效的大小為$m'=|\mathcal B| = m\cdot p\, q$的小批量數據。我們每個特征映射學習一對參數$\gamma^{(k)}$和$\beta^{(k)}$,而不是每個激活。算法2進行類似的修改,以便推斷期間BN變換對在給定的特征映射上的每一個激活應用同樣的線性變換。
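The per-feature-map normalization can be sketched as follows. This is an illustration using nested Python lists rather than a tensor library; the layout `x[n][k][i][j]` (example, feature map, row, column), the function name `conv_batch_norm`, and the sample values are assumptions made for the example.

```python
import math

def conv_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature map k jointly over the mini-batch and all locations."""
    m, K = len(x), len(x[0])
    p, q = len(x[0][0]), len(x[0][0][0])
    out = [[[[0.0] * q for _ in range(p)] for _ in range(K)] for _ in range(m)]
    for k in range(K):
        # Effective mini-batch: m * p * q values for feature map k.
        vals = [x[n][k][i][j] for n in range(m) for i in range(p) for j in range(q)]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        for n in range(m):
            for i in range(p):
                for j in range(q):
                    # One (gamma, beta) pair per feature map, shared by all locations.
                    out[n][k][i][j] = gamma[k] * (x[n][k][i][j] - mu) * inv_std + beta[k]
    return out

# Hypothetical input: m = 2 examples, K = 2 feature maps of size 2 x 2.
x = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[50.0, 60.0], [70.0, 80.0]]],
]
y = conv_batch_norm(x, gamma=[1.0, 2.0], beta=[0.0, 1.0])
```

Here each feature map is normalized with statistics from 2 * 2 * 2 = 8 values, matching $m' = m \cdot p\, q$.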

3.3. Batch Normalization enables higher learning rates

In traditional deep networks, too high a learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes in layer parameters from amplifying as the data propagates through a deep network. For example, this enables the sigmoid nonlinearities to more easily stay in their non-saturated regimes, which is crucial for training deep sigmoid networks but has traditionally been hard to accomplish.

3.3. 批標準化可以提高學習率

在傳統的深度網絡中,學習率過高可能會導致梯度爆炸或梯度消失,以及陷入差的局部最小值。批標準化有助于解決這些問題。通過標準化整個網絡的激活值,在數據通過深度網絡傳播時,它可以防止層參數的微小變化被放大。例如,這使sigmoid非線性更容易保持在它們的非飽和狀態,這對訓練深度sigmoid網絡至關重要,但在傳統上很難實現。

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar $a$, $$BN(Wu) = BN((aW)u)$$ and thus $\frac {\partial BN((aW)u)} {\partial u}= \frac {\partial BN(Wu)} {\partial u}$, so the scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, $\frac {\partial BN((aW)u)} {\partial (aW)}= \frac {1} {a} \cdot \frac {\partial BN(Wu)} {\partial W}$, so larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

批標準化也使訓練對參數的縮放更有彈性。通常,大的學習率可能會增加層參數的縮放,這會在反向傳播中放大梯度并導致模型爆炸。然而,通過批標準化,通過層的反向傳播不受其參數縮放的影響。實際上,對于標量$a$,$$BN(Wu) = BN((aW)u)$$因此$\frac {\partial BN((aW)u)} {\partial u}= \frac {\partial BN(Wu)} {\partial u}$,因此縮放不影響層的雅可比矩陣,從而不影響梯度傳播。此外,$\frac {\partial BN((aW)u)} {\partial (aW)}=\frac {1} {a} \cdot \frac {\partial BN(Wu)} {\partial W}$,因此更大的權重會導致更小的梯度,并且批標準化會穩定參數的增長。
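Both scale identities can be checked numerically for one output unit. The sketch below is an illustration only: it assumes a positive scalar `a`, sets $\epsilon = 0$ so the invariance is exact, and estimates gradients by central differences. The helper names and the values of `U`, `W`, `c`, and `a` are hypothetical.

```python
import math

def bn(values, eps=0.0):
    """Normalize one activation over a mini-batch (eps = 0 for an exact check)."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def layer(W, U):
    """BN(Wu) for one output unit: W is a weight vector, U a mini-batch of inputs."""
    return bn([sum(w * x for w, x in zip(W, u)) for u in U])

def grad_wrt_W(W, U, c, h=1e-6):
    """Central-difference gradient of sum_i c_i * BN(Wu)_i with respect to W."""
    def loss(Wp):
        return sum(ci * yi for ci, yi in zip(c, layer(Wp, U)))
    grads = []
    for k in range(len(W)):
        up = W[:k] + [W[k] + h] + W[k + 1:]
        dn = W[:k] + [W[k] - h] + W[k + 1:]
        grads.append((loss(up) - loss(dn)) / (2 * h))
    return grads

U = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]]  # hypothetical mini-batch
W = [0.8, -0.4]                                          # hypothetical weights
c = [1.0, -2.0, 0.5, 3.0]                                # arbitrary upstream gradient
a = 10.0
aW = [a * w for w in W]

# BN(Wu) == BN((aW)u): the outputs are unchanged by the scale.
same_output = max(abs(p - q) for p, q in zip(layer(W, U), layer(aW, U)))

# The gradient with respect to the scaled weights is 1/a times the original.
g, g_scaled = grad_wrt_W(W, U, c), grad_wrt_W(aW, U, c)
ratio_error = max(abs(gs - gi / a) for gs, gi in zip(g_scaled, g))
```

With `a = 10`, the gradient for the scaled weights is ten times smaller, which is the stabilizing effect on parameter growth described above.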

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: $\hat z = F(\hat x)$. If we assume that $\hat x$ and $\hat z$ are Gaussian and uncorrelated, and that $F(\hat x)\approx J \hat x$ is a linear transformation for the given model parameters, then both $\hat x$ and $\hat z$ have unit covariances, and $I=Cov[\hat z] =J Cov[\hat x] J^T = JJ^T$. Thus, $J$ is orthogonal, which preserves the gradient magnitudes during backpropagation. Although the above assumptions are not true in reality, we expect Batch Normalization to help make gradient propagation better behaved. This remains an area of further study.

我們進一步推測,批標準化可能會導致層的雅可比矩陣的奇異值接近于1,這被認為對訓練是有利的(Saxe et al., 2013)。考慮具有標準化輸入的兩個連續的層,以及這些標準化向量之間的變換:$\hat z = F(\hat x)$。如果我們假設$\hat x$和$\hat z$是高斯分布且不相關的,并且$F(\hat x)\approx J \hat x$是對給定模型參數的一個線性變換,那么$\hat x$和$\hat z$都有單位協方差,并且$I=Cov[\hat z] =J Cov[\hat x] J^T = JJ^T$。因此,$J$是正交的,其保留了反向傳播中的梯度大小。盡管上述假設在現實中不是真實的,但我們希望批標準化有助于梯度傳播更好的執行。這有待于進一步研究。
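The claim that an orthogonal $J$ preserves gradient magnitudes can be illustrated with a small example. The sketch below uses a 2x2 rotation matrix as a stand-in for an orthogonal Jacobian; the angle and the gradient vector are arbitrary illustrative values, not quantities from the paper.

```python
import math

theta = 0.7  # arbitrary angle, for illustration only
J = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# For an orthogonal J, J J^T is the identity matrix.
JJt = matmul(J, transpose(J))

# Backpropagation multiplies the upstream gradient by J^T.
g = [3.0, -4.0]
g_back = [sum(J[i][k] * g[i] for i in range(2)) for k in range(2)]

norm_before = math.hypot(*g)
norm_after = math.hypot(*g_back)
```

Both norms are equal, so a chain of such layers would neither shrink nor amplify the gradient.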

4. Experiments

4.1. Activations over time

To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28x28 binary image as input, and 3 fully-connected hidden layers with 100 activations each. Each hidden layer computes $y = g(Wu+b)$ with sigmoid nonlinearity, and the weights $W$ initialized to small random Gaussian values. The last hidden layer is followed by a fully-connected layer with 10 activations (one per class) and cross-entropy loss. We trained the network for 50000 steps, with 60 examples per mini-batch. We added Batch Normalization to each hidden layer of the network, as in Sec.3.1. We were interested in the comparison between the baseline and batch-normalized networks, rather than achieving the state of the art performance on MNIST (which the described architecture does not).

4. 實驗

4.1. 隨時間激活

為了驗證內部協變量轉移對訓練的影響,以及批標準化對抗它的能力,我們考慮了在MNIST數據集上預測數字類別的問題(LeCun et al., 1998a)。我們使用非常簡單的網絡,28x28的二值圖像作為輸入,以及3個全連接隱藏層,每層100個激活。每一個隱藏層用sigmoid非線性計算$y = g(Wu+b)$,權重$W$初始化為小的隨機高斯值。最后的隱藏層之后是具有10個激活(每類1個)和交叉熵損失的全連接層。我們訓練網絡50000次迭代,每份小批量數據中有60個樣本。如第3.1節所述,我們在網絡的每一個隱藏層中添加批標準化。我們對基準線和批標準化網絡之間的比較感興趣,而不是實現在MNIST上的最佳性能(所描述的架構沒有達到)。
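The size of the network described above can be worked out from the stated layer widths. The sketch below is an arithmetic illustration; it assumes the baseline keeps a bias per unit and that Batch Normalization adds one $\gamma$ and one $\beta$ per hidden activation. The variable names are hypothetical.

```python
# Layer widths from the text: 28x28 input, three hidden layers of 100, 10 outputs.
layer_sizes = [28 * 28, 100, 100, 100, 10]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

baseline_params = sum(dense_params(a, b)
                      for a, b in zip(layer_sizes, layer_sizes[1:]))

# Assumption: one (gamma, beta) pair per activation in each hidden layer.
hidden_activations = sum(layer_sizes[1:-1])
bn_extra_params = 2 * hidden_activations
```

Under these assumptions the baseline has 99,710 parameters and Batch Normalization adds 600 more, well under 1% of the total.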

Figure 1(a) shows the fraction of correct predictions by the two networks on held-out test data, as training progresses. The batch-normalized network enjoys the higher test accuracy. To investigate why, we studied inputs to the sigmoid, in the original network $N$ and batch-normalized network $N_{BN}^{tr}$ (Alg. 2) over the course of training. In Fig. 1(b,c) we show, for one typical activation from the last hidden layer of each network, how its distribution evolves. The distributions in the original network change significantly over time, both in their mean and the variance, which complicates the training of the subsequent layers. In contrast, the distributions in the batch-normalized network are much more stable as training progresses, which aids the training.

Figure 1

Figure 1. (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

圖1(a)顯示了隨著訓練進行,兩個網絡在留出的測試數據上正確預測的比例。批標準化網絡具有更高的測試準確率。為了調查原因,我們在訓練過程中研究了原始網絡$N$和批標準化網絡$N_{BN}^{tr}$(Alg. 2)中的sigmoid輸入。在圖1(b,c)中,我們顯示,對于來自每個網絡的最后一個隱藏層的一個典型的激活,其分布如何演變。原始網絡中的分布隨著時間的推移而發生顯著變化,無論是平均值還是方差,都會使后面的層的訓練複雜化。相比之下,隨著訓練的進行,批標準化網絡中的分布更加穩定,這有助于訓練。

Figure 1

圖1。(a)使用批標準化和不使用批標準化訓練的網絡在MNIST上的測試準確率隨訓練迭代次數的變化。批標準化有助于網絡訓練的更快,取得更高的準確率。(b,c)典型的sigmoid在訓練過程中輸入分布的演變,顯示為第{15, 50, 85}百分位數。批標準化使分布更穩定并降低了內部協變量轉移。
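The {15, 50, 85}th percentile bands in Figure 1(b, c) could be computed from recorded sigmoid inputs roughly as follows. This is a sketch using the nearest-rank percentile definition, which is an assumption since the text does not say which definition was used. The function names and the sample data are hypothetical.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer arithmetic
    return ordered[max(rank, 1) - 1]

def summarize(samples_per_step):
    """One (15th, 50th, 85th) percentile triple per recorded training step."""
    return [tuple(percentile(s, p) for p in (15, 50, 85)) for s in samples_per_step]

# Hypothetical snapshots of one sigmoid's inputs at two training steps.
snapshots = [list(range(1, 101)), [2 * v for v in range(1, 101)]]
bands = summarize(snapshots)
```

Plotting the three values of each triple against the training step gives the kind of band shown in the figure.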

4.2. ImageNet classification

We applied Batch Normalization to a new variant of the Inception network (Szegedy et al., 2014), trained on the ImageNet classification task (Russakovsky et al., 2014). The network has a large number of convolutional and pooling layers, with a softmax layer to predict the image class, out of 1000 possibilities. Convolutional layers use ReLU as the nonlinearity. The main difference to the network described in (Szegedy et al., 2014) is that the 5x5 convolutional layers are replaced by two consecutive layers of 3x3 convolutions with up to 128 filters. The network contains $13.6\cdot10^6$ parameters, and, other than the top softmax layer, has no fully-connected layers. We refer to this model as Inception in the rest of the text. The training was performed on a large-scale, distributed architecture (Dean et al., 2012), using 5 concurrent steps on each of 10 model replicas, using asynchronous SGD with momentum (Sutskever et al.,2013), with the mini-batch size of 32. All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.

4.2. ImageNet分類

我們將批標準化應用于在ImageNet分類任務(Russakovsky等,2014)上訓練的Inception網絡的新變種(Szegedy等,2014)。網絡具有大量的卷積和池化層,和一個softmax層用來在1000個可能之中預測圖像的類別。卷積層使用ReLU作為非線性。與(Szegedy等,2014)中描述的網絡的主要區別是5x5卷積層被兩個連續的3x3卷積層替換,最多可以有128個濾波器。該網絡包含$13.6 \cdot 10^6$個參數,除了頂部的softmax層之外,沒有全連接層。在其余的文本中我們將這個模型稱為Inception。訓練在大型分布式架構(Dean等,2012)上進行,10個模型副本中的每一個都使用了5個并行步驟,使用異步帶動量的SGD(Sutskever等,2013),小批量數據大小為32。隨著訓練進行,所有網絡都通過計算驗證準確率@1來評估,即每幅圖像使用單個裁剪圖像,在留出的數據集上從1000個可能性中預測正確標簽的概率。

In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.

在我們的實驗中,我們評估了幾個帶有批標準化的Inception修改版本。在所有情況下,如第3.2節所述,批標準化以卷積方式應用于每個非線性的輸入,同時保持架構的其余部分不變。

4.2.1. ACCELERATING BN NETWORKS

Simply adding Batch Normalization to a network does not take full advantage of our method. To do so, we applied the following modifications:

4.2.1. 加速BN網絡

將批標準化簡單添加到網絡中不能充分利用我們方法的優勢。為此,我們進行了以下修改:

Increase learning rate. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects (Sec. 3.3).

提高學習率。在批標準化模型中,我們已經能夠從高學習率中實現訓練加速,沒有不良的副作用(第3.3節)。

Remove Dropout. We have found that removing Dropout from BN-Inception allows the network to achieve higher validation accuracy. We conjecture that Batch Normalization provides similar regularization benefits as Dropout, since the activations observed for a training example are affected by the random selection of examples in the same mini-batch.

刪除丟棄。我們發現從BN-Inception中刪除丟棄可以使網絡實現更高的驗證準確率。我們推測,批標準化提供了類似丟棄的正則化收益,因為對于訓練樣本觀察到的激活受到了同一小批量數據中樣本隨機選擇的影響。

Shuffle training examples more thoroughly. We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvement in the validation accuracy, which is consistent with the view of Batch Normalization as a regularizer: the randomization inherent in our method should be most beneficial when it affects an example differently each time it is seen.

更徹底地攪亂訓練樣本。我們啟用了分片內部的訓練數據攪亂,這樣可以防止同樣的樣本總是一起出現在同一小批量數據中。這導致驗證準確率提高了約1%,這與批標準化作為正則化項的觀點是一致的:當我們方法中內在的隨機化在樣本每次被看到時對其產生不同的影響時,它應該是最有益的。

Reduce the L2 weight regularization. While in Inception an L2 loss on the model parameters controls overfitting, in modified BN-Inception the weight of this loss is reduced by a factor of 5. We find that this improves the accuracy on the held-out validation data.

減少L2權重正則化。雖然在Inception中模型參數的L2損失會控制過擬合,但在修改的BN-Inception中,該損失的權重減少了5倍。我們發現這提高了在留出的驗證數據上的準確性。

Accelerate the learning rate decay. In training Inception, learning rate was decayed exponentially. Because our network trains faster than Inception, we lower the learning rate 6 times faster.

加速學習率衰減。在訓練Inception時,學習率呈指數衰減。因為我們的網絡訓練速度比Inception更快,所以我們將學習率的降低速度加快6倍。

Remove Local Response Normalization. While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.

刪除局部響應歸一化。雖然Inception和其它網絡(Srivastava等,2014)從中受益,但是我們發現使用批標準化它是不必要的。

Reduce the photometric distortions. Because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more “real” images by distorting them less.

減少光度扭曲。因為批標準化網絡訓練更快,并且觀察每個訓練樣本更少的次數,所以通過更少地扭曲它們,我們讓訓練器關注更多的“真實”圖像。
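Of the modifications listed above, the accelerated learning-rate decay is easy to sketch. The text says only that the decay is exponential and six times faster, so the schedule form, `decay_rate`, and `baseline_decay_steps` below are hypothetical placeholders. Reading "6 times faster" as dividing the decay interval by 6 is also an assumption.

```python
def exp_decay(step, initial_lr, decay_rate, decay_steps):
    """Exponential schedule: lr = initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Hypothetical schedule parameters; the source does not state them.
decay_rate = 0.5
baseline_decay_steps = 600_000
bn_decay_steps = baseline_decay_steps / 6   # decay six times faster

initial_lr = 0.0015
lr_baseline = exp_decay(600_000, initial_lr, decay_rate, baseline_decay_steps)
lr_bn = exp_decay(100_000, initial_lr, decay_rate, bn_decay_steps)
```

With this reading, the faster schedule reaches any given learning rate in one sixth of the steps, so `lr_bn` at step 100,000 equals `lr_baseline` at step 600,000.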

4.2.2. SINGLE-NETWORK CLASSIFICATION

We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data:

4.2.2. 單網絡分類

我們評估了下面的網絡,所有的網絡都在LSVRC2012訓練數據上訓練,并在驗證數據上測試:

Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015.

Inception:在4.2小節開頭描述的網絡,以0.0015的初始學習率進行訓練。

BN-Baseline: Same as Inception with Batch Normalization before each nonlinearity.

BN-Baseline:每個非線性之前加上批標準化,其它的與Inception一樣。

BN-x5: Inception with Batch Normalization and the modifications in Sec. 4.2.1. The initial learning rate was increased by a factor of 5, to 0.0075. The same learning rate increase with original Inception caused the model parameters to reach machine infinity.

BN-x5:帶有批標準化和4.2.1小節中修改的Inception。初始學習率增加5倍到了0.0075。原始Inception增加同樣的學習率會使模型參數達到機器無限大。

BN-x30: Like BN-x5, but with the initial learning rate 0.045 (30 times that of Inception).

BN-x30:類似于BN-x5,但初始學習率為0.045(Inception學習率的30倍)。

BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity $g(t)=\frac{1}{1+\exp(-t)}$ instead of ReLU. We also attempted to train the original Inception with sigmoid, but the model remained at the accuracy equivalent to chance.

BN-x5-Sigmoid:類似于BN-x5,但使用sigmoid非線性$g(t)=\frac{1}{1+\exp(-t)}$來代替ReLU。我們也嘗試訓練帶有sigmoid的原始Inception,但模型的準確率一直保持在相當于隨機猜測的水平。
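The initial learning rates of the variants listed above follow directly from the Inception baseline. A small arithmetic check, with hypothetical variable names:

```python
base_lr = 0.0015  # initial learning rate of Inception, from the text

# Multipliers implied by the variant names.
multipliers = {"BN-x5": 5, "BN-x30": 30}
initial_lr = {name: base_lr * k for name, k in multipliers.items()}

for name, lr in initial_lr.items():
    print(f"{name}: {lr:.4f}")
```

BN-x5-Sigmoid uses the same initial learning rate as BN-x5, since it differs only in the nonlinearity.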

In Figure 2, we show the validation accuracy of the networks, as a function of the number of training steps. Inception reached the accuracy of 72.2% after $31 \cdot 10^6$ training steps. Figure 3 shows, for each network, the number of training steps required to reach the same 72.2% accuracy, as well as the maximum validation accuracy reached by the network and the number of steps to reach it.

Figure 2

Figure 2. Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Figure 3

Figure 3. For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

在圖2中,我們顯示了網絡的驗證集準確率,作為訓練步驟次數的函數。Inception網絡在$31 \cdot 10^6$次訓練步驟后達到了72.2%的準確率。圖3顯示,對于每個網絡,達到同樣的72.2%準確率需要的訓練步驟數量,以及網絡達到的最大驗證集準確率和達到該準確率的訓練步驟數量。

Figure 2

圖2。Inception和它的批標準化變種在單個裁剪圖像上的驗證準確率隨訓練步驟數量的變化。

Figure 3

圖3。對于Inception和它的批標準化變種,達到Inception最大準確率(72.2%)所需要的訓練步驟數量,以及網絡取得的最大準確率。

By only using Batch Normalization (BN-Baseline), we match the accuracy of Inception in less than half the number of training steps. By applying the modifications in Sec. 4.2.1, we significantly increase the training speed of the network. BN-x5 needs 14 times fewer steps than Inception to reach the 72.2% accuracy. Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach a higher final accuracy. This phenomenon is counterintuitive and should be investigated further. BN-x30 reaches 74.8% after $6 \cdot 10^6$ steps, i.e. 5 times fewer steps than required by Inception to reach 72.2%.

通過僅使用批標準化(BN-Baseline),我們在不到Inception一半的訓練步驟數量內將準確度與其相匹配。通過應用4.2.1小節中的修改,我們顯著提高了網絡的訓練速度。BN-x5需要比Inception少14倍的步驟就達到了72.2%的準確率。有趣的是,進一步提高學習率(BN-x30)使得該模型最初訓練有點慢,但可以使其達到更高的最終準確率。這種現象是違反直覺的,應進一步調查。在$6 \cdot 10^6$步驟之后,BN-x30達到74.8%的準確率,即比Inception達到72.2%的準確率所需的步驟減少了5倍。
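The speedup figures quoted above can be related to each other with a short calculation. The step count implied for BN-x5 is derived here from the "14 times fewer" statement and is only approximate; it is not read from Figure 3.

```python
inception_steps = 31e6   # steps for Inception to reach 72.2%, from the text

# "14 times fewer steps" implies roughly this many steps for BN-x5.
bn_x5_steps = inception_steps / 14

# BN-x30 reaches 74.8% after 6e6 steps; compare with Inception's 31e6.
bn_x30_steps = 6e6
bn_x30_ratio = inception_steps / bn_x30_steps

print(f"BN-x5: about {bn_x5_steps:.2e} steps")
print(f"BN-x30: about {bn_x30_ratio:.1f} times fewer steps")
```

The ratio for BN-x30 comes out slightly above 5, consistent with the rounded "5 times fewer" in the text.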

We also verified that the reduction in internal covariate shift allows deep networks with Batch Normalization to be trained when sigmoid is used as the nonlinearity, despite the well-known difficulty of training such networks. Indeed, BN-x5-Sigmoid achieves the accuracy of 69.8%. Without Batch Normalization, Inception with sigmoid never achieves better than 1/1000 accuracy.

我們也證實了盡管訓練這樣的網絡是眾所周知的困難,但是當使用sigmoid作為非線性時,內部協變量轉移的減少允許具有批標準化的深層網絡被訓練。的確,BN-x5-Sigmoid取得了69.8%的準確率。沒有批標準化,使用sigmoid的Inception從未達到比1/1000準確率更好的結果。

4.2.3. ENSEMBLE CLASSIFICATION

The current reported best results on the ImageNet Large Scale Visual Recognition Competition are reached by the Deep Image ensemble of traditional models (Wu et al., 2015) and the ensemble model of (He et al., 2015). The latter reports the error of 4.94%, as evaluated by the ILSVRC test server. Here we report a test error of 4.82% on test server. This improves upon the previous best result, and exceeds the estimated accuracy of human raters according to (Russakovsky et al., 2014).

4.2.3. 組合分類

目前在ImageNet大型視覺識別競賽中報道的最佳結果是傳統模型的Deep Image組合(Wu等,2015)和(He等,2015)的組合模型。后者報告了ILSVRC測試服務器評估的4.94%的top-5錯誤率。這里我們在測試服務器上報告4.82%的測試錯誤率。這提高了以前的最佳結果,并且根據(Russakovsky等,2014)這超過了人類評估者的估計準確率。

For our ensemble, we used 6 networks. Each was based on BN-x30, modified via some of the following: increased initial weights in the convolutional layers; using Dropout (with the Dropout probability of 5% or 10%, vs. 40% for the original Inception); and using non-convolutional Batch Normalization with last hidden layers of the model. Each network achieved its maximum accuracy after about $6 \cdot 10^6$ training steps. The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks. The details of ensemble and multi-crop inference are similar to (Szegedy et al., 2014).

對于我們的組合,我們使用了6個網絡。每個都是基于BN-x30的,進行了以下一些修改:增加卷積層中的初始權重;使用Dropout(丟棄概率為5%或10%,而原始Inception為40%);模型最后的隱藏層使用非卷積批標準化。每個網絡在大約$6 \cdot 10^6$個訓練步驟之后實現了最大的準確率。組合預測是基于組成網絡的預測類概率的算術平均。組合和多裁剪圖像推斷的細節與(Szegedy等,2014)類似。
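The ensemble rule described above, an arithmetic average of class probabilities, can be sketched as follows. The function name `ensemble_predict` and the probabilities are hypothetical, and the example uses three networks and three classes instead of six networks and 1000 classes.

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities across networks and pick the top class."""
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n for c in range(num_classes)]
    best = max(range(num_classes), key=avg.__getitem__)
    return avg, best

# Hypothetical softmax outputs of three networks for one image.
network_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
avg_probs, predicted_class = ensemble_predict(network_probs)
```

In this example the first network alone would pick class 0, but the averaged probabilities favor class 1.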

We demonstrate in Fig. 4 that batch normalization allows us to set new state-of-the-art on the ImageNet classification challenge benchmarks.

Figure 4

Figure 4. Batch-Normalized Inception comparison with previous state of the art on the provided validation set comprising 50000 images. Ensemble results are test server evaluation results on the test set. The BN-Inception ensemble has reached 4.9% top-5 error on the 50000 images of the validation set. All other reported results are on the validation set.

我們在圖4中證實了批標準化使我們能夠在ImageNet分類挑戰基準上設置新的最佳結果。

Figure 4

圖4。批標準化Inception與以前的最佳結果在提供的包含5萬張圖像的驗證集上的比較。組合結果是在測試集上由測試服務器評估的結果。BN-Inception組合在驗證集的5萬張圖像上取得了4.9% top-5的錯誤率。所有報道的其它結果是在驗證集上。

5. Conclusion

We have presented a novel mechanism for dramatically accelerating the training of deep networks. It is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and removing it from internal activations of the network may aid in training. Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.

5. 結論

我們提出了一個新的機制,大大加快了深度網絡的訓練。它基于這樣的前提:協變量轉移(已知其會使機器學習系統的訓練複雜化)也適用于子網絡和層,并且從網絡的內部激活中去除它可能有助于訓練。我們提出的方法從標準化激活以及將這種標準化合并到網絡架構本身中獲得其能力。這確保了標準化可以被用來訓練網絡的任何優化方法進行恰當的處理。為了讓深度網絡訓練中常用的隨機優化方法可用,我們對每個小批量數據執行標準化,并通過標準化參數來反向傳播梯度。批標準化每個激活只增加了兩個額外的參數,這樣做可以保持網絡的表示能力。我們提出了一個用于構建,訓練和執行推斷的批標準化網絡算法。所得到的網絡可以用飽和非線性進行訓練,能更容忍增加的訓練率,并且通常不需要丟棄來進行正則化。

Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps, and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.

僅僅將批標準化添加到了最新的圖像分類模型中便在訓練中取得了實質的加速。通過進一步提高學習率,刪除丟棄和應用批標準化所提供的其它修改,我們只用了少部分的訓練步驟就達到了以前的技術水平,然后在單網絡圖像分類中擊敗了最先進的技術。此外,通過組合多個使用批標準化訓練的模型,我們在ImageNet上的表現顯著優于最好的已知系統。

Our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two address different goals. Batch Normalization seeks a stable distribution of activation values throughout training, and normalizes the inputs of a nonlinearity since that is where matching the moments is more likely to stabilize the distribution. On the contrary, the standardization layer is applied to the output of the nonlinearity, which results in sparser activations. We have not observed the nonlinearity inputs to be sparse, neither with nor without Batch Normalization. Other notable differences of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity, handling of convolutional layers, and deterministic inference that does not depend on the mini-batch.

我們的方法與(Gülçehre & Bengio, 2013)的標準化層相似,盡管這兩個方法解決的目標不同。批標準化尋求在整個訓練過程中激活值的穩定分布,并且對非線性的輸入進行歸一化,因為這時匹配矩更有可能穩定分布。相反,標準化層被應用于非線性的輸出,這導致了更稀疏的激活。我們沒有觀察到非線性輸入是稀疏的,無論是有批標準化還是沒有批標準化。批標準化的其它顯著差異包括學習到的縮放和轉移允許BN變換表示恒等,卷積層處理以及不依賴于小批量數據的確定性推斷。

In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). More study is needed of the regularization properties of Batch Normalization, which we believe to be responsible for the improvements we have observed when Dropout is removed from BN-Inception. We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense, i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.

在這項工作中,我們沒有探索批標準化可能實現的全部可能性。我們的未來工作包括將我們的方法應用于循環神經網絡(Pascanu et al., 2013),其中內部協變量轉移和梯度消失或爆炸可能特別嚴重,這將使我們能夠更徹底地測試標準化改善了梯度傳播的假設(第3.3節)。需要對批標準化的正則化屬性進行更多的研究,我們認為這是BN-Inception中刪除丟棄時我們觀察到的改善的原因。我們計劃調查批標準化是否有助于傳統意義上的域自適應,即網絡執行標準化是否能夠更容易泛化到新的數據分布,也許僅僅是對總體均值和方差的重新計算(Alg.2)。最后,我們認為,該算法的進一步理論分析將允許更多的改進和應用。

Acknowledgments

We thank Vincent Vanhoucke and Jay Yagnik for help and discussions, and the reviewers for insightful comments.

致謝

我們感謝Vincent Vanhoucke和Jay Yagnik的幫助和討論,以及審稿人的深刻評論。

References

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.

Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.

Gülçehre, Çağlar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.

Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5): 411–430, May 2000.

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.

Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi: 10.1109/CVPR.2008.4587821.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310–1318, 2013.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90 (2):227–244, October 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.

Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.
