Agenda
- Setting up the data and the model
- Data preprocessing
- Weight initialization
- Regularization
- Loss functions
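Data preprocessing
- Mean subtraction
Suppose X is an image of shape (h, w, c). X -= np.mean(X) subtracts one scalar mean, and the per-color-channel variant is X -= np.mean(X, axis=(0, 1)); np.mean(X, axis=0) is the per-feature form for a data matrix whose rows are samples. The main benefit of mean subtraction is that the processed data has zero mean. Take an image with only two pixels, X = [a, b]: Y = X - np.mean(X) = [a - (a + b)/2, b - (a + b)/2], so E(Y) = (a - (a + b)/2 + b - (a + b)/2)/2 = 0, while the variance D(Y) is the same as D(X). Mean subtraction does not make Y Gaussian; it only centers it, and this zero-mean property is what the Xavier initialization derivation uses later.
- Normalization
Normalization rescales the data so that every dimension lies on roughly the same scale: after zero-centering, divide by the standard deviation, X /= np.std(X, axis=0). This only makes sense when different input features have different scales but should matter roughly equally to the learning algorithm. Image pixels already lie in [0, 255], so normalization is usually not necessary for images. The figure from the original notes that illustrates the effect is omitted here.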
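The two preprocessing steps above can be sketched in NumPy as follows. This is only a minimal illustration on made-up data: the data matrix, its feature scales and the 32x32 image are assumptions, not something from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix: 500 samples, 3 features on very different scales.
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(500, 3))

# Mean subtraction: center every feature at zero.
X_centered = X - np.mean(X, axis=0)

# Normalization: divide each zero-centered feature by its standard deviation.
X_normalized = X_centered / np.std(X_centered, axis=0)

# Per-channel mean subtraction for a single (h, w, c) image.
image = rng.uniform(0, 255, size=(32, 32, 3))
image_centered = image - np.mean(image, axis=(0, 1))

print(np.mean(X_centered, axis=0))    # close to 0 for every feature
print(np.std(X_normalized, axis=0))   # close to 1 for every feature
```

In a real pipeline the mean and standard deviation would be computed on the training set only and then reused for validation and test data.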
PCA/Whitening is not covered here (In practice. We mention PCA/Whitening in these notes for completeness, but these transformations are not used with Convolutional Networks.)
- Weight initialization
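Two incorrect ways to initialize:
- All zeros. Every neuron in a layer then computes the same output, receives the same gradient during backpropagation and undergoes the same update, so the neurons never become different from one another.
- Random numbers too close to 0. The gradient that flows back through a layer is proportional to that layer's weights, so very small weights give very small gradients. During backpropagation this kills the update signal, the parameters barely change, and training has trouble converging.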
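The symmetry problem of all-zero initialization can be checked on a toy network. The sketch below is an assumption-laden example (a 4 -> 3 -> 1 network with a sigmoid hidden layer, made-up data, a mean squared error loss and plain gradient descent), not code from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # 8 samples, 4 input features (made-up data)
y = rng.normal(size=(8, 1))   # made-up regression targets

# All-zero initialization of a 4 -> 3 -> 1 network.
W1, b1 = np.zeros((4, 3)), np.zeros(3)
W2, b2 = np.zeros((3, 1)), np.zeros(1)

learning_rate = 0.5
for _ in range(20):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    # Backward pass for the loss 0.5 * mean((out - y) ** 2).
    dout = (out - y) / len(X)
    dW2, db2 = h.T @ dout, dout.sum(axis=0)
    dz = (dout @ W2.T) * h * (1.0 - h)
    dW1, db1 = X.T @ dz, dz.sum(axis=0)
    # Plain gradient descent update.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

# The weights did move, but every hidden unit still has the same
# incoming weights (columns of W1) and outgoing weight (rows of W2).
print(W1)
print(W2.ravel())
```

The three hidden units stay copies of each other no matter how long training runs, so the layer behaves like a single neuron.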
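- Xavier initialization
Random initialization of the weights has another problem: the variance of a neuron's output grows with the number of inputs, so after a single layer the output variance can be much larger than the input variance. The inputs to the next layer then differ widely from one another, which is bad for training that layer: with such widely varying inputs, even a small update to one weight can make the loss oscillate violently.
The idea is therefore to control the variance of the output. Draw the weights from a zero-mean Gaussian (positive and negative contributions roughly balance because the mean is 0, and the spread between parameters is fixed by the variance). If the variance stays stable after passing through a layer, the goal is achieved.
After one linear layer the output is y = Σ wi*xi + b. Assume the wi are i.i.d., the xi are i.i.d., W and X are independent of each other, and b is a constant. X has zero mean after the mean-subtraction preprocessing and W is drawn with zero mean, so E(xi) = E(wi) = 0. N is the number of rows of W, i.e. the number of inputs feeding the neuron.
D(y) = D(Σ wi*xi + b) = Σ D(wi*xi) = Σ [E(wi)^2*D(xi) + E(xi)^2*D(wi) + D(wi)*D(xi)] = Σ D(wi)*D(xi) = N*D(wi)*D(xi)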
To keep the variance unchanged overall we need D(y) = D(xi), i.e. N*D(wi)*D(xi) = D(xi) -> D(wi) = 1/N. So if W is drawn from a zero-mean Gaussian with variance 1/N, the variance is preserved through the linear layer. One way to initialize is w = np.random.randn(n) / sqrt(n). In practice Caffe does it differently:
    template <typename Dtype>
    class XavierFiller : public Filler<Dtype> {
     public:
      explicit XavierFiller(const FillerParameter& param)
          : Filler<Dtype>(param) {}
      virtual void Fill(Blob<Dtype>* blob) {
        // set n by configuration
        Dtype scale = sqrt(Dtype(3) / n);
        // Uniform draw on [-scale, scale], which gives E(W) = 0 and D(W) = 1/n
        caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
            blob->mutable_cpu_data());
        CHECK_EQ(this->filler_param_.sparse(), -1)
            << "Sparsity not supported by this Filler.";
      }
    };

    template <typename Dtype>
    void caffe_rng_uniform(const int n, const Dtype a, const Dtype b, Dtype* r) {
      CHECK_GE(n, 0);
      CHECK(r);
      CHECK_LE(a, b);
      boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
      boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
          variate_generator(caffe_rng(), random_distribution);
      for (int i = 0; i < n; ++i) {
        r[i] = variate_generator();
      }
    }
Here scale = sqrt(3/n), and the weights are drawn from a uniform distribution on the interval [-scale, scale]. A uniform distribution on [a, b] has probability density 1/(b - a) on that interval, mean 1/2*(a + b) and variance 1/12*(b - a)^2.
With b = scale = sqrt(3/n) and a = -scale = -sqrt(3/n), D(W) = 1/12*(b - a)^2 = 1/12*(2*sqrt(3/n))^2 = 1/12*4*3/n = 1/n and E(W) = 1/2*(a + b) = 0. W is therefore uniformly distributed (not Gaussian) with mean 0 and variance 1/n, which shows how Caffe's XavierFiller guarantees that the variance of W equals 1/n.
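The variance argument can be checked numerically. The sketch below pushes zero-mean, unit-variance inputs through one linear layer and compares three initializations; the layer sizes, the batch size and the dictionary keys are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_out, batch = 512, 256, 2000   # hypothetical layer sizes

# Zero-mean inputs with D(x) = 1, as after mean subtraction and normalization.
X = rng.standard_normal((batch, n))

scale = np.sqrt(3.0 / n)
weights = {
    "gaussian, D(w) = 1": rng.standard_normal((n, n_out)),
    "gaussian, D(w) = 1/n": rng.standard_normal((n, n_out)) / np.sqrt(n),
    "uniform, D(w) = 1/n": rng.uniform(-scale, scale, size=(n, n_out)),
}

# Variance of y = sum_i w_i * x_i (bias left at 0) for each initialization.
variances = {name: float(np.var(X @ W)) for name, W in weights.items()}
for name, value in variances.items():
    print(f"{name}: D(y) = {value:.3f}")
```

With D(w) = 1 the output variance comes out near n, while both D(w) = 1/n variants keep it near D(x) = 1, in line with D(y) = N*D(wi)*D(xi).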
In addition, if the activation function is the commonly used ReLU, see Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, which recommends w = np.random.randn(n) * sqrt(2.0/n).
- Regularization
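Regularization is a family of techniques for suppressing overfitting (other techniques include ReLU and data augmentation). As mentioned earlier in neural-network-1, overfitting is not a reason to use a smaller network. Regularization includes L2, L1, max norm constraints and dropout.
- L2 Regularization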
This is the most common approach and is applied to the loss function: new_loss = loss + λ/2n*Σ w^2. Intuitively, L2 penalizes large w heavily and prefers small, diffuse w, so that noise has a weaker effect (we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot).
Another point is that with L2 the update of W is linear in w: d new_loss/dw = dl/dw + λ/n*w, so the update of w becomes
new w = w - learning_rate*dl/dw - learning_rate*λ/n*w. The extra term shrinks w linearly toward zero at every step. Now look at how Caffe implements L2:
    template <typename Dtype>
    void SGDSolver<Dtype>::ApplyUpdate() {
      CHECK(Caffe::root_solver());
      Dtype rate = GetLearningRate();
      for (int param_id = 0; param_id < this->net_->learnable_params().size();
           ++param_id) {
        // Normalization
        Normalize(param_id);
        // Regularization
        Regularize(param_id);
        // Only computes the update value; the weights are not changed yet
        ComputeUpdateValue(param_id, rate);
      }
      // The weights are actually updated here
      this->net_->Update();
    }

    // Excerpt from SGDSolver<Dtype>::Regularize
    case Caffe::CPU: {
      if (local_decay) {
        if (regularization_type == "L2") {
          // add weight decay: y = a*x + y, i.e. diff = local_decay * w + diff
          caffe_axpy(net_params[param_id]->count(),
              local_decay,
              net_params[param_id]->cpu_data(),
              // Mutable pointer to the diff (a static_cast internally). The
              // derivative of the L2 term is added to the gradient directly,
              // which has the same effect as adding the L2 term to the loss.
              net_params[param_id]->mutable_cpu_diff());
        }

    template <typename Dtype>
    void SGDSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
      const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
      const vector<float>& net_params_lr = this->net_->params_lr();
      // Momentum, which helps escape local minima
      Dtype momentum = this->param_.momentum();
      Dtype local_rate = rate * net_params_lr[param_id];
      // Compute the update to history, then copy it to the parameter diff.
      switch (Caffe::mode()) {
      case Caffe::CPU: {
        // Update the history first: y = a*x + b*y, i.e.
        // history = local_rate * diff (L2 term included) + momentum * history
        caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
            net_params[param_id]->cpu_diff(), momentum,
            history_[param_id]->mutable_cpu_data());
        // Copy the history into the parameter diff; Update() then applies it to w
        caffe_copy(net_params[param_id]->count(),
            history_[param_id]->cpu_data(),
            net_params[param_id]->mutable_cpu_diff());
        break;
      }
L1 is similar to L2 and is not covered in detail here.
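The L2 update rule can be reproduced in a few lines of NumPy. This is only a sketch with made-up hyperparameters; the helper names l2_penalty and sgd_step_with_l2 are hypothetical and momentum is left out.

```python
import numpy as np

def l2_penalty(w, lam, n):
    """The extra term lam / (2 * n) * sum(w ** 2) added to the loss."""
    return lam / (2.0 * n) * np.sum(w ** 2)

def sgd_step_with_l2(w, grad_loss, learning_rate, lam, n):
    """w - learning_rate * dl/dw - learning_rate * lam / n * w."""
    return w - learning_rate * grad_loss - learning_rate * lam / n * w

w = np.array([3.0, -2.0, 0.5])
learning_rate, lam, n = 0.1, 5.0, 10   # made-up hyperparameters

# Check the derivative of the penalty, lam / n * w, by central differences.
h = 1e-6
numeric_grad = np.array([
    (l2_penalty(w + h * e, lam, n) - l2_penalty(w - h * e, lam, n)) / (2 * h)
    for e in np.eye(len(w))
])
analytic_grad = lam / n * w

# With a flat data loss (dl/dw = 0) the step only shrinks w by a constant factor.
w_next = sgd_step_with_l2(w, np.zeros_like(w), learning_rate, lam, n)
print(numeric_grad, analytic_grad)
print(w_next, (1.0 - learning_rate * lam / n) * w)
```

The factor 1 - learning_rate*λ/n is why the L2 term is usually called weight decay: each step multiplies w by the same constant before the data gradient is applied.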
- Max norm constraints
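Max norm constraints put an upper bound on the norm of each neuron's weight vector: after a parameter update, the weights are clipped back so that the norm does not exceed the bound. This keeps updates from growing without limit and, to some extent, prevents the training oscillation caused by a learning rate that is too high. The Keras implementation is a useful reference.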
    class MaxNorm(Constraint):
        """MaxNorm weight constraint.

        Constrains the weights incident to each hidden unit
        to have a norm less than or equal to a desired value.

        # Arguments
            m: the maximum norm for the incoming weights.
            axis: integer, axis along which to calculate weight norms.
                For instance, in a `Dense` layer the weight matrix
                has shape `(input_dim, output_dim)`,
                set `axis` to `0` to constrain each weight vector
                of length `(input_dim,)`.
                In a `Conv2D` layer with `data_format="channels_last"`,
                the weight tensor has shape
                `(rows, cols, input_depth, output_depth)`,
                set `axis` to `[0, 1, 2]`
                to constrain the weights of each filter tensor of size
                `(rows, cols, input_depth)`.

        # References
            - [Dropout: A Simple Way to Prevent Neural Networks from Overfitting Srivastava, Hinton, et al. 2014](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
        """

        def __init__(self, max_value=2, axis=0):
            self.max_value = max_value
            self.axis = axis

        def __call__(self, w):
            # from . import backend as K
            norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
            # Clip the norm into the range [0, max_value]
            desired = K.clip(norms, 0, self.max_value)
            # K.epsilon() is a small constant fuzz factor (_EPSILON = 1e-7)
            # used in numeric expressions; here it avoids division by zero
            w *= (desired / (K.epsilon() + norms))
            return w

        def get_config(self):
            return {'max_value': self.max_value,
                    'axis': self.axis}
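The same clipping logic can be tried without Keras. Below is a small NumPy sketch that mirrors the three steps of __call__ above; the function name max_norm and the example matrix are assumptions for illustration.

```python
import numpy as np

def max_norm(w, max_value=2.0, axis=0, eps=1e-7):
    """Rescale weight vectors along `axis` so their L2 norm is at most max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * (desired / (eps + norms))

# Dense-style weight matrix of shape (input_dim, output_dim).
# The column norms are 5.0 and 0.5.
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
W_clipped = max_norm(W, max_value=2.0, axis=0)

print(W_clipped)
print(np.linalg.norm(W_clipped, axis=0))   # roughly [2.0, 0.5]
```

Only the first column is rescaled; the second one is already inside the bound and is left essentially unchanged, so the direction of every weight vector is preserved.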
- Dropout
Dropout is an extremely effective regularization technique and is used in VGG16, for example; see Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The idea of dropout is to reduce the co-adaptation between neurons: during training, units are switched off at random according to a threshold, so each step trains only part of the network, which greatly reduces the influence of noise. Follow-up work on dropout includes the Dropout paper and Dropout Training as Adaptive Regularization.
All in all: In practice: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of (p = 0.5) is a reasonable default, but this can be tuned on validation data.
Again, let's look at how dropout actually works in Caffe:
    template <typename Dtype>
    void DropoutLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
        const vector<Blob<Dtype>*>& top) {
      const Dtype* bottom_data = bottom[0]->cpu_data();
      Dtype* top_data = top[0]->mutable_cpu_data();
      unsigned int* mask = rand_vec_.mutable_cpu_data();
      const int count = bottom[0]->count();
      // Units are only dropped during training
      if (this->phase_ == TRAIN) {
        // Sample the mask: each entry is 1 (keep) with probability 1 - threshold_
        caffe_rng_bernoulli(count, 1. - threshold_, mask);
        // Whether scaling at training time is enabled in the prototxt
        if (scale_train_) {
          for (int i = 0; i < count; ++i) {
            top_data[i] = bottom_data[i] * mask[i] * scale_;
          }
        } else {
          for (int i = 0; i < count; ++i) {
            // Apply the mask to set the top data
            top_data[i] = bottom_data[i] * mask[i];
          }
        }
      } else {
        // Test time: pass the data through, rescaling only if that was
        // not already done during training
        caffe_copy(bottom[0]->count(), bottom_data, top_data);
        if (!scale_train_) {
          caffe_scal<Dtype>(count, 1. / scale_, top_data);
        }
      }
    }
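The scale_train_ branch above corresponds to what is usually called inverted dropout. Here is a minimal NumPy sketch of that variant, assuming scale_ equals 1 / (1 - threshold_); the function name dropout_forward and its arguments are made up for this example.

```python
import numpy as np

def dropout_forward(x, drop_prob, train, rng):
    """Inverted dropout: scale at training time, identity at test time."""
    if not train:
        return x
    keep_prob = 1.0 - drop_prob
    # Each unit is kept with probability keep_prob.
    mask = rng.random(x.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return x * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 100))

out_train = dropout_forward(x, drop_prob=0.5, train=True, rng=rng)
out_test = dropout_forward(x, drop_prob=0.5, train=False, rng=rng)

print(np.unique(out_train))                # only 0.0 and 2.0 appear
print(out_train.mean(), out_test.mean())   # both close to 1.0
```

Scaling during training means the test-time forward pass needs no extra work, which is the reason this variant is generally preferred over rescaling at test time.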
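- Loss functions
The common classification losses were covered earlier: SVM, and softmax with cross-entropy. When the number of classes is very large, consider Hierarchical Softmax. For attribute classification or regression problems, first consider whether the task can be converted into a set of independent classification problems, because applying the L2 squared loss directly to a regression problem is harder to train and more fragile (Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients).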
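The remark about outliers can be made concrete by comparing gradients. The sketch below uses made-up predictions, targets and scores, and the two helper functions are hypothetical names written for this example.

```python
import numpy as np

def l2_loss_grad(pred, target):
    """Gradient of 0.5 * (pred - target) ** 2 with respect to pred."""
    return pred - target

def softmax_xent_grad(scores, label):
    """Gradient of the softmax cross-entropy loss with respect to the scores."""
    shifted = scores - np.max(scores)   # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    probs[label] -= 1.0
    return probs

# L2 loss: the gradient grows linearly with the error, so one outlier dominates.
grad_typical = l2_loss_grad(pred=2.0, target=2.5)
grad_outlier = l2_loss_grad(pred=2.0, target=250.0)

# Softmax cross-entropy: even for a badly misclassified sample with extreme
# scores, every gradient component stays within [-1, 1].
grad_xent = softmax_xent_grad(np.array([100.0, -50.0, 3.0]), label=1)

print(grad_typical, grad_outlier)
print(grad_xent)
```

The L2 gradient for the outlier is hundreds of times larger than for the typical sample, while the softmax gradient is bounded regardless of how extreme the scores are.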



