Agenda

Setting up the data and the model
- Data preprocessing
- Weight initialization
- Regularization
Loss functions

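Data preprocessing

Mean subtraction
Suppose X is an image of shape (h, w, c). Mean subtraction is X -= np.mean(X); to subtract a separate mean per color channel, use X -= np.mean(X, axis = (0, 1)). (The np.mean(X, axis = 0) form in the original notes applies to a data matrix of shape (N, D), where it gives one mean per feature.) The main benefit of mean subtraction is that the processed image data has zero mean. Take an image with only two pixels, X = [a, b]: Y = X - np.mean(X) = [a - (a + b)/2, b - (a + b)/2], so E(Y) = (a - (a + b)/2 + b - (a + b)/2)/2 = 0, while D(Y) is left unchanged by the shift. Mean subtraction centers the data but does not by itself make it Gaussian; the zero mean is what the Xavier initialization derivation later relies on.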
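
A minimal NumPy sketch of both variants of mean subtraction for a single image. The image here is random and purely hypothetical; for an (h, w, c) array the per-channel mean is taken over the first two axes.

```python
import numpy as np

# Hypothetical 4x4 RGB image with values in [0, 255]; shape is (h, w, c).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float64)

# Global mean: a single scalar for the whole image.
X_global = X - np.mean(X)

# Per-channel mean: average over height and width, keep the channel axis.
channel_mean = np.mean(X, axis=(0, 1))   # shape (3,)
X_channel = X - channel_mean             # broadcasts over h and w
```

After either variant the corresponding mean (global, or per channel) is zero up to floating-point error.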
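Normalization

Normalization rescales the data so that every dimension has roughly the same scale. After zero-centering, each dimension is divided by its standard deviation: X /= np.std(X, axis = 0). This only makes sense when the input features have different scales but should matter about equally to the learning algorithm. Image pixels already share the range [0, 255], so this step is usually unnecessary for images. The original notes include a figure showing the effect.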
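
As a rough illustration of normalization on a data matrix of shape (N, D), the sketch below zero-centers two features and then divides each by its standard deviation. The data and its scales are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1000 samples, 2 features on very different scales.
X = rng.normal(loc=[5.0, -3.0], scale=[1.0, 100.0], size=(1000, 2))

X -= np.mean(X, axis=0)   # zero-center each feature
X /= np.std(X, axis=0)    # scale each feature to unit standard deviation
```

Both features end up with zero mean and unit standard deviation, so neither dominates purely because of its units.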

PCA and whitening are not covered here. (In practice. We mention PCA/Whitening in these notes for completeness, but these transformations are not used with Convolutional Networks.)
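Weight initialization
Two incorrect ways to initialize:
1. All zeros. Every neuron then computes the same output and receives the same gradient, so every backward pass applies identical updates and the neurons never become different from one another.
2. Random numbers very close to 0. The gradient that a layer passes back to its inputs is proportional to its weights, so very small weights produce very small gradients. In a deep network this kills the gradient signal during backpropagation, the parameters barely update, and training struggles to converge.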
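
The symmetry problem of the first case can be checked numerically. The sketch below assumes a tiny two-layer tanh network without biases and a squared-error loss (both chosen only for illustration) and shows that with constant initialization every hidden unit gets the same gradient; all zeros is the special case c = 0, where the gradients vanish entirely.

```python
import numpy as np

def hidden_grads(W1, W2, X, y):
    """Gradients of 0.5 * mean squared error for a two-layer tanh net (no biases)."""
    h = np.tanh(X @ W1)                   # hidden activations, shape (N, H)
    out = h @ W2                          # predictions, shape (N, 1)
    d_out = (out - y) / len(X)            # dL/d_out
    dW2 = h.T @ d_out                     # shape (H, 1)
    d_h = d_out @ W2.T                    # shape (N, H)
    dW1 = X.T @ (d_h * (1.0 - h ** 2))    # shape (D, H)
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

for c in (0.0, 0.5):
    W1 = np.full((3, 4), c)
    W2 = np.full((4, 1), c)
    dW1, dW2 = hidden_grads(W1, W2, X, y)
    # Each column of dW1 belongs to one hidden unit; all columns coincide.
    print(c, np.allclose(dW1, dW1[:, :1]))
```

Because every column of the gradient is identical, gradient descent can never make the hidden units differ from each other.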
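Xavier initialization
Plain random initialization has another problem: after one layer of neurons, the variance of the output distribution can grow considerably. The inputs to the next layer then differ widely in magnitude, which hurts the training of that layer's parameters: with such inputs, even a small weight update can make the loss oscillate violently.
The approach is to control the output variance. Draw the weights from a zero-mean Gaussian (positive and negative contributions balance because the mean is 0, and the spread of the parameters is a fixed variance). If the variance stays stable after passing through a layer, the goal is met.
After one linear layer the output is y = Σ wi*xi + b. Assume the wi are i.i.d., the xi are i.i.d., W and X are independent of each other, and b is a constant. X has been mean-subtracted during preprocessing, so E(xi) = E(wi) = 0. N is the number of inputs to the neuron (the number of rows of W).
D(y) = D(Σ wi*xi + b) = Σ D(wi*xi) = Σ [E(wi)^2*D(xi) + E(xi)^2*D(wi) + D(wi)*D(xi)] = Σ D(wi)*D(xi) = N*D(wi)*D(xi)
To keep the variance unchanged we need D(y) = D(xi), that is N*D(wi)*D(xi) = D(xi), so D(wi) = 1/N. If W is drawn from a zero-mean Gaussian with variance 1/N, the variance is preserved through the linear layer. One way to initialize is w = np.random.randn(n) / sqrt(n). Caffe actually uses a different approach: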
```
class XavierFiller : public Filler<Dtype> {
   public:
    explicit XavierFiller(const FillerParameter& param)
        : Filler<Dtype>(param) {}
    virtual void Fill(Blob<Dtype>* blob) {
      // set n by configuration
      Dtype scale = sqrt(Dtype(3) / n);
      // Draw from the uniform distribution on [-scale, scale], which has E(W) = 0 and D(W) = 1/n
      caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
          blob->mutable_cpu_data());
      CHECK_EQ(this->filler_param_.sparse(), -1)
           << "Sparsity not supported by this Filler.";
    }
  };

  void caffe_rng_uniform(const int n, const Dtype a, const Dtype b, Dtype* r) {
    CHECK_GE(n, 0);
    CHECK(r);
    CHECK_LE(a, b);
    boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
    boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
        variate_generator(caffe_rng(), random_distribution);
    for (int i = 0; i < n; ++i) {
      r[i] = variate_generator();
    }
  }
```
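Here scale = sqrt(3/n), and the values are drawn from a uniform distribution on [-scale, scale]. The probability density of a uniform distribution on [a, b] is:

f(x) = 1/(b - a) for a <= x <= b, and 0 otherwise

Its mean and variance are:

E(X) = (a + b)/2, D(X) = (b - a)^2/12

Here b = scale = sqrt(3/n) and a = -scale = -sqrt(3/n), so D(W) = 1/12*(b - a)^2 = 1/12*(2*sqrt(3/n))^2 = 1/12*4*3/n = 1/n and E(W) = 1/2*(a + b) = 0. W is uniformly distributed rather than Gaussian, but it has zero mean and variance 1/n, which is all the variance derivation requires. This is how Caffe's XavierFiller ensures that the variance of W equals 1/n.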
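
The variance argument can be checked empirically. The following is a sketch, not Caffe's code: the fan-in n, the number of output units, and the sample count are arbitrary choices, and the inputs are assumed to be zero-mean with unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256           # fan-in: number of inputs per neuron (arbitrary choice)
n_out = 64        # number of output neurons (arbitrary choice)
samples = 2000

x = rng.normal(size=(samples, n))     # zero-mean inputs with variance 1

scale = np.sqrt(3.0 / n)
W_uniform = rng.uniform(-scale, scale, size=(n, n_out))   # uniform variant
W_gauss = rng.normal(size=(n, n_out)) / np.sqrt(n)        # randn / sqrt(n)

print(W_uniform.var() * n)      # expected to be close to 1, i.e. D(W) ~ 1/n
print((x @ W_uniform).var())    # output variance, expected close to D(x) = 1
print((x @ W_gauss).var())      # same for the Gaussian variant
```

Both variants keep the output variance near the input variance, which is the property the derivation asks for; the shape of the weight distribution does not matter, only its mean and variance.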
If the activation function is the commonly used ReLU, see Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, which recommends w = np.random.randn(n) * sqrt(2.0/n).
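
To see why the factor 2 matters with ReLU, the sketch below pushes random inputs through a stack of ReLU layers and compares the two weight variances. The helper `final_rms`, the layer width, depth, and batch size are all hypothetical choices for illustration.

```python
import numpy as np

def final_rms(weight_scale, n=256, depth=10, batch=1000, seed=0):
    """RMS activation after `depth` ReLU layers with w ~ N(0, weight_scale / n)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, n))
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * np.sqrt(weight_scale / n)
        a = np.maximum(0.0, a @ W)   # linear layer followed by ReLU
    return np.sqrt(np.mean(a ** 2))

rms_xavier = final_rms(1.0)   # weight variance 1/n
rms_he = final_rms(2.0)       # weight variance 2/n
print(rms_xavier, rms_he)
```

With variance 1/n the activations shrink layer after layer, because ReLU zeroes out roughly half of each pre-activation; with variance 2/n their magnitude stays roughly constant.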

Regularization
Regularization is a technique for suppressing overfitting (other means include ReLU and data augmentation). As noted in neural-network-1, overfitting is not a reason to use a smaller network. Regularization methods include L2, L1, max norm constraints and dropout.

L2 Regularization
The most common method, applied to the loss function: new_loss = loss + λ/(2n)*Σ w^2. Intuitively, L2 penalizes large w heavily and prefers small, diffuse w, so that noise has a weaker effect (we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot).
Another point is that under L2 the update of W stays linear in w: d new_loss/dw = d loss/dw + λ/n*w, so the update of w becomes
w_new = w - learning_rate*d loss/dw - learning_rate*λ/n*w, which shows that the update is still linear in w. Now look at Caffe's L2 implementation:

void SGDSolver<Dtype>::ApplyUpdate() {
  CHECK(Caffe::root_solver());
  Dtype rate = GetLearningRate();
  for (int param_id = 0; param_id < this->net_->learnable_params().size();
       ++param_id) {
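    // normalize the accumulated gradient (Caffe scales it by 1 / iter_size)
    Normalize(param_id);
    // regularize: add the weight-decay term to the gradient
    Regularize(param_id);
    // the weights are not updated yet; this only computes the update value
    ComputeUpdateValue(param_id, rate);
  }
  // the weights are actually updated here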
  this->net_->Update();
}


case Caffe::CPU: {
if (local_decay) {
  if (regularization_type == "L2") {
    // add weight decay: y = a*x + y, i.e. diff = local_decay * w + diff
    caffe_axpy(net_params[param_id]->count(),
        local_decay,
        net_params[param_id]->cpu_data(),
        // mutable pointer to the gradient (diff) buffer: diff += local_decay * w. This adds the derivative of the L2 term directly to the gradient, which has the same effect as adding the term to the loss
        net_params[param_id]->mutable_cpu_diff());
  } 

  void SGDSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
  const vector<float>& net_params_lr = this->net_->params_lr();
  // momentum, which helps the optimizer escape local minima
  Dtype momentum = this->param_.momentum();
  Dtype local_rate = rate * net_params_lr[param_id];
  // Compute the update to history, then copy it to the parameter diff.
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    // update the history first: y = a*x + b*y, i.e. history = local_rate * diff (gradient including the L2 term) + momentum * history
    caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
              net_params[param_id]->cpu_diff(), momentum,
              history_[param_id]->mutable_cpu_data());
    // copy the history into the parameter diff; net_->Update() later applies it to w
    caffe_copy(net_params[param_id]->count(),
        history_[param_id]->cpu_data(),
        net_params[param_id]->mutable_cpu_diff());
    break;
  }

L1 is similar to L2 and is not covered in detail here.
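
The three steps of the Caffe excerpt (regularize, compute the update with momentum, apply it) can be sketched in NumPy as follows. This is an illustrative re-implementation, not Caffe's API: the function `sgd_step` and its default values are hypothetical, and `weight_decay` plays the role of λ/n in the formulas above. The L1 branch uses the sign of w as the derivative of |w|.

```python
import numpy as np

def sgd_step(w, grad, history, lr=0.01, momentum=0.9, weight_decay=5e-4,
             reg_type="L2"):
    """One SGD step in the order regularize -> compute update -> apply."""
    # Regularize: add the derivative of the penalty to the data gradient.
    if reg_type == "L2":
        grad = grad + weight_decay * w            # d/dw of (decay / 2) * w^2
    elif reg_type == "L1":
        grad = grad + weight_decay * np.sign(w)   # d/dw of decay * |w|
    # Compute update: history = lr * grad + momentum * history.
    history = lr * grad + momentum * history
    # Apply: subtract the accumulated update from the weights.
    return w - history, history

w = np.array([1.0, -2.0, 0.5])
history = np.zeros_like(w)
grad = np.zeros_like(w)   # zero data gradient isolates the decay effect
w_new, history = sgd_step(w, grad, history)
```

With a zero data gradient, one L2 step multiplies every weight by (1 - lr * weight_decay), which is the linear shrinkage described above.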

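Max norm constraints
A max norm constraint puts an upper bound on the norm of each neuron's weight vector. This keeps the parameters from growing without limit and, to some extent, prevents training from oscillating when the learning rate is too high. The Keras implementation is a useful reference: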

class MaxNorm(Constraint):
  """MaxNorm weight constraint.
  Constrains the weights incident to each hidden unit
  to have a norm less than or equal to a desired value.
  # Arguments
      m: the maximum norm for the incoming weights.
      axis: integer, axis along which to calculate weight norms.
          For instance, in a `Dense` layer the weight matrix
          has shape `(input_dim, output_dim)`,
          set `axis` to `0` to constrain each weight vector
          of length `(input_dim,)`.
          In a `Conv2D` layer with `data_format="channels_last"`,
          the weight tensor has shape
          `(rows, cols, input_depth, output_depth)`,
          set `axis` to `[0, 1, 2]`
          to constrain the weights of each filter tensor of size
          `(rows, cols, input_depth)`.
  # References
      - [Dropout: A Simple Way to Prevent Neural Networks from Overfitting Srivastava, Hinton, et al. 2014](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
  """

  def __init__(self, max_value=2, axis=0):
      self.max_value = max_value
      self.axis = axis

  def __call__(self, w):
      # from . import backend as K
      norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
      # clip the norms to the range [0, max_value]
      desired = K.clip(norms, 0, self.max_value)
      # K.epsilon() is a tiny constant (_EPSILON = 1e-7, the fuzz factor used in numeric expressions) that avoids division by zero; it is not random
      w *= (desired / (K.epsilon() + norms))
      return w

  def get_config(self):
      return {'max_value': self.max_value,
              'axis': self.axis}
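
The same projection can be written in plain NumPy, which makes it easy to try on a small matrix. This is a sketch mirroring the logic of the Keras class above; the function name `max_norm` and the example weights are made up.

```python
import numpy as np

def max_norm(w, max_value=2.0, axis=0, eps=1e-7):
    """Rescale weight vectors along `axis` whose L2 norm exceeds max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * (desired / (eps + norms))

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])   # column norms are 5.0 and 0.5
W_c = max_norm(W)
col_norms = np.sqrt(np.sum(W_c ** 2, axis=0))
```

The first column is scaled down so that its norm is (almost exactly) 2, while the second column, already inside the bound, is left essentially unchanged.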

Dropout
Dropout is an extremely effective technique, used for example in VGG16; see Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The idea is to reduce co-adaptation between neurons: during training each neuron is switched off at random according to a threshold (the drop probability), so each step trains only part of the network, which makes the network much less prone to fitting noise.

Follow-up papers on dropout include the Dropout paper and Dropout Training as Adaptive Regularization.

In summary, in practice: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of (p = 0.5) is a reasonable default, but this can be tuned on validation data.

Let's see how dropout works in an actual Caffe implementation:

void DropoutLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
    const Dtype* bottom_data = bottom[0]->cpu_data();
    Dtype* top_data = top[0]->mutable_cpu_data();
    unsigned int* mask = rand_vec_.mutable_cpu_data();
    const int count = bottom[0]->count();
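    // dropout is only active during training
    if (this->phase_ == TRAIN) {
      // sample the mask: each entry is 1 (keep) with probability 1 - threshold_
      caffe_rng_bernoulli(count, 1. - threshold_, mask);
       // whether scaling at training time is enabled in the prototxt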
      if (scale_train_) {
        for (int i = 0; i < count; ++i) {
          top_data[i] = bottom_data[i] * mask[i] * scale_;
        }
      } else {
        for (int i = 0; i < count; ++i) {
          // multiply by the mask to set the top data
          top_data[i] = bottom_data[i] * mask[i];
        }
      }
    } else {
      caffe_copy(bottom[0]->count(), bottom_data, top_data);
      if (!scale_train_) {
        caffe_scal<Dtype>(  count, 1. / scale_, top_data);
      }
    }
  }
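
The branch that scales at training time corresponds to what is usually called inverted dropout. A minimal NumPy sketch, assuming a drop probability of 0.5 and a hypothetical helper name `dropout_forward`:

```python
import numpy as np

def dropout_forward(x, drop_prob=0.5, train=True, rng=None):
    """Inverted dropout: scale at training time so test time is the identity."""
    if not train:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.binomial(1, keep_prob, size=x.shape)   # 1 = keep, 0 = drop
    return x * mask / keep_prob

x = np.ones((1000, 100))
out_train = dropout_forward(x, drop_prob=0.5, rng=np.random.default_rng(0))
out_test = dropout_forward(x, train=False)
```

Dividing by the keep probability during training keeps the expected activation unchanged, so nothing has to be rescaled at test time. The other branch of the Caffe code does the opposite: it leaves training outputs unscaled and multiplies by 1 / scale_ at test time.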

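Loss functions
The common classification losses were covered earlier: SVM and softmax with cross-entropy. When there are very many classes, consider Hierarchical Softmax. For attribute prediction or regression problems, first consider whether the problem can be converted into a set of independent classification problems; applying the L2 squared loss directly to a regression problem is harder to train and more fragile (Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients).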
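
For reference, the two classification losses mentioned above can be sketched for a single example as follows. The scores are hypothetical, and the function names are chosen for this illustration only.

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy of the softmax distribution for one example."""
    shifted = scores - np.max(scores)   # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def svm_hinge(scores, label, margin=1.0):
    """Multiclass SVM (hinge) loss for one example."""
    margins = np.maximum(0.0, scores - scores[label] + margin)
    margins[label] = 0.0                # the correct class contributes nothing
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])     # hypothetical class scores
loss_softmax = softmax_cross_entropy(scores, label=0)
loss_svm = svm_hinge(scores, label=0)
```

The hinge loss is zero once the correct score beats every other score by the margin, while the softmax loss keeps decreasing as the correct class gains probability mass.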

cs231n## neural-networks-2
