cs231n: neural-networks-2

agenda

  • Setting up the data and the model
    • Data preprocessing
    • Weight initialization
    • Regularization
  • Loss functions

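Data preprocessing

  • Mean subtraction
    Suppose X is an image of shape (h, w, c): X -= np.mean(X). To subtract a separate mean per color channel, use X -= np.mean(X, axis = (0, 1)). The main benefit of mean subtraction is that the processed image data has zero mean. Take an image with only two pixels as an example: X = [a, b], Y = X - np.mean(X) = [a - (a + b)/2, b - (a + b)/2], so E(Y) = (a - (a + b)/2 + b - (a + b)/2)/2 = 0, while D(Y) = D(X) is unchanged. Mean subtraction only centers the data; it does not by itself make Y Gaussian. The zero-mean property is what the Xavier initialization derivation relies on later.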
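  • Normalization

    Normalization rescales the data so that every dimension has roughly the same scale: after zero-centering, divide each dimension by its standard deviation, X /= np.std(X, axis = 0). It is only worth applying when different input features have different scales but should be of roughly equal importance to the learning algorithm. Image pixel values already lie in [0, 255], so normalization is usually unnecessary for images.
    [Figure from the original cs231n notes: effect of the preprocessing steps]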

    PCA/Whitening is not covered here (In practice. We mention PCA/Whitening in these notes for completeness, but these transformations are not used with Convolutional Networks.)
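The two preprocessing steps above can be sketched in NumPy as follows. This is only an illustration: the function name `preprocess`, the `(N, D)` data layout, the `eps` guard against division by zero, and the random test data are assumptions made for the example and do not come from the original notes.

```python
import numpy as np

def preprocess(X, eps=1e-8):
    """Zero-center and normalize a data matrix X of shape (N, D).

    Hypothetical helper: statistics are computed per feature (axis=0).
    """
    X = X.astype(np.float64)
    mean = X.mean(axis=0)                # per-feature mean
    X_centered = X - mean                # mean subtraction
    std = X_centered.std(axis=0)         # per-feature standard deviation
    X_normalized = X_centered / (std + eps)
    return X_centered, X_normalized, mean, std

rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(100, 12))  # fake "pixel" features
X_centered, X_normalized, mean, std = preprocess(X)

# Per-channel mean for a batch of images of shape (N, h, w, c):
images = rng.uniform(0, 255, size=(10, 4, 4, 3))
channel_mean = images.mean(axis=(0, 1, 2))   # one value per color channel
images_centered = images - channel_mean
```

In practice the mean and standard deviation would be computed on the training split only and then reused for validation and test data.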
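  • Weight initialization
    Two incorrect ways to initialize:
    1. All zeros. Every neuron then computes the same output and receives the same gradient, so all parameters undergo identical updates on every backward pass and the neurons never become different from one another.
    2. Random numbers too close to 0. The gradient that a layer passes backward is proportional to its weights, so very small weights produce very small gradients. The signal shrinks as it propagates backward, the parameters barely update, and training struggles to converge.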
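The symmetry problem can be checked numerically. The sketch below assumes a tiny two-layer network with a tanh hidden layer and a squared-error loss; the network, the function `two_layer_grads`, and all sizes are made up for illustration. With a constant initialization every hidden unit receives the same gradient, and with all zeros the gradients vanish altogether.

```python
import numpy as np

def two_layer_grads(X, y, W1, W2):
    """Gradients of 0.5 * mean squared error for out = tanh(X @ W1) @ W2."""
    h = np.tanh(X @ W1)                  # hidden activations, shape (N, H)
    out = h @ W2                         # predictions, shape (N, 1)
    d_out = (out - y) / len(X)           # dLoss/d_out
    dW2 = h.T @ d_out                    # shape (H, 1)
    d_h = (d_out @ W2.T) * (1.0 - h**2)  # backprop through tanh
    dW1 = X.T @ d_h                      # shape (D, H)
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
y = rng.standard_normal((8, 1))

# Constant initialization: all hidden units stay identical.
dW1_const, dW2_const = two_layer_grads(
    X, y, np.full((5, 4), 0.1), np.full((4, 1), 0.1))

# All-zero initialization: for this network no gradient flows at all.
dW1_zero, dW2_zero = two_layer_grads(
    X, y, np.zeros((5, 4)), np.zeros((4, 1)))
```

Every column of `dW1_const` is the same vector, so gradient descent can never make the hidden units differ.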
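  • Xavier initialization
    Random initialization has another problem: after one layer of neurons, the variance of the output distribution can grow substantially. The inputs to the next layer then differ widely in scale, which hurts the training of that layer's parameters: with such inputs, even a small weight update can make the loss oscillate violently.
    The idea is therefore to control the variance of the output. Draw the weights from a zero-mean distribution such as a Gaussian (zero mean means positive and negative contributions roughly balance, and the spread of the parameters is fixed by the variance). If the variance is still stable after the data passes through a layer, the goal is achieved.
    After one linear layer the output is y = Σ wi*xi + b. Assume the wi are i.i.d., the xi are i.i.d., W and X are independent of each other, X is zero-mean after mean subtraction, and b is a constant. Then E(xi) = E(wi) = 0. N is the number of rows of W, i.e. the number of inputs.
    D(y) = D(Σ wi*xi + b) = Σ D(wi*xi) = Σ [E(wi)^2*D(xi) + E(xi)^2*D(wi) + D(wi)*D(xi)] = Σ D(wi)*D(xi) = N*D(wi)*D(xi)
    To keep the variance unchanged we need D(y) = D(xi), i.e. N*D(wi)*D(xi) = D(xi) -> D(wi) = 1/N. So if W is drawn from a zero-mean distribution with variance 1/N, the variance is preserved through the linear layer. One way to initialize is w = np.random.randn(n) / sqrt(n). Caffe in practice uses a different approach: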
    class XavierFiller : public Filler<Dtype> {
       public:
        explicit XavierFiller(const FillerParameter& param)
            : Filler<Dtype>(param) {}
        virtual void Fill(Blob<Dtype>* blob) {
          // set n by configuration
          Dtype scale = sqrt(Dtype(3) / n);
          // draw from a uniform distribution with E(W) = 0 and D(W) = 1/n
          caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
              blob->mutable_cpu_data());
          CHECK_EQ(this->filler_param_.sparse(), -1)
               << "Sparsity not supported by this Filler.";
        }
      };
    
      void caffe_rng_uniform(const int n, const Dtype a, const Dtype b, Dtype* r) {
        CHECK_GE(n, 0);
        CHECK(r);
        CHECK_LE(a, b);
        boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
        boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
            variate_generator(caffe_rng(), random_distribution);
        for (int i = 0; i < n; ++i) {
          r[i] = variate_generator();
        }
      }
    
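    As the code shows, scale = sqrt(3/n), and caffe_rng_uniform draws the requested number of values from a uniform distribution on [-scale, scale]. The probability density of the uniform distribution on [a, b] is:
    f(x) = 1/(b - a) for a <= x <= b, and 0 otherwise
    Its mean and variance are:
    E(X) = (a + b)/2, D(X) = (b - a)^2/12

    Here b = scale = sqrt(3/n) and a = -scale = -sqrt(3/n), so D(W) = 1/12*(b - a)^2 = 1/12*(2*sqrt(3/n))^2 = 1/12*4*3/n = 1/n and E(W) = 1/2*(a + b) = 0. W is therefore uniformly distributed (not Gaussian) with zero mean and variance 1/n, which is all the variance argument requires. This is how Caffe's XavierFiller guarantees that the variance of W equals 1/n.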
    When the activation function is the commonly used ReLU, see "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", which recommends w = np.random.randn(n) * sqrt(2.0/n).
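The variance argument in this section can be checked empirically. The following sketch assumes zero-mean, unit-variance inputs and arbitrary layer sizes (`n`, `m`, `samples`); it compares unscaled Gaussian weights, Gaussian weights with variance 1/n, uniform weights on [-sqrt(3/n), sqrt(3/n)], and the sqrt(2/n) scaling applied to ReLU-rectified inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, samples = 512, 256, 2000    # fan-in, fan-out, batch size (arbitrary)

x = rng.standard_normal((samples, n))    # E(x) = 0, D(x) = 1

def output_variance(inputs, W):
    """Empirical variance of all outputs of the linear layer inputs @ W."""
    return float(np.var(inputs @ W))

W_naive = rng.standard_normal((n, m))                  # D(w) = 1
W_gauss = rng.standard_normal((n, m)) / np.sqrt(n)     # D(w) = 1/n
scale = np.sqrt(3.0 / n)
W_uniform = rng.uniform(-scale, scale, size=(n, m))    # D(w) = 1/n as well
W_he = rng.standard_normal((n, m)) * np.sqrt(2.0 / n)  # D(w) = 2/n

var_naive = output_variance(x, W_naive)        # grows to roughly n
var_gauss = output_variance(x, W_gauss)        # stays close to D(x) = 1
var_uniform = output_variance(x, W_uniform)    # stays close to D(x) = 1

# ReLU zeroes about half of the inputs, so E(x^2) drops to about 0.5;
# the factor 2 in the weight variance compensates for that.
x_relu = np.maximum(x, 0.0)
var_he = output_variance(x_relu, W_he)         # close to 1 again
```

The uniform and Gaussian variants give the same output variance because the derivation only uses the mean and variance of W, not the shape of its distribution.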
  • Regularization
    Regularization is a technique for controlling overfitting (other techniques include ReLU and data augmentation). As mentioned in neural-networks-1, overfitting is not a reason to use a smaller network. Regularization methods include L2, L1, max norm constraints, and dropout.
    • L2 Regularization
      The most common form, applied to the loss function: new_loss = loss + λ/2n*Σ w^2. Intuitively, L2 penalizes large w heavily and prefers diffuse, small w, so that noise has less influence (we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot).
      Another point is that under L2 the extra term in the weight update is linear in w: d new_loss/dw = dl/dw + λ/n*w, so the update becomes
      w_new = w - learning_rate*dl/dw - learning_rate*λ/n*w, in which every weight decays linearly toward zero. Now look at Caffe's L2 implementation:
    void SGDSolver<Dtype>::ApplyUpdate() {
      CHECK(Caffe::root_solver());
      Dtype rate = GetLearningRate();
      for (int param_id = 0; param_id < this->net_->learnable_params().size();
           ++param_id) {
        // normalization
        Normalize(param_id);
        // regularization
        Regularize(param_id);
        // only computes the update value; the weights are not changed yet
        ComputeUpdateValue(param_id, rate);
      }
      // the weights are actually updated here
      this->net_->Update();
    }
    
    
    case Caffe::CPU: {
    if (local_decay) {
      if (regularization_type == "L2") {
        // add weight decay: y = a*x + y, i.e. diff += local_decay * w
        caffe_axpy(net_params[param_id]->count(),
            local_decay,
            net_params[param_id]->cpu_data(),
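            // mutable pointer to the gradient (a static_cast internally): diff += local_decay * w. This adds the derivative of the L2 term directly to the gradient, which has the same effect as adding the L2 term to the loss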
            net_params[param_id]->mutable_cpu_diff());
      } 
    
      void SGDSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
      const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
      const vector<float>& net_params_lr = this->net_->params_lr();
      // momentum, which helps the optimizer escape local minima
      Dtype momentum = this->param_.momentum();
      Dtype local_rate = rate * net_params_lr[param_id];
      // Compute the update to history, then copy it to the parameter diff.
      switch (Caffe::mode()) {
      case Caffe::CPU: {
        // update the history first: y = a*x + b*y, i.e. history = local_rate * diff (including the L2 term) + momentum * history
        caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
                  net_params[param_id]->cpu_diff(), momentum,
                  history_[param_id]->mutable_cpu_data());
        // copy the history into the parameter diff; the weights themselves are updated later in net_->Update()
        caffe_copy(net_params[param_id]->count(),
            history_[param_id]->cpu_data(),
            net_params[param_id]->mutable_cpu_diff());
        break;
      }
    
    L1 is similar to L2 and is not covered in detail here.
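The three steps in the Caffe excerpts (add the L2 term to the gradient, fold the gradient into the momentum history, subtract the history from the weights) can be mirrored in a few lines of NumPy. This is a simplified sketch, not Caffe's code: the function `sgd_l2_momentum_step`, the quadratic toy loss, and the hyperparameter values are assumptions, and `weight_decay` plays the role of λ/n in the formulas above.

```python
import numpy as np

def sgd_l2_momentum_step(w, grad, history, lr, weight_decay, momentum):
    """One SGD step with L2 weight decay and momentum (hypothetical helper)."""
    # Regularize: add the derivative of (weight_decay / 2) * ||w||^2
    grad = grad + weight_decay * w
    # ComputeUpdateValue: history = lr * grad + momentum * history
    history = lr * grad + momentum * history
    # Update: subtract the accumulated update from the weights
    w = w - history
    return w, history

# Toy problem: data loss 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([1.0, -2.0, 3.0])
w = np.zeros_like(target)
history = np.zeros_like(target)
lr, weight_decay, momentum = 0.1, 0.1, 0.9

for _ in range(500):
    data_grad = w - target
    w, history = sgd_l2_momentum_step(
        w, data_grad, history, lr, weight_decay, momentum)
```

On this toy loss the regularized optimum is target / (1 + weight_decay), so the weights settle slightly closer to zero than the unregularized solution.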
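  • Max norm constraints
    A max norm constraint places an upper bound on the norm of each neuron's weight vector. This effectively bounds the parameter updates, prevents the weights from growing too much, and to some extent prevents training from oscillating when the learning rate is too high. The Keras implementation serves as a reference here.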
 class MaxNorm(Constraint):
  """MaxNorm weight constraint.
  Constrains the weights incident to each hidden unit
  to have a norm less than or equal to a desired value.
  # Arguments
      m: the maximum norm for the incoming weights.
      axis: integer, axis along which to calculate weight norms.
          For instance, in a `Dense` layer the weight matrix
          has shape `(input_dim, output_dim)`,
          set `axis` to `0` to constrain each weight vector
          of length `(input_dim,)`.
          In a `Conv2D` layer with `data_format="channels_last"`,
          the weight tensor has shape
          `(rows, cols, input_depth, output_depth)`,
          set `axis` to `[0, 1, 2]`
          to constrain the weights of each filter tensor of size
          `(rows, cols, input_depth)`.
  # References
      - [Dropout: A Simple Way to Prevent Neural Networks from Overfitting Srivastava, Hinton, et al. 2014](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
  """

  def __init__(self, max_value=2, axis=0):
      self.max_value = max_value
      self.axis = axis

  def __call__(self, w):
      # from . import backend as K
      norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
      # clip the norms to the range [0, max_value]
      desired = K.clip(norms, 0, self.max_value)
      # K.epsilon() is a small constant (_EPSILON = 1e-7), the fuzz factor used in numeric expressions; here it avoids division by zero
      w *= (desired / (K.epsilon() + norms))
      return w

  def get_config(self):
      return {'max_value': self.max_value,
              'axis': self.axis}
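The same projection can be written directly in NumPy, which makes its effect easy to inspect. The function `max_norm` below is a hypothetical stand-alone re-implementation of the `__call__` logic above, and the example matrix is chosen so that the column norms are easy to verify by hand.

```python
import numpy as np

def max_norm(w, max_value=2.0, axis=0, eps=1e-7):
    """Rescale weight vectors along `axis` so their L2 norm is at most max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)   # norms above max_value are capped
    return w * (desired / (eps + norms))       # vectors within the bound barely change

# Each column is the incoming weight vector of one unit.
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])                     # column norms: 5.0 and 0.5
W_clipped = max_norm(W, max_value=2.0, axis=0)
clipped_norms = np.linalg.norm(W_clipped, axis=0)
```

The first column is shrunk to norm 2 while keeping its direction; the second column is already inside the bound and is left essentially untouched.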
  • Dropout
    Dropout is an extremely effective regularization technique (VGG16 uses it, for example); see "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The idea is to reduce co-adaptation between neurons: during training, units are switched off at random according to a threshold, so each pass trains only part of the network, which greatly reduces the influence of noise.
    Further reading on dropout: the Dropout paper above and "Dropout Training as Adaptive Regularization".

    All in all, in practice: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of (p = 0.5) is a reasonable default, but this can be tuned on validation data.

Again, let's look at how Caffe actually implements dropout:

void DropoutLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
    const Dtype* bottom_data = bottom[0]->cpu_data();
    Dtype* top_data = top[0]->mutable_cpu_data();
    unsigned int* mask = rand_vec_.mutable_cpu_data();
    const int count = bottom[0]->count();
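    // the mask is applied only during training
    if (this->phase_ == TRAIN) {
      // draw a Bernoulli mask: each unit is kept with probability 1 - threshold_
      caffe_rng_bernoulli(count, 1. - threshold_, mask);
      // whether scaling at training time is enabled in the prototxt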
      if (scale_train_) {
        for (int i = 0; i < count; ++i) {
          top_data[i] = bottom_data[i] * mask[i] * scale_;
        }
      } else {
        for (int i = 0; i < count; ++i) {
          // apply the mask to produce the top data
          top_data[i] = bottom_data[i] * mask[i];
        }
      }
    } else {
      caffe_copy(bottom[0]->count(), bottom_data, top_data);
      if (!scale_train_) {
        caffe_scal<Dtype>(  count, 1. / scale_, top_data);
      }
    }
  }
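The branch that scales during training corresponds to what is usually called inverted dropout. A minimal NumPy sketch is shown below, under the assumption that the scale factor equals 1 / (1 - drop_prob); the function `dropout_forward` and its parameters are hypothetical and not part of Caffe.

```python
import numpy as np

def dropout_forward(x, drop_prob=0.5, train=True, rng=None):
    """Inverted dropout: scale at training time so test time is a plain copy."""
    if not train:
        return x                                       # nothing to do at test time
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.binomial(1, keep_prob, size=x.shape)    # 1 = keep, 0 = drop
    return x * mask / keep_prob                        # rescale the survivors

rng = np.random.default_rng(0)
x = np.ones((1000, 1000))
out_train = dropout_forward(x, drop_prob=0.5, train=True, rng=rng)
out_test = dropout_forward(x, drop_prob=0.5, train=False)
```

Because the kept activations are scaled up during training, the expected activation matches the test-time activation and no extra scaling is needed at inference.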
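  • Loss functions
    The common classification losses were covered earlier: the SVM loss and softmax with cross-entropy. When the number of classes is very large, consider Hierarchical Softmax. For attribute classification or regression problems, first consider whether the problem can be converted into a set of independent classification problems, because applying the L2 squared loss directly to a regression problem is harder to train and more fragile (Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients).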
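The remark about outliers can be illustrated by comparing gradients. The sketch below assumes a single example with three class scores; the helper names `softmax_cross_entropy` and `l2_loss` and the numbers are made up for illustration.

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy loss and its gradient w.r.t. the scores of one example."""
    shifted = scores - np.max(scores)                  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    loss = -log_probs[label]
    grad = np.exp(log_probs)                           # softmax probabilities
    grad[label] -= 1.0                                 # p - one_hot(label)
    return float(loss), grad

def l2_loss(pred, target):
    """Squared L2 loss and its gradient w.r.t. the prediction."""
    diff = pred - target
    return float(np.sum(diff ** 2)), 2.0 * diff

# Uniform scores: the loss is log(3) and the gradient is p - one_hot.
loss_uniform, grad_uniform = softmax_cross_entropy(np.zeros(3), label=0)

# A wildly wrong score: every softmax gradient entry still lies in [-1, 1].
_, grad_outlier = softmax_cross_entropy(np.array([100.0, 0.0, 0.0]), label=1)

# The L2 gradient, in contrast, grows linearly with the size of the error.
_, grad_l2 = l2_loss(np.array([100.0]), np.array([0.0]))
```

The bounded softmax gradient is one way to see why a single outlier disturbs a classification loss less than it disturbs an L2 regression loss.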