Why does image normalization use the mean of the whole dataset?

This is a question I came across on StackExchange, and it's quite interesting:

Question

Original:
There are some variations on how to normalize the images but most seem to use these two methods:

  1. Subtract the mean per channel calculated over all images (e.g. VGG_ILSVRC_16_layers)
  2. Subtract the mean per pixel/channel calculated over all images (e.g. CNN_S, also see Caffe's reference network)

The natural approach would in my mind be to normalize each image. An image taken in broad daylight will cause more neurons to fire than a night-time image, and while it may inform us of the time of day, we usually care about more interesting features present in the edges etc.

Pierre Sermanet refers in 3.3.3 to local contrast normalization, which would be per-image based, but I haven't come across this in any of the examples/tutorials that I've seen. I've also seen an interesting Quora question and Xiu-Shen Wei's post, but they don't seem to support the two approaches above.

What exactly am I missing? Is this a color normalization issue, or is there a paper that actually explains why so many use this approach?
Roughly, the question says:

There are some variations on how images are normalized, but most seem to use one of these two methods:

  1. Subtract the mean per channel, computed over all images (e.g. VGG_ILSVRC_16_layers)
  2. Subtract the mean per pixel/channel, computed over all images (e.g. CNN_S; see also Caffe's reference network)

The more natural approach would seem to be normalizing each image on its own. An image taken in broad daylight will cause more neurons to fire than a night-time image; while that may tell us the time of day, we usually care about the more interesting features present in edges and so on. Is this a color-normalization issue, or is there a paper that actually explains why so many people normalize this way?
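To make the two dataset-level schemes above concrete, here is a minimal NumPy sketch (my own illustration, not part of the original post). It assumes `images` is a float32 array of shape (N, H, W, C) holding the whole training set; the random placeholder data is only there so the snippet runs on its own.

```python
import numpy as np

# Placeholder training set: N images of shape (H, W, C).
images = np.random.rand(100, 32, 32, 3).astype(np.float32)

# Method 1: subtract the mean per channel, computed over all images
# (one scalar per channel, as in VGG_ILSVRC_16_layers).
channel_mean = images.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, C)
normalized_per_channel = images - channel_mean

# Method 2: subtract the mean per pixel/channel, computed over all images
# (a full "mean image", as in CNN_S and Caffe's reference network).
mean_image = images.mean(axis=0, keepdims=True)             # shape (1, H, W, C)
normalized_per_pixel = images - mean_image
```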

Answer

Original:
Subtracting the dataset mean serves to "center" the data. Additionally, you ideally would like to divide by the stddev of that feature or pixel as well if you want to normalize each feature value to a z-score.

The reason we do both of those things is because in the process of training our network, we're going to be multiplying (weights) and adding to (biases) these initial inputs in order to cause activations that we then backpropagate with the gradients to train the model.

We'd like each feature to have a similar range in this process, so that our gradients don't go out of control (and so that we only need one global learning rate multiplier).

Another way you can think about it is that deep learning networks traditionally share many parameters; if you didn't scale your inputs in a way that resulted in similarly-ranged feature values (i.e. over the whole dataset, by subtracting the mean), sharing wouldn't happen very easily, because to one part of the image the weight w would be a lot and to another it would be too small.

You will see in some CNN models that per-image whitening is used, which is more along the lines of your thinking.

Roughly, the answer says:
During backpropagation we want every feature to have a similar range, so that the gradients don't go out of control (and so that a single global learning rate is enough). Another way to think about it is that deep learning networks traditionally share many parameters: if the inputs are not scaled so that feature values fall in similar ranges (i.e. by subtracting the mean over the whole dataset), this sharing does not happen very easily, because the weight w would be large for one part of the image and too small for another. In some CNN models you will see per-image whitening used instead.
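For contrast, here is a small NumPy sketch (again my own illustration, under the same assumptions about `images` as above) of the dataset-level z-score normalization the answer describes, next to the per-image whitening it mentions at the end. In a real pipeline the mean and standard deviation would be computed on the training split only and reused at evaluation time.

```python
import numpy as np

images = np.random.rand(100, 32, 32, 3).astype(np.float32)  # placeholder data, shape (N, H, W, C)

# Dataset-level z-score: center with the per-channel mean and scale by the
# per-channel standard deviation, both computed over the whole training set.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)
z_scored = (images - mean) / (std + 1e-7)   # epsilon guards against division by zero

# Per-image whitening: each image is standardized with its own mean and
# standard deviation, independent of the rest of the dataset.
per_image_mean = images.mean(axis=(1, 2, 3), keepdims=True)
per_image_std = images.std(axis=(1, 2, 3), keepdims=True)
whitened = (images - per_image_mean) / (per_image_std + 1e-7)
```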
