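Scaling vs. Normalization: Similarities and Differences

Preface

When working with data you often run into two terms, scaling and normalization. They are frequently used interchangeably, which has confused me more than once when trying to understand what an operation actually does. Here I use R's scale() function to explain what the two terms really mean.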

Main text

One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in-depth about each of these options.
The conclusion first: scaling changes the range of the data, while normalization changes the distribution of the data.

Concepts

Scale

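Scaling means transforming your data so that it fits within a specified range, such as 0-100 or 0-1. You need to scale when you use a method that depends on the magnitude of the values, for example distance-based methods such as SVM or KNN.

Scaling example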
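As a minimal sketch of what rescaling to a fixed range can look like, the snippet below applies min-max scaling to a small vector. The helper name min_max_scale and the sample values are assumptions made for illustration; they do not come from base R or from the original post, and the helper assumes the input is not constant (otherwise it divides by zero).

```r
# Hypothetical helper (not part of base R): rescale x linearly to [new_min, new_max]
min_max_scale <- function(x, new_min = 0, new_max = 1) {
  rng <- range(x, na.rm = TRUE)  # c(min, max) of the input
  (x - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min) + new_min
}

values <- c(10, 20, 35, 50, 100)  # made-up sample values
scaled <- min_max_scale(values)

print(scaled)
print(range(scaled))  # the range is now 0 to 1
```

Only the range changes here: the points keep their order and their relative spacing.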

Normalization

Scaling only changes the range of your data; normalization is a more radical transformation.
The goal of normalization is to transform your data so that it follows a normal distribution, so that it can be used in downstream analyses that assume normality (t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes).

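To make "changing the shape of the distribution" concrete, here is a rough sketch that uses a log transform on right-skewed data. The log transform is just one possible normalizing transformation, chosen here as an assumption for illustration; the skewness helper and the simulated data are also hypothetical and not part of the original post.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Right-skewed sample: log-normal draws
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)

# The log transform changes the shape of the distribution, not just its range
normalized <- log(skewed)

# Hypothetical helper: sample skewness (close to 0 for a symmetric distribution)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

cat("skewness before:", skewness(skewed), "\n")
cat("skewness after: ", skewness(normalized), "\n")
```

The skewness drops sharply after the transform, which a pure rescaling could never achieve.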

Working in R

First, look up the usage of the scale function in the R console:

?scale
## This gives the following description
The value of center determines how column centering is performed. If center is a numeric-alike vector with length equal to the number of columns of x, then each column of x has the corresponding value from center subtracted from it. If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns, and if center is FALSE, no centering is done.

The value of scale determines how column scaling is performed (after centering). If scale is a numeric-alike vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from scale. If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise. If scale is FALSE, no scaling is done.

The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
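The difference between the root mean square and the standard deviation described above is easy to check on a small matrix. This is only a sketch: the rms helper is a hypothetical function written from the formula in the help text, not something scale() exports.

```r
x <- matrix(1:20, ncol = 4)

# Hypothetical helper implementing the documented formula sqrt(sum(x^2) / (n - 1))
rms <- function(v) {
  v <- v[!is.na(v)]
  sqrt(sum(v^2) / (length(v) - 1))
}

# Without centering, scale = TRUE divides each column by its root mean square
no_center <- scale(x, center = FALSE, scale = TRUE)
print(attr(no_center, "scaled:scale"))
print(apply(x, 2, rms))  # same values as the attribute above
print(apply(x, 2, sd))   # the column standard deviations, which differ

# Scaling by the standard deviation without centering, as the help text suggests
by_sd <- scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE))
print(attr(by_sd, "scaled:scale"))
```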

The usage of the function is scale(x, center = TRUE/FALSE, scale = TRUE/FALSE), where x is a numeric matrix, so let's work through an example.

> x <- matrix(1:20, ncol = 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
> scale(x, center = T, scale = T)
           [,1]       [,2]       [,3]       [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000  0.0000000  0.0000000
[4,]  0.6324555  0.6324555  0.6324555  0.6324555
[5,]  1.2649111  1.2649111  1.2649111  1.2649111
attr(,"scaled:center")
[1]  3  8 13 18
attr(,"scaled:scale")
[1] 1.581139 1.581139 1.581139 1.581139
> scale(x, center = T, scale = F)
     [,1] [,2] [,3] [,4]
[1,]   -2   -2   -2   -2
[2,]   -1   -1   -1   -1
[3,]    0    0    0    0
[4,]    1    1    1    1
[5,]    2    2    2    2
attr(,"scaled:center")
[1]  3  8 13 18
> scale(x, center = T, scale = F)/sd(scale(x, center = T, scale = F)[1:5])
           [,1]       [,2]       [,3]       [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,]  0.0000000  0.0000000  0.0000000  0.0000000
[4,]  0.6324555  0.6324555  0.6324555  0.6324555
[5,]  1.2649111  1.2649111  1.2649111  1.2649111
attr(,"scaled:center")
[1]  3  8 13 18
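The last command in the session above divides the whole matrix by the standard deviation of the first column only, which works here because every column happens to have the same standard deviation. A more general sketch, assuming you want to reproduce scale() by hand column by column, could use sweep():

```r
x <- matrix(1:20, ncol = 4)

col_means <- colMeans(x)
col_sds   <- apply(x, 2, sd)

# Subtract each column's mean, then divide each column by its own standard deviation
centered <- sweep(x, 2, col_means, FUN = "-")
z_manual <- sweep(centered, 2, col_sds, FUN = "/")

z_scale <- scale(x, center = TRUE, scale = TRUE)

# check.attributes = FALSE ignores the "scaled:center" and "scaled:scale" attributes
print(all.equal(z_manual, z_scale, check.attributes = FALSE))
```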

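From this we can see that scale() actually does two things: centering and scaling. Centering subtracts each column's mean, and scaling divides the centered data by that column's standard deviation. Together this is the z-score standardization z = \frac{X - \mu}{\sigma}, which gives each column a mean of 0 and a standard deviation of 1; it does not by itself turn the data into a normal distribution. The plots below show the transformation step by step.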

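# Draw 100 uniform random values between 10 and 100
data <- runif(100, min = 10, max = 100)

# Plot the raw values, the centered values, then the centered and scaled values
plot(1:100, data)
plot(1:100, scale(data, center = TRUE, scale = FALSE))
plot(1:100, scale(data, center = TRUE, scale = TRUE))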
[Figures: raw_data (raw values), data_center (centered), data_center_scale (centered and scaled)]
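The same point can be checked numerically instead of visually. The sketch below regenerates comparable uniform data (the seed is an arbitrary assumption, so the values will not match the plots above) and confirms that centering and scaling move and stretch the points without reordering them.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
data <- runif(100, min = 10, max = 100)

# scale() returns a one-column matrix for a vector; as.vector() drops the attributes
z <- as.vector(scale(data, center = TRUE, scale = TRUE))

cat("mean of z:", mean(z), "\n")  # approximately 0, up to floating-point error
cat("sd of z:  ", sd(z), "\n")

# A linear transformation keeps the relative spacing of the points,
# so the correlation with the raw data stays at 1
cat("correlation with raw data:", cor(data, z), "\n")
```

The data are still uniformly distributed after scale(); only the location and spread have changed.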

Conclusion

The center and scale arguments of R's scale() function have to be set correctly for your data to be processed the way you intend.
