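Preface
When processing data, you often run into two terms, scaling and normalization. They are frequently used interchangeably, which has confused me more than once when trying to understand certain operations, so here I use R's scale() function to explain what the two terms actually mean.
Main text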
One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in-depth about each of these options.
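The conclusion first: scaling changes the range of the data, while normalization changes the shape of its distribution.
Concepts
Scale
Scaling means transforming your data to fit within a specified range, such as 1-100 or 0-1. You need it when you use methods that depend on the magnitude of the values, that is, on distances between data points, such as SVM or KNN.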
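As a minimal sketch of this kind of range change, the snippet below maps a vector linearly onto a target interval (min-max rescaling). The helper name `rescale_range` and the sample values are assumptions made for illustration; it is not a base R function.

```r
# Min-max rescaling: map a numeric vector linearly onto [new_min, new_max].
# rescale_range is a hypothetical helper, not part of base R.
rescale_range <- function(v, new_min = 0, new_max = 1) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) stop("v is constant, so it cannot be rescaled")
  (v - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min) + new_min
}

v <- c(10, 25, 40, 100)                  # made-up sample values
scaled01  <- rescale_range(v)            # values now lie in [0, 1]
scaled100 <- rescale_range(v, 1, 100)    # values now lie in [1, 100]
```

Only the range changes here: the relative spacing between the points is preserved.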

Normalization
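Scaling only changes the range of your data; normalization is a more radical transformation.
The goal of normalization is to transform your data so that it can be described by a normal distribution, for downstream analyses that assume normally distributed data (t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes).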
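To make the idea of changing the shape of a distribution concrete, here is a small sketch on simulated data. It assumes the data is log-normal, in which case a log transform yields a normal distribution by construction; real data may need a different transformation (Box-Cox is a common choice). The `skewness` helper is hypothetical and defined here only for the comparison.

```r
set.seed(42)

# Simulated right-skewed data (log-normal by construction; an assumption
# made so that the log transform is known to be the right choice).
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)

# The log transform changes the shape of the distribution.
normalized <- log(skewed)

# Hypothetical helper: sample skewness (0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_before <- skewness(skewed)      # strongly positive
skew_after  <- skewness(normalized)  # close to 0
```

Plotting `hist(skewed)` next to `hist(normalized)` shows the same thing visually.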

Working in R
First, look up the usage of the scale function in the R console:
?scale
## which gives the following description
The value of center determines how column centering is performed. If center is a numeric-alike vector with length equal to the number of columns of x, then each column of x has the corresponding value from center subtracted from it. If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns, and if center is FALSE, no centering is done.
The value of scale determines how column scaling is performed (after centering). If scale is a numeric-alike vector with length equal to the number of columns of x, then each column of x is divided by the corresponding value from scale. If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise. If scale is FALSE, no scaling is done.
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
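The root-mean-square remark in the help text is easy to overlook, so here is a small check of it. The matrix `m` is made-up data; the snippet compares the divisor that scale() uses when center = FALSE with the formula quoted above and with the column standard deviations.

```r
m <- matrix(c(1, 2, 3, 4, 5,
              10, 20, 30, 40, 50), ncol = 2)  # made-up data

# Without centering, scale = TRUE divides by the root-mean-square.
no_center <- scale(m, center = FALSE, scale = TRUE)
rms <- attr(no_center, "scaled:scale")

# Root-mean-square as defined in the help text: sqrt(sum(x^2) / (n - 1)).
rms_manual <- apply(m, 2, function(col) sqrt(sum(col^2) / (length(col) - 1)))

# Column standard deviations, which differ from the RMS for uncentered data.
col_sd <- apply(m, 2, sd)

# To divide by the standard deviation without centering, pass it explicitly.
by_sd <- scale(m, center = FALSE, scale = col_sd)
```

So `center = FALSE, scale = TRUE` does not divide by the standard deviation; the explicit form in the last line does.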
We can also see that the usage is scale(matrix, center = T/F, scale = T/F), so let's walk through an example.
> x <- matrix(1:20, ncol = 4)
> x
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> scale(x, center = T, scale = T)
[,1] [,2] [,3] [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,] 0.0000000 0.0000000 0.0000000 0.0000000
[4,] 0.6324555 0.6324555 0.6324555 0.6324555
[5,] 1.2649111 1.2649111 1.2649111 1.2649111
attr(,"scaled:center")
[1] 3 8 13 18
attr(,"scaled:scale")
[1] 1.581139 1.581139 1.581139 1.581139
> scale(x, center = T, scale = F)
[,1] [,2] [,3] [,4]
[1,] -2 -2 -2 -2
[2,] -1 -1 -1 -1
[3,] 0 0 0 0
[4,] 1 1 1 1
[5,] 2 2 2 2
attr(,"scaled:center")
[1] 3 8 13 18
> scale(x, center = T, scale = F)/sd(scale(x, center = T, scale = F)[1:5])
[,1] [,2] [,3] [,4]
[1,] -1.2649111 -1.2649111 -1.2649111 -1.2649111
[2,] -0.6324555 -0.6324555 -0.6324555 -0.6324555
[3,] 0.0000000 0.0000000 0.0000000 0.0000000
[4,] 0.6324555 0.6324555 0.6324555 0.6324555
[5,] 1.2649111 1.2649111 1.2649111 1.2649111
attr(,"scaled:center")
[1] 3 8 13 18
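Note that the last command divides every column by the standard deviation of the first column only, which works here because all four columns happen to have the same standard deviation. A more general sketch, assuming columns with different spreads (the matrix `y` is made-up data), does the two steps per column with sweep():

```r
y <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(10, 30, 50, 70, 90))  # made-up data, different spreads

# Step 1: center each column by subtracting its mean.
centered <- sweep(y, 2, colMeans(y), "-")

# Step 2: divide each centered column by its own standard deviation.
col_sd <- apply(centered, 2, sd)
manual <- sweep(centered, 2, col_sd, "/")

# The built-in function does both steps at once.
builtin <- scale(y, center = TRUE, scale = TRUE)
```

`manual` and `builtin` hold the same values; `builtin` additionally carries the "scaled:center" and "scaled:scale" attributes.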
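From this we can see that scale() actually does two things, centering and scaling: centering subtracts each column's mean, and scaling divides the centered data by that column's standard deviation, that is, z = (x - mean(x)) / sd(x). This is standardization (a z-score): each column ends up with mean 0 and standard deviation 1. Because the transformation is linear, it does not change the shape of the distribution, so it does not by itself make the data normally distributed. The plots below show the transformation step by step.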
data <- runif(100, min = 10, max = 100)
plot(1:100, data)
plot(1:100, scale(data, center = T, scale = F))
plot(1:100, scale(data, center = T, scale = T))
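The same steps can also be checked numerically instead of by eye. The sketch below uses its own seeded uniform sample (the seed and the variable names are assumptions for reproducibility) and confirms that the fully scaled data has mean 0 and standard deviation 1 while remaining a linear function of the original values.

```r
set.seed(1)
u <- runif(100, min = 10, max = 100)

# scale() returns a one-column matrix for a vector; flatten it again.
z <- as.numeric(scale(u, center = TRUE, scale = TRUE))

# The same result by hand.
z_manual <- (u - mean(u)) / sd(u)

z_mean <- mean(z)           # approximately 0
z_sd   <- sd(z)             # 1
shape_check <- cor(u, z)    # 1: z is a positive linear function of u
```

A correlation of 1 means the points keep their relative positions: the uniform sample is still uniform after scaling, only its location and spread have changed.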



Conclusion
The center and scale arguments of R's scale() function have to be used correctly in order to process your data properly.