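Clustering
Clustering means assigning samples to groups so that samples within the same group differ as little as possible, while samples in different groups differ as much as possible (admittedly a rather vague definition). I am not sure what an exam could actually ask about clustering.
For clustering, the slides and the homework seem to cover hierarchical clustering, which R provides through the hclust function. One thing to note is that hclust supports several agglomeration methods; if needed, pick the one the question asks for.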
method
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
The input to hclust is the set of pairwise distances between samples, which the dist function computes. dist accepts different distance measures, such as Euclidean distance and Manhattan distance.
method
the distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
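To make the difference between the distance measures concrete, here is a minimal sketch on two made-up points (the matrix pts is hypothetical and chosen only so the results can be checked by hand).

```r
# Two hypothetical points chosen so the distances are easy to verify by hand
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclidean <- as.numeric(dist(pts, method = "euclidean"))  # sqrt(3^2 + 4^2)
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))  # |3| + |4|
d_maximum   <- as.numeric(dist(pts, method = "maximum"))    # max(|3|, |4|)

print(c(euclidean = d_euclidean, manhattan = d_manhattan, maximum = d_maximum))
```

The same pair of points gives 5, 7 and 4 under the three measures, so the choice of method can change which samples count as close.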
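To draw the resulting dendrogram, use plot.
An example from the homework: perform a simple hierarchical clustering on the petal attribute values of the widely used iris data.
# Prepare the data: column 5 of iris is the species, so it is left out
# (this keeps all four measurements; iris[,3:4] would select only the petal columns)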
> dat <- iris[,1:4]
> head(dat)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
# Then compute the distances with dist
> dat_dist <- dist(dat)
# Pass the distances to hclust and plot the result
> plot(hclust(dat_dist))
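The homework example uses the hclust defaults. As a sketch of how a specific method could be requested and how the tree could be turned into group labels, the following uses cutree, which is not mentioned in the notes; method = "average" and k = 3 are assumptions made only for illustration.

```r
dat <- iris[, 1:4]
dat_dist <- dist(dat)                       # Euclidean distance by default

# "average" is only an example; use whichever method the question names
hc <- hclust(dat_dist, method = "average")

# cutree() cuts the dendrogram into a fixed number of groups
# (k = 3 is an assumption, matching the three iris species)
groups <- cutree(hc, k = 3)

# Cross-tabulate the cluster labels against the known species
print(table(groups, iris$Species))
```

Calling plot(hc) on the same object draws the dendrogram for that method.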
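Clustering is covered in Chapter 16 of 《R語言實戰》 (R in Action), second edition.
That chapter also discusses scaling the data, which is worth a brief mention because a past exam question seems to have tested it. The code below comes from p. 343.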
# Scaling is done within each variable, independently of the other variables, and each column is one variable, so the MARGIN for apply is 2
> data <- data.frame(x1 = 100:105, x2 = 0:5)
> data
x1 x2
1 100 0
2 101 1
3 102 2
4 103 3
5 104 4
6 105 5
# Standardize each variable to mean 0 and standard deviation 1
> apply(data, 2, function(x){(x-mean(x))/sd(x)})
x1 x2
[1,] -1.3363062 -1.3363062
[2,] -0.8017837 -0.8017837
[3,] -0.2672612 -0.2672612
[4,] 0.2672612 0.2672612
[5,] 0.8017837 0.8017837
[6,] 1.3363062 1.3363062
# Divide each variable by its maximum value
> apply(data, 2, function(x){x/max(x)})
x1 x2
[1,] 0.9523810 0.0
[2,] 0.9619048 0.2
[3,] 0.9714286 0.4
[4,] 0.9809524 0.6
[5,] 0.9904762 0.8
[6,] 1.0000000 1.0
# Subtract the mean from each variable and divide by its median absolute deviation (MAD); note that mad() computes the median, not the mean, absolute deviation
> apply(data, 2, function(x){(x - mean(x)) / mad(x)})
x1 x2
[1,] -1.1241513 -1.1241513
[2,] -0.6744908 -0.6744908
[3,] -0.2248303 -0.2248303
[4,] 0.2248303 0.2248303
[5,] 0.6744908 0.6744908
[6,] 1.1241513 1.1241513
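Since mad is easy to misread as a mean absolute deviation, here is a small check of what it computes for the x2 column, assuming the default arguments: the median of the absolute deviations from the median, multiplied by the constant 1.4826.

```r
x <- 0:5

# Median absolute deviation computed by hand, with R's default scaling constant
mad_by_hand <- 1.4826 * median(abs(x - median(x)))

print(mad_by_hand)
print(mad(x))

# constant = 1 gives the unscaled median absolute deviation
mad_unscaled <- mad(x, constant = 1)
print(mad_unscaled)
```

Dividing (0 - 2.5) by the default mad(x) reproduces the -1.1241513 shown in the first row of the output above.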
# The first method can also be done with scale
> scale(data)
x1 x2
[1,] -1.3363062 -1.3363062
[2,] -0.8017837 -0.8017837
[3,] -0.2672612 -0.2672612
[4,] 0.2672612 0.2672612
[5,] 0.8017837 0.8017837
[6,] 1.3363062 1.3363062
attr(,"scaled:center")
x1 x2
102.5 2.5
attr(,"scaled:scale")
x1 x2
1.870829 1.870829
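As a quick sanity check, the sketch below confirms that the apply version and scale give the same values, and shows one way the standardized data could be fed into dist. Standardizing iris before clustering is an illustration here, not something the homework asks for.

```r
data <- data.frame(x1 = 100:105, x2 = 0:5)

by_hand  <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
by_scale <- scale(data)

# scale() attaches the centers and scales as attributes, so compare values only
same <- isTRUE(all.equal(by_hand, by_scale, check.attributes = FALSE))
print(same)

# Distances computed on standardized columns, so every variable is on the same scale
dist_scaled <- dist(scale(iris[, 1:4]))
```

Without scaling, a variable measured in large units contributes more to the Euclidean distance than one measured in small units.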
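Principal component analysis
For PCA, I will use one example to walk through the questions we might be asked (I do not understand PCA very well myself, so I follow the homework answers).
Run a PCA on the iris data
Performing the PCA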
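# Column 5 of iris is the species name, so drop it before running the PCA
# Remember to set cor = TRUE: princomp then works from the correlation matrix
# instead of the covariance matrix, which amounts to standardizing each variable first
> iris_pca <- princomp(iris[,1:4], cor = TRUE)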
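To see what cor = TRUE does in practice, this sketch compares the component standard deviations from princomp with the square roots of the eigenvalues of the correlation matrix. The variable eig is introduced here only for the comparison.

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# Eigenvalues of the correlation matrix are the variances of the components
eig <- eigen(cor(iris[, 1:4]))

print(unname(iris_pca$sdev))
print(sqrt(eig$values))
```

The two printed vectors agree, which is why the result does not depend on the units of the original variables.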
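How much variance each principal component explains
# Summary of the principal components
# The "Proportion of Variance" row shows how much of the variation each component explains
# The "Cumulative Proportion" row shows how much the first few components explain together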
> summary(iris_pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000
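The two proportion rows can be recomputed from the standard deviations, which may help if a question asks where they come from. A minimal sketch, with variable names chosen for illustration:

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

variances  <- iris_pca$sdev^2              # variance of each component
proportion <- variances / sum(variances)   # "Proportion of Variance"
cumulative <- cumsum(proportion)           # "Cumulative Proportion"

print(round(rbind(proportion, cumulative), 4))
```

The first two components together account for about 95.8% of the variance, matching the summary above.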
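Which variables are explained by PC1
# Read this from the loadings: every variable has a sizable loading on PC1, so all of them should be explained by PC1
# For PC2, the loadings of Petal.Length and Petal.Width are left blank: they exist, but the printout hides loadings with absolute value below 0.1, so PC2 explains little of these two variables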
> iris_pca$loadings
Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521  0.377  0.720  0.261
Sepal.Width  -0.269  0.923 -0.244 -0.124
Petal.Length  0.580        -0.142 -0.801
Petal.Width   0.565        -0.634  0.524
               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00
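The blank cells are only a display choice. As a sketch, the cutoff argument of the print method for loadings can be lowered to show every value, and unclass gives an ordinary matrix to index (load_mat is a name introduced here).

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# cutoff = 0 prints every loading, including the ones hidden by default
print(iris_pca$loadings, cutoff = 0)

# unclass() turns the loadings into an ordinary numeric matrix
load_mat <- unclass(iris_pca$loadings)
small <- abs(load_mat[c("Petal.Length", "Petal.Width"), "Comp.2"])
print(small)
```

Both values are below 0.1 in absolute value, which is why they do not appear in the default printout.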
The data after dimensionality reduction
> head(iris_pca$scores)
Comp.1 Comp.2 Comp.3 Comp.4
[1,] -2.264703 0.4800266 0.12770602 0.02416820
[2,] -2.080961 -0.6741336 0.23460885 0.10300677
[3,] -2.364229 -0.3419080 -0.04420148 0.02837705
[4,] -2.299384 -0.5973945 -0.09129011 -0.06595556
[5,] -2.389842 0.6468354 -0.01573820 -0.03592281
[6,] -2.075631 1.4891775 -0.02696829 0.00660818
# Keep only the data projected onto PC1 and PC2
> head(iris_pca$scores[,1:2])
Comp.1 Comp.2
[1,] -2.264703 0.4800266
[2,] -2.080961 -0.6741336
[3,] -2.364229 -0.3419080
[4,] -2.299384 -0.5973945
[5,] -2.389842 0.6468354
[6,] -2.075631 1.4891775
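The scores are the standardized data projected onto the loadings. The sketch below rebuilds them by hand, using the center and scale that princomp stores in its result. Note that princomp uses divisor n rather than n - 1 for the variance, so the stored scale differs slightly from what scale would compute by default.

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# Standardize with the center and scale stored by princomp
z <- scale(iris[, 1:4], center = iris_pca$center, scale = iris_pca$scale)

# Project onto the loadings to obtain the scores
scores_by_hand <- z %*% unclass(iris_pca$loadings)

matches <- isTRUE(all.equal(scores_by_hand, iris_pca$scores,
                            check.attributes = FALSE))
print(matches)
```

Keeping only the first two columns of scores_by_hand gives the same two-dimensional data as iris_pca$scores[,1:2].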
Write down PC1 (as a vector)
# Again, read it from the loadings
> iris_pca$loadings
Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.521  0.377  0.720  0.261
Sepal.Width  -0.269  0.923 -0.244 -0.124
Petal.Length  0.580        -0.142 -0.801
Petal.Width   0.565        -0.634  0.524
               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00
PC1: (0.521, -0.269, 0.580, 0.565)
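The vector can also be extracted directly rather than read off the printout. A minimal sketch follows; the overall sign of a principal component is arbitrary, so another R installation could in principle report all four entries with flipped signs.

```r
iris_pca <- princomp(iris[, 1:4], cor = TRUE)

# First column of the loadings, in the order Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width
pc1 <- iris_pca$loadings[, "Comp.1"]
print(round(pc1, 3))

# Each column of the loadings is a unit vector
pc1_length_sq <- sum(pc1^2)
print(pc1_length_sq)
```

The unit length is also what the "SS loadings" row of 1.00 in the printout reports.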