The dataset [user_duration] records app usage duration and launch count for 1,000 users. The goal is to run a cluster analysis on this dataset.
R implementation:
library(cluster)
library(factoextra)
# Load the data
user_duration=read.csv('/user_duration.csv')
# Preview the first rows
head(user_duration)
>>> Result
   duration launch_cnt
1 3681.6350          4
2  488.7920          3
3  504.2426          3
4   40.2190          4
5  163.1598         14
6  716.0249          1
# Inspect the summary statistics
summary(user_duration)
>>> Result
    duration          launch_cnt
 Min.   :   0.019   Min.   : 1.000
 1st Qu.: 102.964   1st Qu.: 1.000
 Median : 300.020   Median : 3.000
 Mean   : 629.259   Mean   : 4.601
 3rd Qu.: 737.318   3rd Qu.: 5.000
 Max.   :4924.707   Max.   :34.000
# Standardize the dataset (the two columns are on very different scales)
use_z=scale(user_duration)
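As a minimal sketch of what scale() does by default, the block below compares it with a hand-computed z-score, (x - mean) / sd, applied to each column. The toy data frame is hypothetical and only stands in for user_duration.

```r
# Hypothetical toy data standing in for user_duration (values are illustrative only)
toy <- data.frame(duration = c(40.2, 163.2, 488.8, 716.0, 3681.6),
                  launch_cnt = c(4, 14, 3, 1, 4))

# scale() centers each column on its mean and divides by its standard deviation
toy_z <- scale(toy)

# The same z-scores computed by hand, column by column
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Compare the raw values only, ignoring attributes such as dimnames
same <- isTRUE(all.equal(as.vector(toy_z), as.vector(manual_z)))
print(same)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(toy_z), 10))
print(apply(toy_z, 2, sd))
```

Without this step, duration (values in the thousands) would dominate the Euclidean distances that k-means relies on, and launch_cnt would have almost no influence.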
# Plot the standardized dataset
plot(use_z)
The resulting plot is shown in Figure 1 below:
[Figure 1: scatter plot of the standardized dataset]
# k-means needs the number of clusters to be specified in advance,
# so use clusGap() to compute the gap statistic and estimate the optimal number of clusters.
gap_user=clusGap(use_z,FUN=kmeans,nstart=25,K.max=10,B=500)
>>> Result
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 500) [one "." per sample]:
.................................................. 50
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
.................................................. 450
.................................................. 500
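The optimal k can also be read numerically instead of from a plot. The following is a sketch on hypothetical synthetic data, with K.max and B reduced only to keep it fast. It uses maxSE() from the cluster package with the "firstSEmax" rule, which to my knowledge is also the default rule in fviz_gap_stat(); treat that as an assumption and check your installed version.

```r
library(cluster)

set.seed(42)  # clusGap() draws random reference data, so fix the seed

# Hypothetical synthetic data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))

# Smaller K.max and B than in the article, only to keep the sketch fast
gap_toy <- clusGap(toy, FUNcluster = kmeans, nstart = 10,
                   K.max = 5, B = 20, verbose = FALSE)

# Tab holds one row per k with columns logW, E.logW, gap and SE.sim
print(gap_toy$Tab)

# maxSE() applies a selection rule to the gap curve and returns the chosen k
best_k <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
                method = "firstSEmax")
print(best_k)
```

Different rules ("globalmax", "Tibs2001SEmax", and so on) can pick different values of k on the same gap curve, so it is worth stating which rule was used when reporting the result.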
# Visualize the gap statistic to find the optimal number of clusters
fviz_gap_stat(gap_user)
The figure below shows that the optimal number of clusters is 2.
[Figure: gap statistic by number of clusters]
# Run k-means with the optimal number of clusters
user_km=kmeans(use_z,2,nstart=25)
# Visualize the k-means clustering result
fviz_cluster(user_km, data = user_duration)
Clustering result:
[Figure: k-means clustering result]
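Because the model is fitted on standardized data, the cluster centers come out in z-score units. As a sketch under that assumption, the block below converts them back to the original units using the attributes that scale() stores. The toy data is hypothetical.

```r
set.seed(1)

# Hypothetical toy data standing in for user_duration
toy <- data.frame(duration = c(rnorm(30, 300, 50), rnorm(10, 3000, 300)),
                  launch_cnt = c(rnorm(30, 3, 1), rnorm(10, 15, 2)))

toy_z <- scale(toy)
toy_km <- kmeans(toy_z, centers = 2, nstart = 25)

# Cluster sizes and centers, the latter in standardized (z-score) units
print(toy_km$size)
print(toy_km$centers)

# scale() stores the column means and standard deviations as attributes
mu <- attr(toy_z, "scaled:center")
sigma <- attr(toy_z, "scaled:scale")

# Undo the standardization: original = z * sd + mean, column by column
centers_orig <- sweep(sweep(toy_km$centers, 2, sigma, "*"), 2, mu, "+")
print(centers_orig)
```

The back-transformed centers are simply the per-cluster means of the original columns, which makes them easier to interpret (for example, average duration and average launch count per cluster).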
# Assess the quality of the clustering
sil=silhouette(user_km$cluster,dist(use_z))
fviz_silhouette(sil)
>>> Result
  cluster size ave.sil.width
1       1  122          0.22
2       2  878          0.78
Silhouette plot:
[Figure: silhouette plot]
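The silhouette width of a point lies between -1 and 1: values near 1 indicate a point well inside its cluster, and negative values indicate a point closer to a neighboring cluster. In the result above, the smaller cluster (122 users, 0.22) is much less cohesive than the larger one (878 users, 0.78). The same numbers can be pulled out programmatically; the sketch below does so on hypothetical toy data.

```r
library(cluster)

set.seed(1)

# Hypothetical toy data: two groups of different size
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
toy_km <- kmeans(toy, centers = 2, nstart = 25)

sil_toy <- silhouette(toy_km$cluster, dist(toy))
sil_sum <- summary(sil_toy)

# Average silhouette width per cluster and over all points
print(sil_sum$clus.avg.widths)
print(sil_sum$avg.width)

# Points with a negative width sit closer to the other cluster than to their own
n_negative <- sum(sil_toy[, "sil_width"] < 0)
print(n_negative)
```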
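# Attach the cluster assignments to the original data
# (naming the argument avoids a column literally called "user_km$cluster")
user_duration2=cbind(cluster=user_km$cluster,user_duration)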
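Once the labels are attached, a natural next step is to profile each cluster. This is a minimal sketch with hypothetical stand-ins for user_duration and user_km$cluster; the column name cluster is an assumption made for the example.

```r
# Hypothetical stand-ins for user_duration and user_km$cluster
toy <- data.frame(duration = c(40.2, 163.2, 488.8, 3681.6, 2950.0),
                  launch_cnt = c(4, 3, 3, 14, 20))
toy_cluster <- c(2, 2, 2, 1, 1)

# Naming the argument gives the new column a readable name
toy2 <- cbind(cluster = toy_cluster, toy)

# Per-cluster means of every remaining column
profile <- aggregate(. ~ cluster, data = toy2, FUN = mean)
print(profile)

# Number of users in each cluster
print(table(toy2$cluster))
```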