參考:《文本數(shù)據(jù)挖掘》
1、相似度計(jì)算
p_load(stringdist)
# 越接近1相似度越高
stringsim("hello", "Hello", method = "lv")
## [1] 0.8
# 通過(guò)dist函數(shù)求iris數(shù)據(jù)集前四列的相互距離
# 默認(rèn)為歐式距離
dist(t(iris[, 1:4]))
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Width 36.15785
## Petal.Length 28.96619 25.77809
## Petal.Width 57.18304 25.86407 33.86473
# 利用Pearson相關(guān)系數(shù)來(lái)表征不同變量之間的相似度
p_load(apcluster)
corSimMat(t(iris[, 1:4]))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
2、 聚類方法
劃分聚類法:k-means聚類法、k-medoids聚類法等
層次聚類法:合成法(Agglomerative Clustering)和分割法(Divise Clustering)
p_load(tokenizers)
str_vec <- df$sku_name[1:5] %>%
paste0(collapse = " ") %>%
tokenize_words(strip_punct = T, strip_numeric = T,
simplify = T)
# Levenshtein距離計(jì)算
d <- adist(str_vec)
# 如果量綱不一致或極差很大,還需要提前中心化和標(biāo)準(zhǔn)化
# d_scale <- scale(d)
2.1 K-means聚類
# 確定分類數(shù)量
p_load(factoextra)
# method則設(shè)定了最小化損失函數(shù)的計(jì)算方法
fviz_nbclust(d, kmeans, method = "wss")

確定分類數(shù)量
k=3以后就很難減少損失函數(shù),因此設(shè)定k=3。
km <- kmeans(d, centers = 3)
# 查看分類
km$cluster
## [1] 1 2 2 1 3 2 2 1 2 2 1 3 2 1 1 1 1 2 2 1 1 3 1 3 3 1 2 2 3 2 2 2 1 2 2 1 2 2 2 3 2 2 2 2
## [45] 2 2 2 2 2 1 2 2 1 2 2 2 2 3 2 1 1 3 1 2 2 2 2 1 1 2 1 2 2 1 3 2 1 1 2
# 查看字符串分別屬于哪個(gè)類
cbind(class = km$center, string = str_vec)
# 基于PCA的可視化方法呈現(xiàn)分類結(jié)果
fviz_cluster(km,
data = d,
# 橢圓
ellipse.type = "euclid",
# 防止標(biāo)注交疊
repel = T,
# 繪制主題
ggtheme = theme_minimal())

基于PCA方法呈現(xiàn)分類結(jié)果
2.2 PAM算法
可以緩解kmeans聚類的缺點(diǎn)。
p_load(cluster)
# pam算法進(jìn)行最佳聚類數(shù)判斷
fviz_nbclust(d, pam, method = "silhouette")

確定最佳分類數(shù)
可以看到最佳聚類數(shù)為2。
pam <- pam(d, 2)
cbind(class = pam$clustering, string = str_vec)
## class string
## [1,] "1" "1000pcs"
## [2,] "1" "32mm"
## [3,] "1" "0.5ml"
## [4,] "1" "plastic"
## [5,] "2" "centrifuge"
## [6,] "1" "tube"
## [7,] "1" "test"
## [8,] "1" "tubing"
## [9,] "1" "vial"
## [10,] "1" "clear"
## [11,] (省略。。。)
# 聚類效果
fviz_cluster(pam, ellipse.type = "euclid",
repel = T,
ggtheme = theme_classic())

聚類效果
# 使用fpc包高效實(shí)現(xiàn)k-medoids方法
p_load(fpc)
# 設(shè)置K取值范圍為1到10,返回最佳聚類結(jié)果
pam2 <- pamk(d, krange = 1:10)
pam2$nc
## [1] 2
PAM算法雖然解決了很多問(wèn)題,但是它在處理大數(shù)據(jù)集的時(shí)候,對(duì)計(jì)算機(jī)內(nèi)存要求很高,而且耗費(fèi)時(shí)間也比較長(zhǎng)。為了解決這個(gè)問(wèn)題,CLARA(Clustering Large Applications)算法被提了出來(lái)。
2.3 CLARA算法
# 判斷最佳聚類數(shù)
fviz_nbclust(d, clara, method = "silhouette") +
theme_classic()

確定最佳分類數(shù)
# 聚類分析
clara_res <- clara(d, 2,
# 設(shè)定子集大小
samples = 50, pamLike = T)
# 顯示聚類結(jié)果
cbind(class = clara_res$clustering, string = str_vec)
## class string
## [1,] "1" "1000pcs"
## [2,] "2" "32mm"
## [3,] "2" "0.5ml"
## [4,] "1" "plastic"
## [5,] "1" "centrifuge"
## [6,] "2" "tube"
## [7,] "2" "test"
## [8,] "2" "tubing"
## [9,] "2" "vial"
## (省略。。。)
# 可視化展示
fviz_cluster(clara_res, ellipse.type = "euclid",
repel = T, ggtheme = theme_classic())

2.4 層次聚類法
2.4.1 合成法
p_load(cluster)
# 為距離矩陣的行進(jìn)行命名,方便顯示結(jié)果
rownames(d) <- str_vec
# 聚類
# stand參數(shù)控制是否進(jìn)行標(biāo)準(zhǔn)化(默認(rèn)為FALSE),用metric參數(shù)控制樣本距離的計(jì)算方法(默認(rèn)為“euclidean”,即歐式距離)
# 用method參數(shù)設(shè)置聚類方法(默認(rèn)為“average”)
res_agnes <- agnes(d)
# 查看分類結(jié)果
res_agnes
## Call: agnes(x = d)
## Agglomerative coefficient: 0.8408839
## Order of objects:
## [1] 1000pcs 1000pcs plastic plastic plastic plastic nonstick
## [8] 32mm 22mm 33mm 50ml 26ml 0.5ml 0.2ml
## [15] tube tube style size hinge with mini
## [22] tin test case pcr dab jars jars
## [29] home cork gold lot x box box
## [36] box boxes vial vials vials small small
## [43] glass glass clear candy zakka empty metal
## [50] 12pcs 10pcs tubing design wedding garden casket
## [57] silver novelty newest bottles bottles bottles storage
## [64] storage storage storage protable centrifuge centrifuge container
## [71] container containers organizer gardening capacity silicone households
## [78] transparent transparent
## Height (summary):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 8.570 8.552 12.258 31.678
##
## Available components:
## [1] "order" "height" "ac" "merge" "diss" "call" "method"
## [8] "order.lab" "data"
通過(guò)agnes函數(shù)求得的是樣本之間的親疏關(guān)系,而沒(méi)有直接進(jìn)行分類。如果要進(jìn)行分類,可以指定分類的數(shù)量,然后用cutree函數(shù)實(shí)現(xiàn)
group_info <- cutree(res_agnes, k = 2)
group_info
## [1] 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1
## [45] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1
# 不進(jìn)行分類
fviz_dend(res_agnes)

為分類結(jié)果
# 進(jìn)行分類
fviz_dend(res_agnes, k = 2,
# 標(biāo)志大小
cex = 0.5,
# 設(shè)定類別顏色
k_colors = c("#FC4E07", "#00AFBB"),
# 設(shè)定標(biāo)志顏色
color_labels_by_k = T,
# 設(shè)定矩形邊框
rect = T)

分類后結(jié)果
# 使用PCA方法對(duì)結(jié)果進(jìn)行可視化
fviz_cluster(list(data = as_tibble(d), cluster = group_info),
palette = c("#FC4E07", "#00AFBB"),
ellipse.type = "convex",
repel = T,
show.clust.cent = F,
ggtheme = theme_minimal())

PCA方法可視化
2.4.2 分割法
只需要將agnes函數(shù)換為diana函數(shù)即可。