The central technique for topic modeling is Latent Dirichlet Allocation (LDA). It assumes that the topic and word distributions observed in documents arise from hidden multinomial distributions whose parameters were sampled in advance from Dirichlet distributions. Topic modeling can be viewed as a form of clustering.
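To make that generative story concrete, here is a minimal base-R sketch of how LDA assumes a single document is produced. This is a toy illustration only, not how topicmodels fits anything; the helper rdirichlet1() and all sizes and hyperparameters are invented for the example:
> # one Dirichlet draw via normalized independent gammas
> rdirichlet1 <- function(alpha) {
+   g <- rgamma(length(alpha), shape = alpha)
+   g / sum(g)
+ }
> set.seed(1)
> k <- 5                          # number of topics
> vocab <- paste0("word", 1:20)   # toy vocabulary
> # hidden per-topic word distributions (one Dirichlet draw per topic)
> beta <- t(replicate(k, rdirichlet1(rep(0.1, 20))))
> # hidden topic mixture for one document
> theta <- rdirichlet1(rep(0.5, k))
> # each token: draw a topic from theta, then a word from that topic
> doc <- replicate(30, {
+   z <- sample(k, 1, prob = theta)
+   sample(vocab, 1, prob = beta[z, ])
+ })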
> library(pacman)
> p_load(dplyr, tm, Matrix, topicmodels)
1. Data Preparation
The bbc dataset contains 2,225 articles grouped into five topics: business, entertainment, politics, sport, and tech.
The bbcsport dataset contains 737 articles exclusively about sport; they can be grouped by the sport they describe into five classes: athletics, cricket, football, rugby, and tennis.
Each folder contains three files. The file with the .mtx extension holds the term-document matrix as a sparse matrix; the .terms file contains the actual terms, one per line, and these are the row names of the term-document matrix; similarly, the .docs file holds the column names of the term-document matrix.
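For orientation, an .mtx file is plain-text MatrixMarket format: a header line, then a line giving the matrix dimensions and the number of nonzero entries, then one (row, column, value) triplet per nonzero cell. Based on the matrix summary shown below (9635 terms, 2225 documents, 286774 nonzero entries), the start of bbc.mtx should look roughly like this (the header field and the triplet rows are illustrative, not actual file contents):
%%MatrixMarket matrix coordinate real general
9635 2225 286774
1 1 1
1 7 2
...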
> bbc_folder <- "data_set/bbc"
> bbcsport_folder <- "data_set/bbc/bbcsport"
The readMM() function from the Matrix package loads the data into a sparse matrix object, and the tm package's as.TermDocumentMatrix() converts that object into a term-document matrix.
> bbc_mat <- readMM(paste0(bbc_folder, "/bbc.mtx"))
> # the weighting argument sets the weighting scheme; here, raw term frequency (weightTf)
> bbc_tdm <- as.TermDocumentMatrix(bbc_mat, weighting = weightTf)
> bbc_tdm
## <<TermDocumentMatrix (terms: 9635, documents: 2225)>>
## Non-/sparse entries: 286774/21151101
## Sparsity : 99%
## Maximal term length: NA
## Weighting : term frequency (tf)
> bbcsport_mat <- readMM(paste0(bbcsport_folder, "/bbcsport.mtx"))
> bbcsport_tdm <- as.TermDocumentMatrix(bbcsport_mat, weighting = weightTf)
> bbcsport_tdm
## <<TermDocumentMatrix (terms: 4613, documents: 737)>>
## Non-/sparse entries: 85576/3314205
## Sparsity : 97%
## Maximal term length: NA
## Weighting : term frequency (tf)
Add row names (terms) and column names (documents) to the bbc matrix:
> bbc_rownames <- readr::read_table(paste0(bbc_folder, "/bbc.terms"), col_names = F)
> head(bbc_rownames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 ad
## 2 sale
## 3 boost
## 4 time
## 5 warner
## 6 profit
> bbc_colnames <- readr::read_table(paste0(bbc_folder, "/bbc.docs"), col_names = F)
> head(bbc_colnames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 business.001
## 2 business.002
## 3 business.003
## 4 business.004
## 5 business.005
## 6 business.006
> # strip the digits and the dot from the document names
> rm_fun <- function(string) {
+ string <- stringr::str_remove_all(string, "\\d+|\\.")
+ return(string)
+ }
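A quick check (illustrative) shows the regex removing every digit and every dot:
> rm_fun("business.001")   # returns "business"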
>
> bbc_colnames$X1 <- bbc_colnames$X1 %>%
+ lapply(rm_fun) %>%
+ unlist()
>
> bbc_tdm$dimnames$Terms <- bbc_rownames$X1
> bbc_tdm$dimnames$Docs <- bbc_colnames$X1
>
> # transpose the TDM into a document-term matrix (DTM)
> bbc_dtm <- t(bbc_tdm)
> bbc_dtm
## <<DocumentTermMatrix (documents: 2225, terms: 9635)>>
## Non-/sparse entries: 286774/21151101
## Sparsity : 99%
## Maximal term length: 24
## Weighting : term frequency (tf)
> bbcsport_rownames <- readr::read_table(paste0(bbcsport_folder, "/bbcsport.terms"), col_names = F)
> head(bbcsport_rownames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 claxton
## 2 hunt
## 3 first
## 4 major
## 5 medal
## 6 british
> bbcsport_colnames <- readr::read_table(paste0(bbcsport_folder, "/bbcsport.docs"), col_names = F)
> head(bbcsport_colnames)
## # A tibble: 6 x 1
## X1
## <chr>
## 1 athletics.001
## 2 athletics.002
## 3 athletics.003
## 4 athletics.004
## 5 athletics.005
## 6 athletics.006
> bbcsport_colnames$X1 <- bbcsport_colnames$X1 %>%
+ lapply(rm_fun) %>%
+ unlist()
>
> bbcsport_tdm$dimnames$Terms <- bbcsport_rownames$X1
> bbcsport_tdm$dimnames$Docs <- bbcsport_colnames$X1
>
> # transpose the TDM into a document-term matrix (DTM)
> bbcsport_dtm <- t(bbcsport_tdm)
> bbcsport_dtm
## <<DocumentTermMatrix (documents: 737, terms: 4613)>>
## Non-/sparse entries: 85576/3314205
## Sparsity : 97%
## Maximal term length: 17
## Weighting : term frequency (tf)
> table(bbc_colnames$X1)
##
## business entertainment politics sport tech
## 510 386 417 511 401
> table(bbcsport_colnames$X1)
##
## athletics cricket football rugby tennis
## 101 124 265 147 100
The documents are spread fairly evenly across the topics. In the bbcsport dataset, football articles are roughly twice as numerous as those on any other sport.
2. Modeling
Build four different topic models:
LDA_VEM: an LDA model trained with variational expectation maximization (VEM); it estimates the Dirichlet parameter vector alpha automatically.
LDA_VEM_a: an LDA model trained with VEM, with the difference that it does not estimate the Dirichlet parameter vector alpha.
LDA_GIB: an LDA model trained with Gibbs sampling.
CTM_VEM: an implementation of the Correlated Topic Model (CTM), trained with VEM.
> # k: number of topics; topic_seed: seed for the RNG used internally during training
> compute_model <- function(k, topic_seed, dtm) {
+ LDA_VEM <- LDA(dtm, k = k, control = list(seed = topic_seed))
+ LDA_VEM_a <- LDA(dtm, k = k, control = list(seed = topic_seed,
+ estimate.alpha = F))
+ # burnin: number of initial Gibbs iterations to discard
+ # thin: number of iterations skipped between retained draws
+ # iter: total number of iterations
+ LDA_GIB <- LDA(dtm, k = k, method = "Gibbs", control = list(
+ seed = topic_seed, burnin = 1000, thin = 100, iter = 1000
+ ))
+ CTM_VEM <- CTM(dtm, k = k, control = list(
+ seed = topic_seed, var = list(tol = 10 ^ -4),
+ em = list(tol = 10 ^ -3)))
+ return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a,
+ LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
+ }
> bbc_models <- compute_model(k = 5, topic_seed = 123, bbc_dtm)
> bbcsport_models <- compute_model(k = 5, topic_seed = 123, bbcsport_dtm)
Inspect the results the LDA_VEM model produces on the BBC dataset:
> # topics() returns, for each document, its most likely topic
> table(topics(bbc_models$LDA_VEM), bbc_colnames$X1)
##
## business entertainment politics sport tech
## 1 225 13 32 3 116
## 2 260 15 2 0 259
## 3 1 352 2 159 10
## 4 21 6 381 3 5
## 5 3 0 0 346 11
Topics 3, 4, and 5 correspond almost exactly to the entertainment, politics, and sport categories, but topics 1 and 2 are both mixtures of business and tech, so this model has not cleanly separated the classes we are after.
Now inspect the LDA_GIB model:
> table(topics(bbc_models$LDA_GIB), bbc_colnames$X1)
##
## business entertainment politics sport tech
## 1 5 364 3 2 13
## 2 9 6 1 0 371
## 3 472 2 8 1 5
## 4 0 0 4 505 3
## 5 24 14 401 3 9
This model does much better than the previous one.
A topic model's overall accuracy = (the sum of the per-row maxima) / (the total number of documents); the maximum entry in each row marks the gold-standard topic assigned to the model topic that the row represents.
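For the LDA_GIB table above, the row maxima are 364, 371, 472, 505, and 401, so the overall accuracy is (364 + 371 + 472 + 505 + 401) / 2225 = 2113 / 2225 ≈ 0.9497, matching the value computed below.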
Build a function that computes this overall accuracy:
> compute_acc <- function(model, gold_type) {
+ model_topics <- topics(model)
+ model_table <- table(model_topics, gold_type)
+ # take the maximum of each row
+ model_match <- apply(model_table, 1, max)
+ # accuracy = sum of row maxima / total number of documents
+ model_acc <- sum(model_match) / sum(model_table)
+ return(model_acc)
+ }
Compute the accuracy of every model:
> sapply(bbc_models, FUN = function(x) compute_acc(x, bbc_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.7029 0.6854 0.9497 0.7425
> sapply(bbcsport_models, FUN = function(x) compute_acc(x, bbcsport_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.7802 0.7720 0.8820 0.7951
On both datasets, the LDA_GIB model (Gibbs sampling) beats all the other models.
Another way to assess the quality of the model fit is to compute the log-likelihood of the data; larger values are better:
> sapply(bbc_models, logLik)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## -3205967 -3276286 -3017840 -3235383
> sapply(bbcsport_models, logLik)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## -864136 -886482 -814866 -870134
The conclusion is the same: LDA_GIB is the best model on both datasets.
3. Model Stability
The models above all used the same random seed, but a different seed could well produce different results. To check which model behaves most stably, set the nstart parameter during training; it specifies the number of random restarts used by the optimization procedure.
Using random restarts increases training time accordingly.
> compute_model_r <- function(k, topic_seed, dtm, nstart) {
+ # one seed per random restart
+ seed_range <- topic_seed:(topic_seed + nstart - 1)
+ LDA_VEM <- LDA(dtm, k = k, control = list(seed = seed_range, nstart = nstart))
+ LDA_VEM_a <- LDA(dtm, k = k, control = list(seed = seed_range,
+ estimate.alpha = F,
+ nstart = nstart))
+ LDA_GIB <- LDA(dtm, k = k, method = "Gibbs", control = list(
+ seed = seed_range, burnin = 1000, thin = 100,
+ iter = 1000, nstart = nstart
+ ))
+ # note: CTM_VEM keeps the single original seed and is not restarted
+ CTM_VEM <- CTM(dtm, k = k, control = list(
+ seed = topic_seed, var = list(tol = 10 ^ -4),
+ em = list(tol = 10 ^ -3)))
+ return(list(LDA_VEM = LDA_VEM, LDA_VEM_a = LDA_VEM_a,
+ LDA_GIB = LDA_GIB, CTM_VEM = CTM_VEM))
+ }
>
> bbc_models <- compute_model_r(k = 5, topic_seed = 123, bbc_dtm, nstart = 5)
> bbcsport_models <- compute_model_r(k = 5, topic_seed = 123, bbcsport_dtm, nstart = 5)
>
> sapply(bbc_models, FUN = function(x) compute_acc(x, bbc_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.9483 0.7762 0.7627 0.7425
> sapply(bbcsport_models, FUN = function(x) compute_acc(x, bbcsport_colnames$X1))
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.8996 0.8887 0.6499 0.7951
With five random restarts the VEM-trained LDA models improve markedly (on bbc, LDA_VEM jumps from 0.70 to 0.95), although LDA_GIB happens to fare worse with these seeds. CTM_VEM scores exactly as before, which is expected: the CTM() call above still uses the single original seed and no restarts.
4. Topic Distributions
The posterior() function returns each model's per-document topic distribution:
> options(digits = 4)
> head(posterior(bbc_models[[1]])$topics)
## 1 2 3 4 5
## business 0.0002189 0.0356775 0.7043 0.0002189 0.2595903
## business 0.1073955 0.0002508 0.8797 0.0002508 0.0123840
## business 0.0003209 0.0003209 0.9987 0.0003209 0.0003209
## business 0.0002121 0.0418897 0.9575 0.0002121 0.0002121
## business 0.0004043 0.0192081 0.9796 0.0004043 0.0004043
## business 0.0004669 0.1748144 0.8238 0.0004669 0.0004669
> p_load(ggplot2)
>
> posterior(bbc_models[[1]])$topics %>%
+ # take the largest topic probability in each row
+ apply(1, max) %>%
+ # collect the vector into a tibble for ggplot
+ tibble(topics = as.numeric(.)) %>%
+ ggplot(aes(topics)) +
+ geom_histogram(binwidth = 0.02) +
+ labs(x = "Probability of the most likely topic", y = "Count",
+ title = "Histogram of maximum topic probabilities (LDA_VEM)") +
+ theme_bw()
[Figure: histogram of each document's maximum topic probability under the LDA_VEM model]
Another way to gauge how smooth the topic distributions are is to compute the model's entropy: the mean entropy of the per-document topic distributions. A smooth (flat) distribution has higher entropy than a peaked one.
> # Shannon entropy of a probability vector
> compute_entropy <- function(x) {
+ return(- sum(x * log(x)))
+ }
>
> # mean entropy across all documents of a model
> compute_mean_entropy <- function(model) {
+ topics <- posterior(model)$topics
+ return(mean(apply(topics, 1, compute_entropy)))
+ }
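As a quick illustration of the entropy scale (these two calls are sanity checks added here, not part of the original analysis): a uniform distribution over the five topics attains the maximum possible entropy, log(5) ≈ 1.609, while a nearly certain assignment scores much lower:
> compute_entropy(rep(0.2, 5))                       # log(5) ≈ 1.609
> compute_entropy(c(0.96, 0.01, 0.01, 0.01, 0.01))   # ≈ 0.22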
>
> # mean entropy of each model on the two datasets
> sapply(bbc_models, compute_mean_entropy)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.3061 1.2777 1.3189 0.8350
> sapply(bbcsport_models, compute_mean_entropy)
## LDA_VEM LDA_VEM_a LDA_GIB CTM_VEM
## 0.2838 1.2953 1.3699 0.7221
Judged by these numbers, CTM_VEM produces noticeably smoother distributions than LDA_VEM, though LDA_GIB and LDA_VEM_a are smoother (higher-entropy) still; LDA_VEM's distributions are by far the most peaked.
5. Word Distributions
Inspect the ten most important words in each topic of the LDA_GIB model:
> terms(bbc_models[[3]], 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "govern" "on" "plai" "peopl" "year"
## [2,] "labour" "work" "film" "game" "compani"
## [3,] "parti" "want" "best" "music" "sale"
## [4,] "elect" "includ" "win" "servic" "market"
## [5,] "minist" "world" "game" "technolog" "firm"
## [6,] "plan" "time" "first" "mobil" "2004"
## [7,] "peopl" "london" "award" "phone" "share"
## [8,] "public" "show" "year" "get" "expect"
## [9,] "blair" "week" "england" "on" "month"
## [10,] "sai" "year" "star" "tv" "bank"
Compute each term's corpus-wide frequency from the term-document matrix:
> # convert the sparse term-document matrix to a dense matrix
> bbc_tdm_mat <- as.matrix(bbc_tdm)
> bbc_tdm_mat[1:5, 10:16]
## Docs
## Terms business business business business business business business
## ad 0 1 0 0 1 2 0
## sale 0 1 0 5 1 0 1
## boost 0 0 0 2 1 0 0
## time 0 0 1 0 2 0 0
## warner 0 0 0 0 0 0 0
> # row sums give each term's total frequency across all documents
> # coerce to plain numeric to drop the names attribute, which would
> # otherwise cause an error when plotting
> numTerms <- as.numeric(rowSums(bbc_tdm_mat))
> head(numTerms)
## [1]  851  736  222 1487   26  315
> # assemble a term-frequency data frame
> bbc_tf <- tibble(Terms = rownames(bbc_tdm_mat),
+ numTerms = numTerms)
> head(bbc_tf)
## # A tibble: 6 x 2
## Terms numTerms
## <chr> <dbl>
## 1 ad 851
## 2 sale 736
## 3 boost 222
## 4 time 1487
## 5 warner 26
## 6 profit 315
A word cloud of key terms:
> p_load(wordcloud2)
>
> # word cloud of the top 25 terms in topic 3 of the third model (LDA_GIB)
> terms(bbc_models[[3]], 25) %>%
+ as_tibble() %>%
+ .[, 3] %>%
+ setNames("Terms") %>%
+ left_join(bbc_tf, by = "Terms") %>%
+ wordcloud2()
[Figure: word cloud of the top 25 terms for topic 3 of the LDA_GIB model]