
Hierarchical text clustering with BIRCH: grouping texts by content

Similar texts are clustered together, and then the single most representative item is picked from each cluster.
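
The "pick the most representative item" step can be sketched on its own, independent of the clustering algorithm. The snippet below is only an illustration under stated assumptions: the function name pick_representatives and the toy vectors are hypothetical, and the cluster center is taken to be the mean of the member vectors, whereas the full script in this post uses Birch's subcluster_centers_ instead.

```python
import numpy as np


def pick_representatives(vectors, labels):
    """For each cluster label, return the index of the row that is
    closest (Euclidean distance) to the mean of that cluster's rows."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    representatives = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)      # row indices in this cluster
        centroid = vectors[members].mean(axis=0)       # assumed center: plain mean
        distances = np.linalg.norm(vectors[members] - centroid, axis=1)
        representatives[int(label)] = int(members[np.argmin(distances)])
    return representatives


# Hypothetical toy data: three points in cluster 0, one point in cluster 1.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0], [5.0, 5.0]]
labels = [0, 0, 0, 1]
reps = pick_representatives(vectors, labels)
print(reps)
```

A cluster with a single member simply returns that member, which matches the behaviour of the script for clusters of size one.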
1. Input data:

data.png

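2. The output is shown below. There are 9 items before clustering and 6 after.
The dictionary keys are cluster labels, and each value lists the indices of the items in that cluster (line numbers in text.dat, starting from 0): {0: [0, 1, 2], 1: [3, 4], 2: [5], 3: [6], 4: [7], 5: [8]}
Items 0, 1 and 2 were grouped into one cluster, and the center point of that cluster was output: "吳亦凡陳偉霆“互懟”酷狗賽道TOP1學員壓軸來襲".
Changing the threshold parameter in Birch(threshold=0.7,n_clusters=None) adjusts the clustering result.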
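
To get a feel for how threshold changes the result, here is a minimal sketch on synthetic 2-D points. The data and the helper count_clusters are made up for illustration and are not taken from text.dat. A new sample is merged into its closest subcluster only if the merged subcluster's radius stays within the threshold, so a larger threshold tends to give fewer, broader clusters.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic points (hypothetical): two tight pairs plus one distant outlier.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [1.0, 1.0],
    [1.1, 1.0],
    [5.0, 5.0],
])


def count_clusters(threshold):
    """Fit Birch with n_clusters=None and count the distinct labels."""
    model = Birch(threshold=threshold, n_clusters=None)
    labels = model.fit_predict(points)
    return len(set(labels.tolist()))


thresholds = [0.01, 0.5, 3.0]
counts = [count_clusters(t) for t in thresholds]
for t, n in zip(thresholds, counts):
    print("threshold=%s -> %d clusters" % (t, n))
```

On this toy data the cluster count drops from 5 to 3 to 1 as the threshold grows. For TF-IDF vectors, which are L2-normalized by default, useful thresholds will usually sit in a much narrower range, so it is worth trying a few values on your own data.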

result.png

Reference:
https://blog.csdn.net/Eastmount/article/details/50473675?fps=1&locationNum=4

Source code:
https://github.com/codingMrHu/test_cluster

# coding=utf-8
import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch

'''
TF-IDF in sklearn mainly relies on two classes: CountVectorizer and TfidfTransformer.
    CountVectorizer.fit_transform converts the words in the texts into a term-frequency matrix.
    Matrix element weight[i][j] is the frequency of word j in text i, i.e. how many times the word occurs.
    get_feature_names_out() lists all the words found in the texts, and toarray() shows the term-frequency matrix.
    TfidfTransformer also has a fit_transform function, which computes the tf-idf values.
'''

class Cluster:
    def init_data(self):
        # corpus: one entry per title, with the segmented words joined by spaces
        corpus = []
        # title_dict: line index in text.dat -> original title
        self.title_dict = {}
        with open('text.dat', 'r', encoding='utf-8') as f:
            for index, line in enumerate(f):
                title = line.strip()
                self.title_dict[index] = title
                seglist = jieba.cut(title, cut_all=False)  # accurate mode
                corpus.append(' '.join(seglist))  # join the words with spaces

        # Converts the words in the texts into a term-frequency matrix; element a[i][j] is the frequency of word j in text i
        vectorizer = CountVectorizer()
        # Computes the tf-idf weight of every word
        transformer = TfidfTransformer()
        # The inner fit_transform builds the term-frequency matrix; the outer one computes tf-idf
        tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
        # All words in the bag-of-words model (get_feature_names() in scikit-learn before 1.0)
        word = vectorizer.get_feature_names_out()
        # Dense tf-idf matrix; element w[i][j] is the tf-idf weight of word j in text i
        self.weight = tfidf.toarray()
        # print(self.weight)

    def birch_cluster(self):
        print('start cluster Birch -------------------')
        self.cluster = Birch(threshold=0.6, n_clusters=None)
        self.cluster.fit_predict(self.weight)

    def get_title(self):
        # self.cluster.labels_[index] is the cluster label (an int) of the text at that index in corpus; equal values mean the same cluster
        cluster_dict = {}
        # cluster_dict: key is a Birch cluster label, value is the list of title indices in that cluster
        for index, value in enumerate(self.cluster.labels_):
            cluster_dict.setdefault(int(value), []).append(index)
        print(cluster_dict)

        print("-----before cluster Birch count title:", len(self.title_dict))
        # result_dict: key is the title closest to its cluster center, value is the number of titles in that cluster
        result_dict = {}
        for indexs in cluster_dict.values():
            latest_index = indexs[0]  # index of the title closest to the center so far
            similar_num = len(indexs)
            if len(indexs) >= 2:
                # With n_clusters=None, the labels index directly into subcluster_centers_
                center = self.cluster.subcluster_centers_[self.cluster.labels_[indexs[0]]]
                min_s = np.linalg.norm(self.weight[indexs[0]] - center)
                for index in indexs:
                    # Euclidean distance from this title's tf-idf vector to the cluster center
                    s = np.linalg.norm(self.weight[index] - center)
                    if s < min_s:
                        min_s = s
                        latest_index = index

            title = self.title_dict[latest_index]
            result_dict[title] = similar_num
        print("-----after cluster Birch count title:", len(result_dict))
        for title in result_dict:
            print(title, result_dict[title])
        return result_dict

    def run(self):
        self.init_data()
        self.birch_cluster()
        self.get_title()

if __name__ == '__main__':
    cluster = Cluster()
    cluster.run()
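
As a side note on the TF-IDF step in the script above, the sketch below checks two things on a tiny, hypothetical pre-segmented corpus (the sentences are invented and jieba is not needed to run it). First, scikit-learn's TfidfVectorizer gives the same matrix as CountVectorizer followed by TfidfTransformer. Second, the default token_pattern only keeps tokens of two or more characters, so single-character Chinese words are silently dropped unless the pattern is changed. Whether that matters for your titles is a judgment call, not something the original post addresses.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical corpus, already segmented and joined by spaces.
corpus = [
    "我 喜欢 苹果",
    "我 喜欢 香蕉",
    "他 讨厌 香蕉",
]

# Two-step form, as in the script: raw counts first, then tf-idf weights.
term_counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(term_counts).toarray()

# One-step form with the same default settings.
default_vectorizer = TfidfVectorizer()
one_step = default_vectorizer.fit_transform(corpus).toarray()
same_matrix = np.allclose(two_step, one_step)

# Default pattern drops single-character tokens such as "我" and "他".
default_vocab = set(default_vectorizer.vocabulary_)

# A pattern that also accepts one-character tokens.
single_char_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vectorizer.fit(corpus)
full_vocab = set(single_char_vectorizer.vocabulary_)

print(same_matrix)
print(sorted(full_vocab - default_vocab))
```

The same token_pattern argument is accepted by CountVectorizer, so it could be passed in init_data if single-character words should count toward similarity.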


2018-05-04 Hierarchical text clustering with BIRCH in Python
