前言
最進(jìn)在看分詞源碼,發(fā)現(xiàn)詞庫的存儲(chǔ)是基于Trie樹的數(shù)據(jù)結(jié)構(gòu),特此了解了下其原理。Trie樹又叫前綴樹,字典樹。Trie樹的用途:字典搜索,詞頻統(tǒng)計(jì),前綴查詢等等。原理也不復(fù)雜。
Trie 樹結(jié)構(gòu)。
假設(shè)有 '不問', '不只', '朝', '朝著','不問你'這些詞,那么如何構(gòu)建trie樹呢?直接上圖:
好處:
- 壓縮數(shù)據(jù),將例子中10個(gè)字壓縮成6個(gè)字進(jìn)行存儲(chǔ),節(jié)省空間。
- 查找前綴方便,假如要搜索‘不’開頭的詞不需要遍歷整個(gè)字典。
代碼實(shí)現(xiàn):
樹節(jié)點(diǎn):
class TreeNode:
def __init__(self, word, number = 1,isEndWord = False):
self.word = word
self.number = number#前綴次數(shù)
self.prefix_terms = set() #記錄包含此前綴的所有詞
self.child_nodes = {}
self.isEndWord = isEndWord #判斷是否是一個(gè)詞的最后個(gè)字
if isEndWord:
self.end_number = 1#詞頻次數(shù)
else:
self.end_number = 0#詞頻次數(shù)
def add_child(self, word, isEndWord, term = None):
self.prefix_terms.add(term)
if word in self.child_nodes.keys():
sub_tree_node = self.child_nodes.get(word)
sub_tree_node.prefix_terms.add(term)
sub_tree_node.number += 1
if isEndWord:
sub_tree_node.end_number += 1
else:
self.child_nodes[word] = TreeNode(word, number= 1,isEndWord = isEndWord)
self.child_nodes[word].prefix_terms.add(term)
總共包含5個(gè)成員變量,分別是當(dāng)前字符word,統(tǒng)計(jì)的前綴次數(shù)number,包含此前綴的詞匯prefix_terms,當(dāng)前字符是否是詞的結(jié)束字isEndWord,以及當(dāng)前詞的詞頻。
構(gòu)建樹
class Trie:
def __init__(self):
self.root = TreeNode("root")
def buildTree(self,term):
term = term.strip()
if len(term) == 0:
return
current_node = self.root
iter_num = len(term)
for i in range(iter_num):
if i == (iter_num - 1):
current_node.add_child(term[i], isEndWord = True, term = term)
else:
current_node.add_child(term[i], isEndWord = False, term = term)
current_node = current_node.child_nodes.get(term[i])
樹搜索:
#前綴樹查找
#前綴樹查找
def searchTree(self,term):
term = term.strip()
if len(term) == 0:
return None,0,0
current_node = self.root
iter_num = len(term)
for i in range(iter_num):
if term[i] in current_node.child_nodes.keys():
current_node = current_node.child_nodes.get(term[i])
if i == (iter_num - 1):
return current_node.prefix_terms,current_node.end_number,current_node.number
else:
return None,0,0
#####遞歸,動(dòng)態(tài)規(guī)劃思想
def searchTreeByRecursion(self, term, current_node = None):
if current_node is None:
current_node = self.root
term = term.strip()
if len(term) == 0:
return None,0,0
if term[0] in current_node.child_nodes.keys():
current_node = current_node.child_nodes.get(term[0])
if len(term) == 1:
return current_node.prefix_terms, current_node.end_number, current_node.number
return self.searchTreeByRecursion(term[1:], current_node)
else:
return None, 0, 0
寫了兩個(gè)方法,一個(gè)從上往下遍歷,一個(gè)從下往上基于遞歸。
輸出樹:
def display_tree(self, current_node = None, display_content = None):
display = ''
if current_node is None:
current_node = self.root
print([current_node.word])
display_content = ''
if not current_node.child_nodes:
print(display_content + current_node.word+'/'+str(current_node.number)+'/'+str(current_node.end_number)+'/'+str(current_node.isEndWord))
else:
for sub_node in current_node.child_nodes.values():
self.display_tree(sub_node,display_content+ current_node.word+'/'+str(current_node.number)+'/'+str(current_node.end_number)+'/'+str(current_node.isEndWord)+"--->")
這里輸出樹的每條分支,已經(jīng)各個(gè)樹節(jié)點(diǎn)的變量。
代碼已上傳百度網(wǎng)盤:
鏈接:https://pan.baidu.com/s/1vYmhrcHAIS-3oT3tLN_91Q 密碼:hsbw