Chapters 1 & 2: Language Processing and Python & Accessing Text Corpora and Lexical Resources
- NLTK
- concordance( ) function
- Word Sense Disambiguation & Pronoun Resolution
- Text Corpus Structure
- WordNet
1.Key:
What's NLTK?
NLTK (the Natural Language Toolkit) was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania, and it remains the first-choice tool for most newcomers to NLP research.
This book is an introduction to natural language processing with Python and can essentially be read as a handbook for the NLTK library: every method it uses comes from the nltk package. If you want to consult the API documentation or download and install NLTK, go to the official website; the API documentation there covers every module, class, and function in the toolkit, with detailed parameter descriptions and usage examples, so those details are not repeated here.
- A brief overview of NLTK's major modules and what they do:
| Language processing task | NLTK module | Functionality |
|---|---|---|
| Accessing corpora | nltk.corpus | standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | regular expressions, n-grams, named entities |
| Parsing | nltk.parse | chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | manipulate data in SIL Toolbox format |
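As a quick taste of two of these modules (nltk.tokenize and nltk.probability), here is a minimal sketch that needs no corpus downloads, only the nltk package itself:

```python
from nltk.tokenize import WhitespaceTokenizer
from nltk.probability import FreqDist

# Tokenize on whitespace (needs no extra data files), then count the tokens.
tokens = WhitespaceTokenizer().tokenize("the cat sat on the mat with the cat")
fdist = FreqDist(tokens)
print(fdist.most_common(2))  # [('the', 3), ('cat', 2)]
```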
concordance function.
- The concordance function: this is a handy nltk function that displays every occurrence of a given word (matching is case-insensitive) together with some surrounding context. Here is a usage example (text1 is one of the texts loaded by from nltk.book import *):
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
The function is implemented as follows:
def concordance(self, word, width=79, lines=25):
    """
    Print a concordance for ``word`` with the specified context window.
    Word matching is not case-sensitive.

    :seealso: ``ConcordanceIndex``
    """
    if '_concordance_index' not in self.__dict__:
        print("Building index...")
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s: s.lower())
    self._concordance_index.print_concordance(word, width, lines)
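Because concordance is a method of nltk.text.Text, it also works on your own token lists, not only the preloaded text1–text9. A small sketch:

```python
from nltk.text import Text

# Wrap any token list in a Text object to get concordance() for free.
tokens = "The cat sat on the mat . Another CAT chased the dog .".split()
t = Text(tokens)
t.concordance("cat")  # matching is case-insensitive, so both "cat" and "CAT" show up
```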
Word Sense Disambiguation & Pronoun Resolution
- Word Sense Disambiguation
Word sense disambiguation means determining which sense of a word is intended in a particular context. For example:
a. serve: help with food or drink; hold an office; put ball into play
b. dish: plate; course of a meal; communications device
- Pronoun Resolution
Pronoun resolution (anaphora resolution) answers the question "who did what to whom", i.e., it detects the subjects and objects of verbs. A closely related task is semantic role labeling: determining how a noun phrase relates to the verb (as agent, patient, instrument, and so on).
Text Corpus Structure
Some common corpus structures:
- The simplest kind of corpus is a collection of isolated texts with no particular organization;
- Some corpora are organized into categories, such as genre (Brown Corpus);
- Some categorizations overlap, such as topic categories (Reuters Corpus);
- Other corpora capture how language use changes over time (Inaugural Address Corpus).
WordNet
- Senses and Synonyms.
- Synsets and Lemmas.
- The WordNet Hierarchy.
WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific.
A fragment of the WordNet concept hierarchy: each node corresponds to a synset; edges represent the hypernym/hyponym relation, i.e., the relation between superordinate and subordinate concepts.
- Hyponyms and Hypernyms.
- Antonyms.
2.Errata in the printed book:
P19.
In the [Your Turn] box:
"Try the earlier frequency-distribution examples with text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to first enter **from nltk.book import ***."
should be corrected to:
"... you need to first enter **from nltk import ***."
Reason: the FreqDist() function is not defined in nltk.book.
P48.
In the code under [Inaugural Address Corpus]:
>>> cfd = nltk.ConditionalFreqDist(
... (target, file[:4])
... for fileid in inaugural.fileids()
should be corrected to:
>>> cfd = nltk.ConditionalFreqDist(
... (target, fileid[:4])
... for fileid in inaugural.fileids()
3.Practice:
6. ○ In the discussion of comparative wordlists, we created an object called translate, which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?
- If you look up a word that does not exist (or a word from a language whose entries were never added via translate.update(dict(...))), a KeyError is raised. One solution is to add error handling for that case.
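A minimal sketch of the failure and two fixes; the German and Italian entries below are hypothetical stand-ins for words added from the comparative wordlists:

```python
# A toy translate dictionary like the one built from the Swadesh wordlists.
translate = {}
translate.update(dict([('Hund', 'dog'), ('Katze', 'cat')]))   # German -> English
translate.update(dict([('cane', 'dog'), ('gatto', 'cat')]))   # Italian -> English

# translate['chien']  # a word that was never added would raise KeyError

# Fix 1: dict.get returns a default instead of raising.
print(translate.get('chien', 'unknown'))   # unknown

# Fix 2: explicit error handling.
try:
    word = translate['chien']
except KeyError:
    word = 'unknown'
print(word)                                # unknown
```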
8. ◑ Define a conditional frequency distribution over the Names Corpus that allows you to see which initial letters are more frequent for males versus females (see Figure 2-7).
from nltk.corpus import names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])               # name[0] is the initial letter
    for fileid in names.fileids()   # 'male.txt' and 'female.txt'
    for name in names.words(fileid))
cfd.plot()
14. ◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.
def supergloss(s):
    # s is already a Synset object, e.g. wn.synset('car.n.01'),
    # so it must not be overwritten with a fresh lookup.
    parts = [s.definition()]
    for related in s.hypernyms() + s.hyponyms():
        parts.append('%s: %s' % (related.name(), related.definition()))
    return '\n'.join(parts)
17. ◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.
def most_fifty_words(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    fdist = nltk.FreqDist(content)
    # fdist.keys() is not sorted by frequency in NLTK 3; use most_common()
    return [word for word, count in fdist.most_common(50)]
4.Open questions:
- None so far ?