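This post gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named Entity Recognition (NER) is an important foundational tool for application areas such as information extraction, question answering, syntactic parsing, and machine translation, and it plays an important role in making NLP technology practical. Generally speaking, the NER task is to identify the named entities in a text, which fall into three major classes (entity, time, and number) and seven subclasses (person names, organization names, place names, times, dates, monetary amounts, and percentages).
As a simple example, running NER on the sentence "Xiao Ming goes to school for class at 8 in the morning." should extract the following information:
Person: Xiao Ming; Time: 8 in the morning; Location: school.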
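The category scheme and the example above can be written down as plain data. The following is a minimal sketch, not a recognizer: the names `SUBCLASS_TO_CLASS`, `Entity`, and `major_class` are made up for illustration, the assignment of the seven subclasses to the three major classes follows the usual MUC-style grouping (an assumption, since the text lists the classes without spelling out the mapping), and the annotations are written by hand to mirror the expected result.

```python
from typing import NamedTuple

# Seven subclasses grouped under three major classes.
# Assumption: MUC-style grouping; the text does not give the mapping itself.
SUBCLASS_TO_CLASS = {
    "PERSON": "ENTITY",
    "ORGANIZATION": "ENTITY",
    "LOCATION": "ENTITY",
    "TIME": "TIME",
    "DATE": "TIME",
    "MONEY": "NUMBER",
    "PERCENT": "NUMBER",
}


class Entity(NamedTuple):
    """A recognized span of text together with its subclass label."""
    text: str
    label: str


def major_class(entity):
    """Return the major class that the entity's subclass belongs to."""
    return SUBCLASS_TO_CLASS[entity.label]


# Hand-written annotations for the example sentence (not produced by a model).
example_entities = [
    Entity("Xiao Ming", "PERSON"),
    Entity("8 in the morning", "TIME"),
    Entity("school", "LOCATION"),
]

for entity in example_entities:
    print(f"{entity.text}: {entity.label} ({major_class(entity)} class)")
```

An NER tool's job is to produce a list like `example_entities` automatically from raw text.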
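This post introduces a few tools for named entity recognition. Later, if the opportunity arises, we will try implementing NER with HMM, CRF, or deep learning.
First, let's look at how NLTK and Stanford NLP categorize named entities, as shown in the figure below: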

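In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION covers those as well, and can also denote famous mountains and rivers. FACILITY usually denotes well-known monuments or man-made structures.
Below we introduce two tools for the NER task: NLTK and Stanford NLP.
We start with NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows: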
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
The Python code that performs NER is as follows:
import re
import pandas as pd
import nltk

# Assumes the NLTK resources for sentence/word tokenization, POS tagging and
# NE chunking have already been installed with nltk.download().


def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # validate the type first, otherwise re.sub fails with a TypeError
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]


# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels (plain tokens are tuples without a label)
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

# get unique named entities (set order is arbitrary, so row order may vary)
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)
The output is as follows:
        Entity Name   Entity Type
0              FIFA  ORGANIZATION
1   Central America  ORGANIZATION
2           Belgium           GPE
3         Caribbean      LOCATION
4              Asia           GPE
5            France           GPE
6           Oceania           GPE
7           Germany           GPE
8     South America           GPE
9           Denmark           GPE
10           Zürich           GPE
11           Africa        PERSON
12           Sweden           GPE
13      Netherlands           GPE
14            Spain           GPE
15      Switzerland           GPE
16            North           GPE
17           Europe           GPE
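As you can see, NLTK handles the NER task reasonably well overall: it recognizes FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE. Some results are less satisfactory, though. It tags Central America as ORGANIZATION when it should be GPE, and it tags Africa as PERSON when it should also be GPE.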
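To see the distribution of labels at a glance, the pairs can be tallied by type. This is a small sketch using only the standard library; the list `nltk_entities` is copied by hand from the output table above, and the variable names are illustrative.

```python
from collections import Counter

# (entity name, entity type) pairs copied from the NLTK output table.
nltk_entities = [
    ("FIFA", "ORGANIZATION"), ("Central America", "ORGANIZATION"),
    ("Belgium", "GPE"), ("Caribbean", "LOCATION"), ("Asia", "GPE"),
    ("France", "GPE"), ("Oceania", "GPE"), ("Germany", "GPE"),
    ("South America", "GPE"), ("Denmark", "GPE"), ("Zürich", "GPE"),
    ("Africa", "PERSON"), ("Sweden", "GPE"), ("Netherlands", "GPE"),
    ("Spain", "GPE"), ("Switzerland", "GPE"), ("North", "GPE"),
    ("Europe", "GPE"),
]

# Count how many entities received each label.
type_counts = Counter(label for _, label in nltk_entities)
for label, count in type_counts.most_common():
    print(f"{label}: {count}")

# In a text about countries and continents, a PERSON label deserves a second look.
persons = [name for name, label in nltk_entities if label == "PERSON"]
print("Tagged as PERSON:", persons)
```

A tally like this makes outliers such as the single PERSON label easy to spot before checking the entities one by one.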
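Next, we try the Stanford NLP tools, specifically the Stanford NER tagger. Before using it, you need to install Java (usually the JDK) on your machine and add Java to the system path. You also need to download the English NER package, stanford-ner-2018-10-16.zip (172 MB), from https://nlp.stanford.edu/software/CRF-NER.shtml. On the author's machine, for example, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the folder extracted from the Stanford NER zip file is at E://stanford-ner-2018-10-16, as shown in the figure below: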

The classifiers folder contains the following files:

Their label sets are as follows:
3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
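Choosing a model then comes down to which labels you need. The sketch below picks the model with the smallest label set that still covers the labels requested. The 7-class file name is the one used in this post; the 3-class and 4-class file names are assumed to be the ones shipped in the classifiers folder of this release, so check them against your own download. `MODEL_LABELS` and `pick_model` are hypothetical names for illustration, not part of NLTK or Stanford NER.

```python
# Label sets per model, following the 3/4/7 class list above.
# Assumption: the 3-class and 4-class file names match your download.
MODEL_LABELS = {
    "english.all.3class.distsim.crf.ser.gz": {
        "LOCATION", "PERSON", "ORGANIZATION",
    },
    "english.conll.4class.distsim.crf.ser.gz": {
        "LOCATION", "PERSON", "ORGANIZATION", "MISC",
    },
    "english.muc.7class.distsim.crf.ser.gz": {
        "LOCATION", "PERSON", "ORGANIZATION",
        "MONEY", "PERCENT", "DATE", "TIME",
    },
}


def pick_model(required_labels):
    """Return the model file with the fewest labels that covers required_labels."""
    required = set(required_labels)
    candidates = [
        (len(labels), name)
        for name, labels in MODEL_LABELS.items()
        if required <= labels
    ]
    if not candidates:
        raise ValueError(f"No single model covers all of {sorted(required)}")
    # min() compares the label count first, so the smallest covering model wins.
    return min(candidates)[1]


print(pick_model({"LOCATION", "DATE"}))
print(pick_model({"PERSON"}))
```

For the FIFA sample, dates matter (the text mentions 1904), which is why the 7-class model is the one loaded in the code that follows.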
Stanford NER can be driven from Python. The complete code is as follows:
import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk


def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # validate the type first, otherwise re.sub fails with a TypeError
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]


# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path

# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # get terms with NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            # an 'O' tag ends the current entity, if there is one
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # keep an entity that runs to the very end of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)

# get unique named entities (set order is arbitrary, so row order may vary)
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)
The output is as follows:
                Entity Name   Entity Type
0                      1904          DATE
1                   Denmark      LOCATION
2                     Spain      LOCATION
3   North & Central America  ORGANIZATION
4             South America      LOCATION
5                   Belgium      LOCATION
6                    Zürich      LOCATION
7           the Netherlands      LOCATION
8                    France      LOCATION
9                 Caribbean      LOCATION
10                   Sweden      LOCATION
11                  Oceania      LOCATION
12                     Asia      LOCATION
13                     FIFA  ORGANIZATION
14                   Europe      LOCATION
15                   Africa      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION
As you can see, Stanford NER gives better results here: it recognizes Africa as LOCATION and 1904 as a date (DATE), which NLTK did not pick up. It still gets North & Central America wrong, though, tagging it as ORGANIZATION.
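The extraction loop in the code above joins every run of non-O tokens into one entity. A slightly more defensive variant, sketched below with `itertools.groupby`, also starts a new entity when the tag changes and keeps an entity that sits at the very end of a sentence. The function name `group_entities` and the sample input are made up for illustration; the input is hand-written in the `(token, tag)` format that `sn.tag` returns, not real tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge consecutive tokens that share the same non-O tag into entities."""
    entities = []
    # groupby yields one group per run of identical tags, in order.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            name = " ".join(token for token, _ in group)
            entities.append((name, tag))
    return entities


# Hypothetical tagged sentence: two adjacent entities with different tags,
# and an entity as the final token.
sample = [
    ("South", "LOCATION"), ("America", "LOCATION"), ("1904", "DATE"),
    ("and", "O"), ("FIFA", "ORGANIZATION"),
]

print(group_entities(sample))
```

Two different entities that share a tag and sit directly next to each other would still be merged; telling those apart needs a tagging scheme with explicit begin/inside markers, which this sketch does not cover.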
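It is worth noting that Stanford NER is not guaranteed to outperform NLTK's NER. The two may differ in their target domain, training corpus, and algorithm, so choose the tool that fits your own needs.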
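One way to inform that choice is to run both tools on a sample of your own text and compare the results side by side. The sketch below does this for the FIFA document, with both result sets copied by hand from the output tables in this post. Folding GPE into LOCATION is a simplification made here so that the two label sets line up; all variable and function names are illustrative.

```python
# Results copied by hand from the two output tables in this post.
nltk_result = {
    "FIFA": "ORGANIZATION", "Central America": "ORGANIZATION",
    "Belgium": "GPE", "Caribbean": "LOCATION", "Asia": "GPE",
    "France": "GPE", "Oceania": "GPE", "Germany": "GPE",
    "South America": "GPE", "Denmark": "GPE", "Zürich": "GPE",
    "Africa": "PERSON", "Sweden": "GPE", "Netherlands": "GPE",
    "Spain": "GPE", "Switzerland": "GPE", "North": "GPE", "Europe": "GPE",
}
stanford_result = {
    "1904": "DATE", "Denmark": "LOCATION", "Spain": "LOCATION",
    "North & Central America": "ORGANIZATION", "South America": "LOCATION",
    "Belgium": "LOCATION", "Zürich": "LOCATION", "the Netherlands": "LOCATION",
    "France": "LOCATION", "Caribbean": "LOCATION", "Sweden": "LOCATION",
    "Oceania": "LOCATION", "Asia": "LOCATION", "FIFA": "ORGANIZATION",
    "Europe": "LOCATION", "Africa": "LOCATION", "Switzerland": "LOCATION",
    "Germany": "LOCATION",
}


def normalize(label):
    """Simplification: treat NLTK's GPE as LOCATION for comparison purposes."""
    return "LOCATION" if label == "GPE" else label


# Entities found by both tools under exactly the same name.
shared = sorted(set(nltk_result) & set(stanford_result))

# Shared entities whose labels still differ after normalization.
disagreements = {
    name: (nltk_result[name], stanford_result[name])
    for name in shared
    if normalize(nltk_result[name]) != stanford_result[name]
}

# Entities (or entity boundaries) that only one tool produced.
only_nltk = sorted(set(nltk_result) - set(stanford_result))
only_stanford = sorted(set(stanford_result) - set(nltk_result))

print("Shared entities:", len(shared))
print("Label disagreements:", disagreements)
print("Only NLTK:", only_nltk)
print("Only Stanford NER:", only_stanford)
```

The entries that appear on only one side show that the tools differ in where they draw entity boundaries as well as in which labels they assign.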
That's all for this post. If the opportunity arises, a later post will try implementing named entity recognition with HMM, CRF, or deep learning.