
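I. Overview of Natural Language Processing: What Is Natural Language Processing (NLP)?

1) Related technologies and applications

Question Answering (QA): a computing system that can understand complex questions and give answers with sufficient accuracy, confidence, and speed; IBM's Watson is a representative example.

Information Extraction (IE): converts unstructured or semi-structured natural-language text into structured data, for example automatically creating a Calendar entry from the content of an email.

Sentiment Analysis (SA): also called orientation analysis or opinion mining; the process of analyzing, processing, summarizing, and reasoning about subjective text that carries sentiment, for example determining from a large volume of web text how users feel about attributes of a "digital camera" such as zoom, price, size, weight, flash, and ease of use.

Machine Translation (MT): converts text from one language into another, for example Chinese-English machine translation.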

… …

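2) Current state of the field

Mostly solved: part-of-speech tagging, named entity recognition, spam detection

Making good progress: sentiment analysis, coreference resolution, word sense disambiguation, parsing, machine translation, information extraction

Still challenging: question answering, paraphrase, summarization, conversational agents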


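3) The main difficulty in NLP: ambiguity

Ambiguity in lexical analysis

Word segmentation: the sentence "严守一把手机关了" can be segmented as "严守一/ 把/ 手机/ 关/ 了" (Yan Shouyi turned off the phone) or as "严守/ 一把手/ 机关/ 了" (strictly guard / top leader / government office / particle).

Part-of-speech tagging: the word "计划" takes different parts of speech in different contexts: "我/ 计划/v 考/ 研/" (I plan to take the graduate entrance exam) versus "我/ 完成/ 了/ 计划/n" (I completed the plan).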
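The segmentation ambiguity above can be reproduced with a small sketch that enumerates every dictionary-consistent split of a string. The function name `segmentations` and the toy dictionary `VOCAB` are assumptions made for illustration; a real segmenter would use a full lexicon and a statistical model to rank the candidates.

```python
from typing import List, Set


def segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results


# Toy dictionary (hypothetical, just large enough for this sentence).
VOCAB = {"严守一", "严守", "一把手", "把", "手机", "机关", "关", "了"}

for seg in segmentations("严守一把手机关了", VOCAB):
    print("/ ".join(seg))
```

With this dictionary the sketch finds exactly the two readings given in the text; choosing between them needs context or corpus statistics.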

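Ambiguity in syntactic analysis

The string "咬死了猎人的狗" is parsed differently in these two sentences:

"那只狼咬死了猎人的狗" (That wolf bit the hunter's dog to death)

"咬死了猎人的狗失踪了" (The dog that bit the hunter to death has gone missing)

Ambiguity in semantic analysis

The sentence "At last, a computer that understands you like your mother" can be read in several ways, which matters for applications such as machine translation:

The computer understands you (your language) as well as your mother does

The computer understands that you like your mother

The computer understands you as well as it understands your mother

Ambiguity in NLP applications

Pinyin-to-character conversion: in the pinyin string "ji qi fan yi ji qi ying yong ji qi le ren men ji qi nong hou de xing qu", each occurrence of "ji qi" has to be converted to the correct word (the intended sentence is 机器翻译及其应用激起了人们极其浓厚的兴趣, in which "ji qi" is in turn 机器, 及其, 激起, and 极其).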

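4) Why is natural language understanding so hard?

User-generated content is full of non-standard language: colloquialisms, idioms, dialects, and so on

Word segmentation

New words appear constantly

Common-sense and contextual knowledge

Entity names of every kind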

… …

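To tackle these problems we need a solid grasp of linguistics, knowledge-base resources, and a way to combine the various kinds of knowledge and resources. The most widely used approach today is the probabilistic model, also called the statistical model or the "empiricist" model. It is built from a large corpus of real text: statistics are gathered for language units at each level, and statistical and inference techniques are then applied to the statistics of lower-level units to compute those of higher-level units. Its counterpart is the "rationalist" model, a deterministic language model based on Chomsky's formal languages. It rests on the assumption that grammar rules are innate in the human brain and that language is derived from the brain's language faculty, so building a language model means writing a hand-crafted set of linguistic rules that imitates this innate faculty.

This course focuses on statistical NLP techniques such as Viterbi, Bayesian and maximum entropy classifiers, and N-gram language models.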

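II. Sentiment Analysis

1) What is Sentiment Analysis?

Sentiment analysis, also called orientation analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries sentiment, for example determining from review text how users feel about attributes of a "digital camera" such as zoom, price, size, weight, flash, and ease of use.

More examples: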

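• Identifying from movie reviews whether users rate a movie positively or negatively.

• Google Product Search identifies users' opinions of the various attributes of a product and selects representative reviews to show to users.

• Bing Shopping identifies users' opinions of the various attributes of a product.

• Twitter sentiment versus Gallup Poll of Consumer Confidence: mining user sentiment on Twitter (Chinese counterpart: Weibo) shows that it agrees closely with the results of traditional surveys and polls (taking consumer confidence and political elections as examples, the correlation reaches 80%). For details see: Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In ICWSM-2010. (Note: in the accompanying figure, the trough in online sentiment from 2008 to early 2009 was caused by the financial crisis; sentiment slowly recovered from May 2009.)

• Twitter sentiment: predicting stock movements from the sentiment of Twitter users. In May 2012, Derwent Capital Markets, the world's first hedge fund based on social media, finally launched after repeated delays. It tracks public mood on Twitter in real time to guide investment. As the fund's founder Paul Hawtin put it: "For a long time, investors have widely accepted that financial markets are driven by fear and greed, but we have never had a technology or data to quantify people's emotions." Investors long puzzled by the irrational behavior of financial markets finally had a window into that inner world: the vast stream of tweets posted on Twitter every day. According to a report in August, Derwent Capital Markets was already profitable in its first month of trading, with a return of 1.85% against an average of only 0.76% for other hedge funds. Similar work predicts box-office takings, election results, and so on: in each case public mood is compared with social events, a correspondence is found, and it is then used for prediction. For example, the "Calm" mood index, shifted by 3 days, tracks the Dow Jones Industrial Average (DJIA) strikingly closely. For details see:

Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2:1, 1-8.

• Target Sentiment on Twitter (Twitter Sentiment App): classifies the sentiment of tweets that contain a given query. Companies can use it to learn how users feel about the company and its products and so improve products and services, and to identify competitors' strengths and weaknesses; users can decide whether to buy a particular product based on the opinions of other users or even of friends and family. For details see: Alec Go, Richa Bhayani, Lei Huang. 2009. Twitter Sentiment Classification using Distant Supervision.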

What is sentiment analysis good for? Some practical applications make it concrete:

• Movie: is this review positive or negative?

• Products: what do people think about the new iPhone?

• Public sentiment: how is consumer confidence? Is despair increasing?

• Politics: what do people think about this candidate or issue?

• Prediction: predict election outcomes or market trends from sentiment

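The main goal of sentiment analysis is to identify people's views of and attitudes toward things or persons (attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons). The elements involved are:

Holder (source) of attitude: the person who holds the opinion

Target (aspect) of attitude: the thing being evaluated

Type of attitude: the kind of opinion expressed

From a set of types: Like, love, hate, value, desire, etc.

Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength

Text containing the attitude: the evaluated text, usually a sentence or an entire document

Finer-grained analysis also covers the evaluated attributes, sentiment (polarity) words, and attribute-opinion pairs.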

The sentiment analysis tasks we usually face fall into the following types:

Simplest task: Is the attitude of this text positive or negative?

More complex: Rank the attitude of this text from 1 to 5

Advanced: Detect the target, source, or complex attitude types

The following sections use the simplest task as the running example.

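2) A Baseline Algorithm

This subsection uses sentiment analysis of movie reviews to present a simple, practical sentiment analysis system. The task is "Polarity detection: Is an IMDB movie review positive or negative?", and the dataset is "Polarity Data 2.0: http://www.cs.cornell.edu/people/pabo/movie-review-data". The authors treat sentiment analysis as a classification task and break it into the following subtasks: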

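Tokenization: extract the main text, filter out dates, phone numbers, and the like, preserve strings that begin with a capital letter, keep emoticons, and split the text into tokens.

Feature Extraction: intuitively we might expect adjectives to determine the sentiment of a text directly, but the experiments of Pang and Lee show that using all words (unigrams) as features gives better sentiment classification.

Negation needs special handling. The sentences "I didn't like this movie" and "I really like this movie" differ by a single unigram yet mean opposite things. To handle this, Das and Chen (2001) proposed: "Add NOT_ to every word between negation and following punctuation". Under this rule, "didn't like this movie , but I" becomes "didn't NOT_like NOT_this NOT_movie, but I".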
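The negation rule quoted above can be sketched in a few lines. The set of negation words and the punctuation pattern here are assumptions chosen for illustration, not the lists used in the cited work.

```python
import re

# Hypothetical negation triggers: "not", "no", "never", and any word ending in "n't".
NEGATION = re.compile(r"^(?:not|no|never|\w+n't)$", re.IGNORECASE)
# Punctuation tokens that end the scope of a negation (also an assumption).
PUNCT = re.compile(r"^[.,:;!?]$")


def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation word and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False          # punctuation closes the negation scope
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):
                negated = True       # start marking from the next token
    return out


tokens = "didn't like this movie , but I".split()
print(" ".join(mark_negation(tokens)))
# didn't NOT_like NOT_this NOT_movie , but I
```

The marked tokens are then used as ordinary unigram features, so "like" and "NOT_like" are counted separately by the classifier.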

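In addition, when extracting features the intuition is that "Word occurrence may matter more than word frequency": the most relevant sentiment words often appear only once in a passage, so a term-frequency model helps little and can even hurt. A multivariate Bernoulli event space is therefore used in place of the multinomial event space, and experiments confirm that this works better. The paper accordingly uses binary features (whether a word occurs at all) instead of the traditional frequency features. log(freq(w)) is another option worth trying for damping the effect of frequency.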
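The three feature weightings mentioned above differ only in how a word's count is mapped to a feature value, as this minimal sketch shows. The function names are hypothetical, and the log variant uses 1 + log(freq) so that a single occurrence does not map to zero; that offset is an assumption, since the text only says log(freq(w)).

```python
import math
from collections import Counter


def frequency_features(tokens):
    """Multinomial-style features: raw count of each word."""
    return Counter(tokens)


def binary_features(tokens):
    """Occurrence features: every word that appears gets the value 1."""
    return Counter(set(tokens))


def log_features(tokens):
    """Damped frequency: 1 + log(count), an assumed variant of log(freq(w))."""
    return {w: 1.0 + math.log(c) for w, c in Counter(tokens).items()}


review = "great plot , great cast , great film".split()
print(frequency_features(review)["great"])  # 3
print(binary_features(review)["great"])     # 1
print(log_features(review)["great"])        # 1 + log(3)
```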

Classification using different classifiers: for example Naïve Bayes, MaxEnt, or SVM. Taking the Naïve Bayes classifier as an example, training proceeds as follows:

Prediction proceeds as follows:
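As a rough illustration of both stages, here is a minimal multinomial Naïve Bayes sketch. It assumes add-one (Laplace) smoothing and a four-review toy training set; the names `train_nb`, `predict_nb`, and `TRAIN` are hypothetical, and this is not claimed to be the exact procedure shown in the original figures.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns (log_prior, log_likelihood, vocab)."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    # Prior: fraction of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Add-one smoothing so unseen (word, class) pairs get non-zero probability.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(tokens, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAIN = [
    ("a great and moving film".split(), "pos"),
    ("great cast , great plot".split(), "pos"),
    ("a boring and awful film".split(), "neg"),
    ("awful plot , boring cast".split(), "neg"),
]
model = train_nb(TRAIN)
print(predict_nb("great film".split(), *model))   # pos
print(predict_nb("boring plot".split(), *model))  # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.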

Experiments show that MaxEnt and SVM achieve better results than Naïve Bayes.

Finally, a case review lets us summarize what makes sentiment classification of movie reviews hard:

Subtlety of expression: "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." and "She runs the gamut of emotions from A to B".

Thwarted expectations: the reviewer first describes the initial expectations (with lavish praise) and then the final disappointment, for example "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." and "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised."

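3) Sentiment Lexicons

Sentiment analysis models rely heavily on sentiment lexicons for extracting features or rules. The popular, mature, openly available lexicons include:

GI (The General Inquirer): this lexicon gives very comprehensive information for each entry, such as part of speech, antonyms, and positive/negative polarity. Its structure is as follows:

For details see: Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

LIWC (Linguistic Inquiry and Word Count): this lexicon describes the patterns of the different categories of sentiment words with a large number of regular expressions; its category system is largely the same as that of GI (The General Inquirer). Its structure is as follows:

For details see: Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX.

MPQA Subjectivity Cues Lexicon: contains 2,718 positive words and 4,912 negative words. Its structure is shown in the figure below:

SentiWordNet: for details see Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010.

Given this range of lexicons, how do we choose a suitable one? One approach is to compare how the same entry is classified in different lexicons and so measure how much the resources disagree, as follows:

For entries that are classified inconsistently across lexicons, there are at least two things we can do. First, review those entries and correct them with a small amount of manual work. Second, treat them as a source of words whose polarity is genuinely ambiguous.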

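Given a word, how do we determine how likely it is to appear in text of a particular sentiment class? Taking IMDB reviews grouped by rating as an example, the simplest method is to compute how often the word occurs in the text for each rating (number of stars). The figure below shows the distribution of Count("bad"):

The figure below lists the scaled likelihood of some words in the different classes, from which the orientation of each word can be judged.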
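Raw counts are hard to compare across rating classes of different sizes, which is what the scaled likelihood addresses. The sketch below assumes the usual definition: the likelihood P(w|c) is the count of w in class c divided by the number of tokens in c, and the scaled likelihood is P(w|c) / P(w). The counts in `COUNTS` are invented for illustration and are not IMDB data.

```python
from collections import Counter


def scaled_likelihood(word, counts_by_class):
    """Return {class: P(word | class) / P(word)} for per-class token counts."""
    totals = {c: sum(cnt.values()) for c, cnt in counts_by_class.items()}
    grand_total = sum(totals.values())
    word_total = sum(cnt[word] for cnt in counts_by_class.values())
    if word_total == 0:
        raise ValueError(f"{word!r} does not occur in any class")
    p_word = word_total / grand_total
    return {
        c: (cnt[word] / totals[c]) / p_word
        for c, cnt in counts_by_class.items()
    }


# Hypothetical token counts per star rating (1000 tokens in each class).
COUNTS = {
    1: Counter({"bad": 6, "other": 994}),
    5: Counter({"bad": 2, "other": 998}),
    10: Counter({"bad": 1, "other": 999}),
}
for stars, value in sorted(scaled_likelihood("bad", COUNTS).items()):
    print(stars, round(value, 3))
```

A value above 1 means the word is over-represented in that class relative to the corpus as a whole, so in this toy data "bad" leans toward the 1-star reviews.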

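A natural follow-up question: are negation words (such as not, n't, no, never) more likely to appear in negative-sentiment text? Christopher Potts (2011) answered this experimentally: "More negation in negative sentiment", as shown in the figure below: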

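4) Learning Sentiment Lexicons

Publicly available sentiment lexicons are convenient, but it is also worth understanding how they are built. First, a new sentiment analysis problem or task often requires building or extending a lexicon to fit the actual need. Second, mature lexicon construction methods can be applied in other domains, since many of the techniques carry over.

A common way to build a sentiment lexicon is semi-supervised bootstrapping, which has two main steps:

Use a small amount of information (Seed)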

A few labeled examples

A few hand-built patterns

To bootstrap a lexicon

The following papers illustrate how sentiment lexicons are built:

1. Hatzivassiloglou & McKeown: see Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174–181. The method is based on this linguistic observation: "Adjectives conjoined by 'and' have same polarity; Adjectives conjoined by 'but' do not", as in these examples:

fair and legitimate, corrupt and brutal

*fair and brutal, *corrupt and legitimate

fair but brutal

Hatzivassiloglou & McKeown (1997) proposed a bootstrapping-based learning method with four main steps:

Step 1: Label seed set of 1336 adjectives (all >20 in 21 million word WSJ corpus)

The initial seed set contains 657 positive words (such as adequate central clever famous intelligent remarkable reputed sensitive slender thriving…) and 679 negative words (such as contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…)

Step 2: Expand seed set to conjoined adjectives, as shown in the figure below:

Step 3: Supervised classifier assigns "polarity similarity" to each word pair, resulting in graph, as shown in the figure below:

Step 4: Clustering for partitioning the graph into two

The final output is a new sentiment lexicon, as follows (in the original, the automatically mined entries are set in bold):

Positive: bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…

Negative: ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…
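The "and"/"but" intuition can be illustrated with a much simpler stand-in for Steps 3 and 4. The sketch below does not implement the paper's supervised classifier or graph clustering; it only propagates seed labels across conjunction links with a breadth-first search, keeping the sign for "and" and flipping it for "but". The function name, seeds, and links are hypothetical.

```python
from collections import deque


def propagate_polarity(seeds, links):
    """seeds: {word: +1 or -1}; links: (word1, word2, 'and' | 'but') triples."""
    graph = {}
    for w1, w2, relation in links:
        sign = 1 if relation == "and" else -1   # 'but' flips the polarity
        graph.setdefault(w1, []).append((w2, sign))
        graph.setdefault(w2, []).append((w1, sign))
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbor, sign in graph.get(word, []):
            if neighbor not in polarity:        # first label reached wins
                polarity[neighbor] = polarity[word] * sign
                queue.append(neighbor)
    return polarity


SEEDS = {"fair": 1, "corrupt": -1}
LINKS = [
    ("fair", "legitimate", "and"),
    ("corrupt", "brutal", "and"),
    ("fair", "brutal", "but"),
]
print(propagate_polarity(SEEDS, LINKS))
```

Real conjunction data is noisy, which is one reason the published method scores word pairs with a classifier and clusters the resulting graph instead of trusting individual links as this sketch does.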

2. Turney Algorithm: see Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. The steps are:

Step 1: Extract a phrasal lexicon from reviews. The phrases extracted by rule are shown in the figure below:

Step 2: Learn polarity of each phrase. How should the polarity of a phrase be assessed? Intuitively, "Positive phrases co-occur more with 'excellent', Negative phrases co-occur more with 'poor'". The problem then becomes how to measure co-occurrence between terms, and for this researchers use pointwise mutual information (PMI), a common measure of how strongly two specific events are associated. The formula is:
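For reference, the standard definition of PMI and the phrase polarity built on it can be written as below. This is reconstructed from the textbook definition rather than copied from the original figure, and the symbol names are my own.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}

\mathrm{Polarity}(\mathit{phrase}) =
  \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
  - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})
```

PMI is positive when two terms co-occur more often than they would if they were independent, so a phrase that appears near "excellent" more than near "poor" receives a positive polarity.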

Step 3: Rate a review by the average polarity of its phrases

On a dataset of 410 reviews (from Epinions), of which 170 (41%) are negative and 240 (59%) are positive, the Turney Algorithm reached 74% accuracy (the baseline of labeling every review positive gives 59%).
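Steps 2 and 3 can be sketched from co-occurrence counts. When the probabilities are estimated from counts, the two PMI terms share P(phrase), so the polarity reduces to a single log ratio. All counts below are hypothetical, and the small `smoothing` constant that guards against zero counts is an assumption of this sketch.

```python
import math


def phrase_polarity(near_excellent, near_poor, hits_excellent, hits_poor,
                    smoothing=0.01):
    """PMI(phrase, 'excellent') - PMI(phrase, 'poor'), computed from counts."""
    return math.log2(
        ((near_excellent + smoothing) * hits_poor)
        / ((near_poor + smoothing) * hits_excellent)
    )


def rate_review(phrase_scores):
    """Label a review by the average polarity of its extracted phrases."""
    if not phrase_scores:
        raise ValueError("review has no scored phrases")
    average = sum(phrase_scores) / len(phrase_scores)
    return "positive" if average > 0 else "negative"


# Hypothetical counts: (co-occurrences with 'excellent', co-occurrences with 'poor').
PHRASE_COUNTS = {"low fees": (80, 20), "unethical practices": (5, 60)}
HITS_EXCELLENT, HITS_POOR = 1000, 1000

scores = [
    phrase_polarity(pos, neg, HITS_EXCELLENT, HITS_POOR)
    for pos, neg in PHRASE_COUNTS.values()
]
print(scores)
print(rate_review(scores))
```

Because the method needs only co-occurrence counts and two anchor words, it requires no labeled reviews.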

3. Using WordNet to learn polarity: see S.M. Kim and E. Hovy. 2004. Determining the sentiment of opinions. COLING 2004; and M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD, 2004.

The method has the following steps:

Create positive ("good") and negative seed-words ("terrible")

Find Synonyms and Antonyms

Positive Set: Add synonyms of positive words ("well") and antonyms of negative words

Negative Set: Add synonyms of negative words ("awful") and antonyms of positive words ("evil")

Repeat, following chains of synonyms

Filter
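The expansion loop above can be sketched with plain dictionaries standing in for WordNet. The tiny `SYNONYMS` and `ANTONYMS` tables are invented for illustration, and because the text does not say what the final filter does, dropping words that land in both sets is an assumption of this sketch.

```python
def expand_lexicon(pos_seeds, neg_seeds, synonyms, antonyms, rounds=2):
    """Grow positive and negative word sets by following synonym/antonym links."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        # Positive set: synonyms of positive words and antonyms of negative words.
        new_pos = {s for w in pos for s in synonyms.get(w, ())}
        new_pos |= {a for w in neg for a in antonyms.get(w, ())}
        # Negative set: synonyms of negative words and antonyms of positive words.
        new_neg = {s for w in neg for s in synonyms.get(w, ())}
        new_neg |= {a for w in pos for a in antonyms.get(w, ())}
        pos |= new_pos
        neg |= new_neg
    # Assumed filter: discard words that ended up with both polarities.
    conflict = pos & neg
    return pos - conflict, neg - conflict


# Hypothetical thesaurus entries standing in for WordNet lookups.
SYNONYMS = {"good": ["well", "fine"], "terrible": ["awful"], "awful": ["dreadful"]}
ANTONYMS = {"good": ["evil"], "terrible": ["wonderful"]}

positive, negative = expand_lexicon({"good"}, {"terrible"}, SYNONYMS, ANTONYMS)
print(sorted(positive))  # ['fine', 'good', 'well', 'wonderful']
print(sorted(negative))  # ['awful', 'dreadful', 'evil', 'terrible']
```

The second round is what picks up "dreadful", which is reachable only through the chain terrible → awful → dreadful.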

The methods above all adapt well across domains and are robust. Their basic idea can be summarized as "Use seeds and semi-supervised learning to induce lexicons", that is:

Start with a seed set of words ('good', 'poor')

Find other words that have similar polarity:

Using "and" and "but"

Using words that occur nearby in the same document

Using WordNet synonyms and antonyms

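5) Other Sentiment Tasks

The sections above covered document-level and sentence-level sentiment analysis. In practice, however, a single document (review) often mentions several aspects, attributes, or objects (referred to below as aspects) and may express a different orientation toward each, as in "The food was great but the service was awful". Aspects are usually extracted with a "Frequent phrases + rules" approach:

Find all highly frequent phrases across reviews ("fish tacos")

Filter by rules like "occurs right after sentiment word": "…great fish tacos" means fish tacos a likely aspect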
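The two steps above can be sketched as a frequency count followed by a single rule. The sentiment word list, the frequency threshold, the phrase length limit, and the sample reviews are all assumptions chosen to keep the example small.

```python
from collections import Counter

SENTIMENT_WORDS = {"great", "awful", "good", "bad"}  # hypothetical mini lexicon


def candidate_aspects(reviews, min_count=2, max_len=2):
    """Frequent phrases that occur right after a sentiment word at least once."""
    freq = Counter()
    after_sentiment = Counter()
    for review in reviews:
        tokens = review.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase = tuple(tokens[i:i + n])
                if any(tok in SENTIMENT_WORDS for tok in phrase):
                    continue  # an aspect should not contain a sentiment word
                freq[phrase] += 1
                if i > 0 and tokens[i - 1] in SENTIMENT_WORDS:
                    after_sentiment[phrase] += 1
    return sorted(
        " ".join(phrase)
        for phrase, count in freq.items()
        if count >= min_count and after_sentiment[phrase] > 0
    )


REVIEWS = [
    "great fish tacos",
    "awful service",
    "the fish tacos were good",
    "great service",
]
print(candidate_aspects(REVIEWS))  # ['fish', 'fish tacos', 'service']
```

Sub-phrases such as "fish" also pass this simple rule, so a practical system would add further filtering, for example keeping only the longest matching phrase.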

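Another common problem is a missing aspect, or more precisely, an aspect that does not appear in the sentence. This happens often, and the surrounding context has to be used: in reviews of a particular movie, for instance, the missing aspect is almost always the movie title or an actor. A classifier can be trained on sentences whose aspect is known and then used to predict the aspect of sentences where it is missing.

Blair-Goldensohn et al. proposed a general set of aspect-based summarization models, as shown in the figure below:

For details see: S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. Reis, and J. Reynar. 2008. Building a Sentiment Summarizer for Local Service Reviews. WWW Workshop.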

Other tasks related to sentiment analysis include:

Emotion: a person's emotional state

Detecting annoyed callers to dialogue system

Detecting confused/frustrated versus confident students

Mood: a person's mood

Finding traumatized or depressed writers

Interpersonal stances: the manner of speaking in interpersonal interaction

Detection of flirtation or friendliness in conversations

Personality traits: personality

Detection of extroverts

Source: 大数据文摘 (Big Data Digest)

Quoted from: 深度开源 open经验 (http://www.open-open.com/lib/view/open1421114964515.html)
