Analyzing the Scripts of the Original Star Wars Trilogy

0 Introduction

Star Wars is a great film series that tells a story set in a galaxy far, far away, and its influence on popular culture worldwide has been profound. Kaggle hosts the scripts of the original Star Wars trilogy; the dataset is small, but it makes for a fun exercise in text analysis.

1 Importing Packages

# Jupyter magic: render figures inline in the notebook
%matplotlib inline
# data manipulation and I/O
import pandas as pd

# base plotting library
import matplotlib.pyplot as plt
# nicer default styling
import seaborn as sns
sns.set_style("whitegrid")  # set the seaborn theme

# word clouds
from wordcloud import WordCloud
from imageio import imread

# machine learning
import gensim  # for building word2vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# stop-word removal and string-to-list conversion
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')  # stop words (run nltk.download('stopwords') once beforehand)

2 Importing the Data

Load the scripts of Episodes IV, V, and VI from three txt files into DataFrames.
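The read calls below assume whitespace-delimited files with a one-line header and quoted fields, roughly like the (hypothetical) excerpt here — the actual Kaggle files may differ in detail:

```text
"character" "dialogue"
1 "THREEPIO" "Did you hear that? They've shut down the main reactor."
2 "THREEPIO" "We're doomed!"
```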

# load the data from the txt files
SW_IV = pd.read_table('data/SW_EpisodeIV.txt', delim_whitespace=True, header=0, escapechar='\\')
SW_V = pd.read_table('data/SW_EpisodeV.txt', delim_whitespace=True, header=0, escapechar='\\')
SW_VI = pd.read_table('data/SW_EpisodeVI.txt', delim_whitespace=True, header=0, escapechar='\\')
# inspect the DataFrame
SW_IV.sample(10)

3 Data Processing

Before the analysis, do some light preprocessing: remove stop words from the dialogue column and convert each string into a list of words.

print("Stop-word list:\n{0}".format(stop))

Stop-word list:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

def prep_text(series):
    """
        Remove stop words and convert each string into a list of words.

        Args:
            series: Series

        Returns:
            Series
    """
    # lowercase and strip punctuation *before* filtering, so that
    # capitalized stop words such as "The" are removed as well
    return (series.str.replace('\'', ' ')
                  .str.lower()
                  .str.translate(str.maketrans('', '', string.punctuation))
                  .apply(lambda x: [word for word in x.split() if word not in stop]))
SW_IV['clean_text'] = prep_text(SW_IV['dialogue'])
SW_V['clean_text'] = prep_text(SW_V['dialogue'])
SW_VI['clean_text'] = prep_text(SW_VI['dialogue'])
SW_IV.head()
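The cleaning steps can be illustrated on a single sentence in plain Python. This is a minimal sketch that uses a tiny inline stop-word set as a stand-in for NLTK's full list:

```python
import string

mini_stop = {'the', 'is', 'a', 'with', 'this'}  # tiny stand-in for NLTK's stop-word list

def clean_line(text):
    # mirror prep_text: expand apostrophes, lowercase, strip punctuation, drop stop words
    text = text.replace("'", " ").lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in mini_stop]

print(clean_line("The Force is strong with this one."))  # → ['force', 'strong', 'one']
```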

Merge the three DataFrames into a single new one.

SW = pd.concat([SW_IV, SW_V, SW_VI], ignore_index=True)
# inspect the DataFrame info
SW.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2523 entries, 0 to 2522
Data columns (total 3 columns):
character 2523 non-null object
dialogue 2523 non-null object
clean_text 2523 non-null object
dtypes: object(3)
memory usage: 59.2+ KB

4 Descriptive Statistics

Next, we run simple descriptive statistics on the four DataFrames to see who has the most lines in the trilogy, and use word clouds to see which words occur most frequently.

Define two functions, character_count and SWCloud, which show the 20 characters with the most lines and generate the word clouds, respectively.

def character_count(df):
    """
        Show the 20 characters with the most lines.

        Args:
            df: DataFrame
    """
    counts = df.groupby('character').size().sort_values(ascending=False)[0:20]
    print(counts)

    df_top20 = df[df['character'].isin(counts.index)]

    sns.countplot(y="character",
                  data=df_top20,
                  palette="GnBu_d",
                  order=df_top20['character'].value_counts().index);
def SWCloud(df, cloud_mask, ep=''):
    """
        Generate a word cloud and export it as a jpg image.

        Args:
            df: DataFrame
            cloud_mask: string, fileName
            ep: string, episode
    """
    text = []

    for line in df['clean_text']:
        text.extend(line)

    join_text = " ".join(text)

    mask = imread(cloud_mask)

    cloud = WordCloud(
        background_color='white',
        mask=mask,
        max_words=1024,
        max_font_size=100
    )

    word_cloud = cloud.generate(join_text)
    word_cloud.to_file('output/SW_' + ep + '_Cloud.jpg')  # forward slash works on all platforms

    plt.figure(figsize=(8, 8))
    plt.imshow(word_cloud)
    plt.axis('off');

First, let's look at the 20 characters with the most lines in Episode IV: A New Hope, along with its word cloud.

character_count(SW_IV)

character
LUKE 254
HAN 153
THREEPIO 119
BEN 82
LEIA 57
VADER 41
RED LEADER 37
BIGGS 34
TARKIN 28
OWEN 25
TROOPER 19
GOLD LEADER 14
WEDGE 14
OFFICER 11
RED TEN 8
GOLD FIVE 7
INTERCOM VOICE 6
GREEDO 6
JABBA 6
FIRST TROOPER 6
dtype: int64

SWCloud(SW_IV, 'img/r2d2.png', 'IV')

Next up is The Empire Strikes Back, the episode in which Lord Vader delivers the famous line to Luke: "I am your father".

character_count(SW_V)

character
HAN 182
LUKE 128
LEIA 114
THREEPIO 92
LANDO 61
VADER 56
YODA 36
PIETT 23
CREATURE 21
BEN 15
RIEEKAN 13
WEDGE 8
DECK OFFICER 7
VEERS 7
ZEV 6
EMPEROR 5
OZZEL 5
NEEDA 5
JANSON 4
DACK 4
dtype: int64

SWCloud(SW_V, 'img/yoda.png', 'V')

Finally, Return of the Jedi. With Lord Vader falling to the light side of the Force, and the Empire too broke to build handrails, the Rebels win the day, press their advantage, and seize most of the Empire's territory. At this moment of peril, Grand Admiral Thrawn steps forth...

character_count(SW_VI)

character
HAN 124
LUKE 112
THREEPIO 90
LEIA 56
VADER 43
LANDO 40
EMPEROR 39
JABBA 20
BEN 18
ACKBAR 14
YODA 13
WEDGE 11
PIETT 8
BOUSHH 7
COMMANDER 7
JERJERROD 7
STORMTROOPER 6
BIB 6
NINEDENINE 6
CONTROLLER 5
dtype: int64

SWCloud(SW_VI, 'img/vader.jpg', 'VI')

Across the original trilogy, the character with the most lines is unsurprisingly the lead, Luke Skywalker, followed by Han Solo, whose screen time is hardly less than the hero's, with the chatty C-3PO in third place. Fan favorites Chewbacca and R2-D2, who only ever make strange noises, live on solely in other characters' lines.

character_count(SW)
character
LUKE           494
HAN            459
THREEPIO       301
LEIA           227
VADER          140
BEN            115
LANDO          101
YODA            49
EMPEROR         44
RED LEADER      38
BIGGS           34
WEDGE           33
PIETT           31
TARKIN          28
JABBA           26
OWEN            25
CREATURE        22
TROOPER         19
GOLD LEADER     14
ACKBAR          14
dtype: int64
SWCloud(SW, 'img/rebel alliance.png')

5 TF-IDF

Now let's see whether the Rebels use different words from the Empire. For convenience, the minor characters are grouped into three classes: Imperials, Rebels, and Neutrals. We use A New Hope as the example.

def character_group(name: str) -> str:
    """
        Group the minor characters.

        Args:
            name: string, character name

        Returns:
            string, main character name or secondary character type
    """
    rebel = ('BASE VOICE', 'CONTROL OFFICER', 'MAN', 'PORKINS', 'REBEL OFFICER', 'RED ELEVEN',
             'RED TEN', 'RED SEVEN', 'RED NINE', 'RED LEADER', 'BIGGS', 'GOLD LEADER',
             'WEDGE', 'GOLD FIVE', 'REBEL', 'DODONNA', 'CHIEF', 'TECHNICIAN', 'WILLARD',
             'GOLD TWO', 'MASSASSI INTERCOM VOICE')
    imperial = ('CAPTAIN', 'CHIEF PILOT', 'TROOPER', 'OFFICER', 'DEATH STAR INTERCOM VOICE',
                'FIRST TROOPER', 'SECOND TROOPER', 'FIRST OFFICER', 'OFFICER CASS', 
                'INTERCOM VOICE', 'MOTTI', 'TAGGE', 'TROOPER VOICE', 'ASTRO-OFFICER',
                'VOICE OVER DEATH STAR INTERCOM', 'SECOND OFFICER', 'GANTRY OFFICER', 
                'WINGMAN', 'IMPERIAL OFFICER', 'COMMANDER', 'VOICE')
    neutral = ('WOMAN', 'BERU', 'CREATURE', 'DEAK', 'OWEN', 'BARTENDER', 'CAMIE', 'JABBA', 
               'AUNT BERU', 'GREEDO', 'NEUTRAL', 'HUMAN', 'FIXER')

    if name in rebel:
        return 'Rebels'
    elif name in imperial:
        return 'Imperials'
    elif name in neutral:
        return 'Neutrals'
    else:
        return name
SW_IV['group_character'] = SW_IV['character'].apply(character_group)

print(SW_IV.groupby('group_character').size().sort_values(ascending=False))

sns.countplot(y="group_character", data=SW_IV,
              palette="GnBu_d",
              order=SW_IV['group_character'].value_counts().index);

group_character
LUKE 254
HAN 153
Rebels 139
THREEPIO 119
BEN 82
Imperials 79
Neutrals 58
LEIA 57
VADER 41
TARKIN 28
dtype: int64

Extract the relevant words with TF-IDF; each word gets a value in every row, indicating how important it is there.
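As a reminder of what TF-IDF computes, here is a toy version in plain Python using the textbook tf × idf formula. Note that scikit-learn's TfidfVectorizer additionally smooths the idf and L2-normalizes each row, so its exact numbers will differ:

```python
import math

docs = [["rebel", "base", "attack"],
        ["attack", "the", "death", "star"],
        ["the", "force", "the", "jedi"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)    # term frequency within this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log(len(docs) / df)     # inverse document frequency
    return tf * idf

print(tf_idf("attack", docs[0], docs))  # "attack" appears in 2 of 3 docs
print(tf_idf("force", docs[2], docs))   # "force" appears in only 1 doc, so it scores higher
```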

tfidf_vec = TfidfVectorizer(max_df=0.1, max_features=200, stop_words='english')

features = tfidf_vec.fit_transform(SW_IV['dialogue'])
X = pd.DataFrame(data=features.toarray(), 
                 index=SW_IV.group_character, 
                 columns=tfidf_vec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
X.sample(10)

Use PCA to display each row on a 2-D plot.

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

df_reduced = pd.DataFrame(X_reduced)
df_reduced['group_character'] = X.index
df_reduced.head(10)

Assign each character a color:

  • the core Rebels are shown in blue;
  • other Rebel personnel in cyan;
  • Vader and Tarkin in red;
  • other Imperials in magenta;
  • Neutrals in black.
def character_to_color(name: str):
    """
        Return the color assigned to a character.

        Args:
            name: string

        Returns:
            string, matplotlib color code
    """
    color = {'LUKE': 'b', 'HAN': 'b', 'THREEPIO': 'b', 'BEN': 'b', 'LEIA': 'b',
             'VADER': 'r', 'TARKIN': 'r',
             'Imperials': 'm', 'Rebels': 'c', 'Neutrals': 'k'}
    return color[name]
df_reduced['color'] = df_reduced['group_character'].apply(character_to_color)

plt.figure(figsize=(10, 10))
plt.scatter(x=df_reduced[0], y=df_reduced[1],
            color=df_reduced['color'], alpha=0.5)
plt.savefig('output/displaying_lines.jpg');

It is easy to see that blue and cyan are spread widely across the plane, while red and magenta cluster toward the lower left, which suggests the Rebels use a broader vocabulary than the Empire.

Notably, there is a lone magenta point near the top. "There is a traitor in the Empire" — let's find it.

df_reduced[(df_reduced[0]>0.1) & (df_reduced[1]>0.55) & (df_reduced[1]<0.6)]
SW_IV.loc[714]

character FIRST TROOPER
dialogue Give me regular reports.
clean_text [give, regular, reports]
group_character Imperials
Name: 714, dtype: object

6 Word2Vec

Build a Word2Vec model with gensim.

sentences_IV = SW_IV['clean_text']

model = gensim.models.Word2Vec(min_count=3, window=5, epochs=20)  # iter=20 in gensim < 4
model.build_vocab(sentences_IV)
model.train(sentences_IV, total_examples=model.corpus_count, epochs=model.epochs)

(84136, 142920)

Word2Vec builds a vector for every word in the corpus, which lets us measure how close different words are: words used in similar contexts score close to 1, while very dissimilar ones approach -1.
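The "closeness" here is cosine similarity between word vectors — the quantity that most_similar ranks by. A minimal pure-Python sketch of the formula:

```python
import math

def cosine_similarity(u, v):
    # dot product of u and v divided by the product of their norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction → -1.0
```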

model.wv.most_similar('force')

[('system', 0.9998038411140442),
('he', 0.9998025298118591),
('the', 0.9997953176498413),
('us', 0.9997825622558594),
('going', 0.9997814893722534),
('want', 0.9997798800468445),
('her', 0.9997773170471191),
('main', 0.9997760057449341),
('get', 0.9997740983963013),
('one', 0.9997738599777222)]

model.wv.most_similar(negative=['force'])

[('hello', -0.992935836315155),
('makes', -0.9964836239814758),
('worse', -0.9966601133346558),
('artoodetoo', -0.9969137907028198),
('moving', -0.9970685839653015),
('identification', -0.9971023797988892),
('rock', -0.9971140027046204),
('gonna', -0.9974701404571533),
('cover', -0.9974837899208069),
('over', -0.9975569248199463)]

Build a vocabulary list of the main characters.

characters = SW_IV['group_character'].str.lower().unique()
vocab = list(model.wv.index_to_key)  # model.wv.vocab in gensim < 4
vocab = list(filter(lambda x: x in characters, vocab))
vocab

['vader', 'luke', 'ben', 'threepio', 'han', 'leia', 'tarkin']

Build the list of vectors for that vocabulary.

X = model.wv[vocab]

Cluster the data into 3 clusters with the K-Means algorithm.

cluster_num = 3

kmeans = KMeans(n_clusters=cluster_num, random_state=0).fit(X)
cluster = kmeans.predict(X)
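For intuition, K-Means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. A minimal 1-D sketch of these Lloyd iterations (scikit-learn's KMeans adds smart initialization, multiple restarts, and convergence checks):

```python
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids = [0.0, 5.0]  # arbitrary starting centroids

for _ in range(10):  # a few Lloyd iterations are plenty here
    # assignment step: nearest centroid for every point
    labels = [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
              for p in points]
    # update step: each centroid moves to the mean of its assigned points
    centroids = [sum(p for p, l in zip(points, labels) if l == k) /
                 max(1, sum(1 for l in labels if l == k))
                 for k in range(len(centroids))]

print(labels)     # cluster index of each point
print(centroids)  # → [1.5, 10.0]
```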

Use PCA to reduce the vectors to 2 dimensions, then visualize them.

pca = PCA(n_components=2, random_state=11, whiten=True)
clf = pca.fit_transform(X)

tmp = pd.DataFrame(clf, index=vocab, columns=['x', 'y'])

tmp.head(3)
tmp['cluster'] = cluster  # cluster and vocab share the same order, so direct assignment is safe

for i in range(cluster_num):
    values = tmp[tmp['cluster'] == i]
    plt.scatter(values['x'], values['y'], alpha=0.5)

# annotate each point with its own word, i.e. the character name
for word, row in tmp.iterrows():
    plt.annotate(word, (row['x'], row['y']))

plt.axis('off')
plt.title('Star Wars Episode IV')
plt.savefig('output/w2v_map.jpg')
plt.show();