The following is a complete example showing how to use the bag-of-words model and TF-IDF for text classification.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Sample texts and labels
texts = [
"The quick brown fox jumps over the lazy dog",
"I love watching the quick brown fox",
"The dog was lazy and the fox was quick",
"This is a test document about machine learning",
"Another document about deep learning"
]
labels = [0, 0, 0, 1, 1]  # 0: fox, 1: machine learning
# Split into training and test sets (with only 5 documents and test_size=0.2, the test set holds a single example)
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
# Bag-of-words features
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
# TF-IDF features
vectorizer_tfidf = TfidfVectorizer()
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)
# Train one classifier per representation
model_bow = MultinomialNB()
model_bow.fit(X_train_bow, y_train)
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, y_train)
# Predict
y_pred_bow = model_bow.predict(X_test_bow)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
# Evaluate the models
accuracy_bow = accuracy_score(y_test, y_pred_bow)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print(f"Bag-of-words accuracy: {accuracy_bow:.2f}")
print(f"TF-IDF accuracy: {accuracy_tfidf:.2f}")
Notes
- Parameter tuning: CountVectorizer and TfidfVectorizer expose many parameters, such as max_df, min_df, and stop_words, that can be adjusted to improve model performance.
- Feature selection: in high-dimensional feature spaces, feature-selection methods (such as recursive feature elimination) can reduce the number of features and improve performance.
- Model selection: choose a model suited to the task, such as naive Bayes, a support vector machine, or a random forest.
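The first two notes can be combined in a short sketch. It reuses the corpus from the example above, tunes the vectorizer with stop_words/min_df/max_df, and then prunes features; for simplicity it uses scikit-learn's SelectKBest with a chi-squared score rather than recursive feature elimination, which works the same way through fit_transform.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "The quick brown fox jumps over the lazy dog",
    "I love watching the quick brown fox",
    "The dog was lazy and the fox was quick",
    "This is a test document about machine learning",
    "Another document about deep learning",
]
labels = [0, 0, 0, 1, 1]

# stop_words drops common English words; min_df/max_df prune rare/ubiquitous terms
vectorizer = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.9)
X = vectorizer.fit_transform(docs)

# Keep only the k features most associated with the labels (chi-squared test)
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # 5 documents, reduced to 5 features
```

On a real corpus, k (and the vectorizer parameters) would be chosen by cross-validation rather than fixed by hand.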
This example illustrates the basic usage of the bag-of-words model and TF-IDF and shows how to apply them to a text classification task.