<h1>Core Foundations of Artificial Intelligence: Machine Learning</h1><p>Chapter 18: Classic Hands-On Projects
</p><p class="image-package"><img class="uploaded-img" src="https://upload-images.jianshu.io/upload_images/30827302-98861c397c689f3f.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" width="auto" height="auto"/></p>
<p><strong>18.1 Starter Project 1: House Price Prediction (Regression)</strong></p>
<p><strong>Task objective</strong></p>
<p>Predict the price of a house from its features (area, number of rooms, location, and so on).</p>
<p><strong>Dataset:</strong> California Housing, loaded with fetch_california_housing.</p>
<pre># 1. Load the data
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing()
X, y = data.data, data.target
df = pd.DataFrame(X, columns=data.feature_names)
df['MedHouseVal'] = y
print(df.head())</pre>
<p><strong>Full workflow</strong></p>
<pre>import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import randint

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Feature scaling for the linear model.
#    Plain least squares predicts the same with or without scaling, but scaling
#    makes coefficients comparable and matters for regularized variants.
#    The scaler is fitted on the training split only to avoid leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train and evaluate the models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
results = {}
for name, model in models.items():
    # The linear model uses scaled data; the tree model uses raw data
    X_tr = X_train_scaled if name == 'Linear Regression' else X_train
    X_te = X_test_scaled if name == 'Linear Regression' else X_test

    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    results[name] = {'RMSE': rmse, 'R2': r2}
    print(f"{name} → RMSE: {rmse:.2f}, R2: {r2:.3f}")

# 5. Model tuning (random forest as the example)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': randint(2, 20)
}
rf = RandomForestRegressor(random_state=42)
random_search = RandomizedSearchCV(
    rf, param_dist, n_iter=20, cv=5,
    scoring='neg_root_mean_squared_error', n_jobs=-1,
    random_state=42  # makes the sampled parameter sets reproducible
)
random_search.fit(X_train, y_train)
print("\nBest parameters:", random_search.best_params_)

best_rf = random_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
print(f"Tuned RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.2f}")</pre>
<blockquote><p><strong>Key points</strong>:</p><ul><li><p>Regression metrics: <strong>RMSE</strong> (lower is better) and <strong>R2</strong> (closer to 1 is better)</p></li><li><p>Tree-based models <strong>do not need feature scaling</strong></p></li><li><p>Random search usually finds good settings with far fewer fits than an exhaustive grid search</p></li></ul></blockquote>
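<p>The two regression metrics named in the key points can be reproduced by hand, which makes their meaning concrete. The following is a minimal sketch on made-up numbers: the arrays y_true and y_pred are assumptions for illustration, not values from the housing data. It cross-checks the manual formulas against scikit-learn.</p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.5, 3.0, 4.0, 5.0])

# RMSE: square root of the mean squared residual (same unit as the target)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantities from scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(f"RMSE manual={rmse_manual:.4f}, sklearn={rmse_sklearn:.4f}")
print(f"R2   manual={r2_manual:.4f}, sklearn={r2_sklearn:.4f}")
```

<p>Because RMSE is expressed in the unit of the target, it is easy to interpret for a single dataset, while R2 is unitless and therefore easier to compare across datasets.</p>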
<p><strong>18.2 Starter Project 2: Customer Churn Prediction (Classification)</strong></p>
<p><strong>Task objective</strong></p>
<p>Predict whether a telecom customer will churn (binary 0/1 classification).</p>
<p><strong>Dataset:</strong> simulated below; for a real project, download a real dataset.</p>
<pre># Simulated data (download a real dataset for an actual project)
import numpy as np
import pandas as pd

np.random.seed(42)
n = 7000
tenure = np.random.randint(1, 73, n)             # months with the carrier
monthly_charges = np.random.uniform(20, 120, n)
df = pd.DataFrame({
    'tenure': tenure,
    'MonthlyCharges': monthly_charges,
    'TotalCharges': tenure * monthly_charges,
    'gender': np.random.choice(['Male', 'Female'], n),
    'Partner': np.random.choice(['Yes', 'No'], n),
    'Dependents': np.random.choice(['Yes', 'No'], n),
    # Target variable. It is drawn independently of the features, so no model
    # can beat the majority-class baseline on this simulated data.
    'Churn': np.random.binomial(1, 0.25, n)
})</pre>
<p><strong>Full workflow</strong></p>
<pre>from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Preprocessing: separate numeric and categorical features
num_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
cat_features = ['gender', 'Partner', 'Dependents']

# Build the preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(drop='first'), cat_features)
    ])

# 2. Split the data
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Model pipelines
models = {
    'Logistic Regression': Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=1000))
    ]),
    'Random Forest': Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ])
}

# 4. Train and evaluate
for name, pipeline in models.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f"{name} classification report:")
    # zero_division=0 silences the warning when a class is never predicted
    print(classification_report(y_test, y_pred, zero_division=0))</pre>
<blockquote><p><strong>Key points</strong>:</p><ul><li><p>Classification tasks focus on <strong>precision, recall, and F1</strong></p></li><li><p>Use <strong>stratify=y</strong> so the training and test sets keep the same label distribution</p></li><li><p>Encode categorical features with <strong>one-hot encoding</strong> (drop='first' avoids perfect multicollinearity among the dummy columns)</p></li></ul></blockquote>
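<p>The effect of stratify=y is easiest to see by comparing class proportions after a split. The sketch below uses hypothetical labels with 25% positives (mirroring the simulated churn rate) and a placeholder feature column; both are assumptions for illustration only.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 250 positives out of 1000
y = np.array([1] * 250 + [0] * 750)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, irrelevant to the split

# Stratified split: each side keeps the overall 25% positive rate
_, _, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Plain random split: the positive rate can drift between the two sides
_, _, y_tr_r, y_te_r = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(f"stratified: train={y_tr_s.mean():.3f}, test={y_te_s.mean():.3f}")
print(f"plain:      train={y_tr_r.mean():.3f}, test={y_te_r.mean():.3f}")
```

<p>The drift of a plain split is small at this sample size, but it grows as the dataset shrinks or the minority class becomes rarer, which is when stratification matters most.</p>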
<p><strong>18.3 Advanced Project 1: Text Sentiment Analysis</strong></p>
<p><strong>Task objective</strong></p>
<p>Decide whether a movie review is positive or negative.</p>
<p><strong>Dataset:</strong> two 20 Newsgroups categories, used here as a stand-in for sentiment labels.</p>
<pre>from sklearn.datasets import fetch_20newsgroups

# Fetch two newsgroup categories (standing in for the two sentiment classes)
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
X_train_text, y_train = newsgroups_train.data, newsgroups_train.target
X_test_text, y_test = newsgroups_test.data, newsgroups_test.target</pre>
<p><strong>Full text-processing workflow</strong></p>
<pre>import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Vectorize the text (TF-IDF)
vectorizer = TfidfVectorizer(
    max_features=5000,      # cap the vocabulary size
    stop_words='english',   # remove stop words
    ngram_range=(1, 2)      # use 1-grams and 2-grams
)
X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)

# 2. Train the models
models = {
    'Naive Bayes': MultinomialNB(alpha=0.1),
    'SVM': SVC(kernel='linear', C=1.0)
}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} accuracy: {acc:.2%}")

# 3. Inspect important features (Naive Bayes as the example)
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
feature_names = vectorizer.get_feature_names_out()

# Highest-weight terms for the "positive" class (assumed to be label 1).
# feature_log_prob_ replaces coef_, which recent scikit-learn versions
# no longer provide on MultinomialNB.
pos_log_prob = nb.feature_log_prob_[1]
top_pos = np.argsort(pos_log_prob)[-10:]
print("\nTop positive-class terms:", [feature_names[i] for i in top_pos])</pre>
<blockquote><p><strong>Key points</strong>:</p><ul><li><p>Text must be <strong>vectorized</strong>; TF-IDF is the classic method</p></li><li><p><strong>Naive Bayes</strong> suits high-dimensional sparse text</p></li><li><p><strong>SVM</strong> performs very well on text classification</p></li><li><p><strong>ngram_range=(1,2)</strong> captures phrase information (such as "not good")</p></li></ul></blockquote>
<p><strong>18.4 Advanced Project 2: User Segmentation (Unsupervised)</strong></p>
<p><strong>Task objective</strong></p>
<p>Group customers into distinct segments based on their behavior.</p>
<p><strong>Data: e-commerce user behavior (simulated)</strong></p>
<pre># Simulated user data
import numpy as np
import pandas as pd

np.random.seed(42)
n_users = 2000
df_users = pd.DataFrame({
    'age': np.random.randint(18, 70, n_users),
    'income': np.random.exponential(50000, n_users),
    'purchase_freq': np.random.poisson(5, n_users),
    'avg_order_value': np.random.gamma(2, 50, n_users)
})</pre>
<p><strong>Full clustering workflow</strong></p>
<pre>from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Preprocessing
X = df_users.values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Choose K (elbow method)
inertias = []
K_range = range(2, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()  # assume K=4 is chosen from the plot

# 3. Run the clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df_users['cluster'] = clusters

# 4. Analyze the clusters
print(df_users.groupby('cluster').mean())

# 5. Visualize (PCA projection to 2D)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis')
plt.title("User segments (PCA projection)")
plt.show()</pre>
<blockquote><p><strong>Key points</strong>:</p><ul><li><p><strong>Standardize before clustering</strong>, because the features are on different scales</p></li><li><p>Choose K with the <strong>elbow method</strong>, or with the silhouette score</p></li><li><p><strong>DBSCAN</strong> suits irregularly shaped clusters, but <strong>eps</strong> and <strong>min_samples</strong> need tuning</p></li></ul></blockquote>
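<p>The key points mention the silhouette score as an alternative to reading an elbow plot. The sketch below shows one way to use it, on synthetic data with three deliberately well-separated groups. The blob centers and the variable names X_demo, scores, and best_k are assumptions for illustration; the simulated user table above has no such built-in structure.</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three compact groups placed far apart (illustrative only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Score each candidate K; the silhouette lies in [-1, 1], higher is better
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("Selected K:", best_k)
```

<p>Unlike inertia, which always decreases as K grows, the silhouette score peaks at a specific K, so the choice can be automated instead of judged by eye.</p>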
<p><strong>18.5 Advanced Project 3: Semi-Supervised Text Classification</strong></p>
<p><strong>Task objective</strong></p>
<p>Train an accurate classifier from a small number of labeled documents plus a large pool of unlabeled ones.</p>
<p><strong>Data: 20 Newsgroups (partially labeled)</strong></p>
<pre>import numpy as np
from sklearn.datasets import fetch_20newsgroups

# Load all the data
all_data = fetch_20newsgroups(subset='all', categories=['comp.graphics', 'sci.med'])
X_text_all, y_all = all_data.data, all_data.target

# Simulation: only 5% of the documents keep their label
np.random.seed(42)
n_total = len(y_all)
n_labeled = int(0.05 * n_total)
labeled_idx = np.random.choice(n_total, size=n_labeled, replace=False)
y_semi = np.full(n_total, -1)          # -1 marks an unlabeled document
y_semi[labeled_idx] = y_all[labeled_idx]</pre>
<p><strong>Full pseudo-labeling workflow</strong></p>
<pre>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Vectorize the text
vectorizer = TfidfVectorizer(max_features=3000, stop_words='english')
X_tfidf = vectorizer.fit_transform(X_text_all)

# 2. Train the initial model (labeled data only)
unlabeled_idx = np.where(y_semi == -1)[0]
X_labeled = X_tfidf[labeled_idx]
y_labeled = y_all[labeled_idx]
lr = LogisticRegression(max_iter=1000)
lr.fit(X_labeled, y_labeled)

# 3. Generate pseudo-labels for the unlabeled documents only
proba = lr.predict_proba(X_tfidf[unlabeled_idx])
pseudo_labels = lr.classes_[proba.argmax(axis=1)]
high_conf = proba.max(axis=1) > 0.9  # keep high-confidence predictions

# 4. Build the augmented training set
pseudo_idx = unlabeled_idx[high_conf]
train_idx = np.concatenate([labeled_idx, pseudo_idx])
y_train = np.concatenate([y_labeled, pseudo_labels[high_conf]])

# 5. Joint training
lr_final = LogisticRegression(max_iter=1000)
lr_final.fit(X_tfidf[train_idx], y_train)

# 6. Evaluate on the documents whose true labels were hidden.
#    In practice, use cross-validation or hold out part of the labeled data.
X_eval, y_eval = X_tfidf[unlabeled_idx], y_all[unlabeled_idx]
test_acc = lr_final.score(X_eval, y_eval)
print(f"Semi-supervised accuracy: {test_acc:.2%}")

# Comparison: supervised learning on the 5% labels only (the model from step 2)
baseline_acc = lr.score(X_eval, y_eval)
print(f"Supervised-only (5% labels) accuracy: {baseline_acc:.2%}")</pre>
<blockquote><p><strong>Key points</strong>:</p><ul><li><p>The <strong>confidence threshold</strong> controls pseudo-label quality</p></li><li><p>The procedure can be iterated: generate pseudo-labels, retrain, and repeat</p></li><li><p>It applies to <strong>high-dimensional data such as text and images</strong></p></li></ul></blockquote>
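<p>The key points note that pseudo-labeling can be iterated. The sketch below shows one possible loop, under several assumptions: a synthetic make_classification dataset stands in for the TF-IDF matrix, 5% of the pool starts labeled, and the names y_work, threshold, and the five-round cap are illustrative choices rather than anything prescribed by the text.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized documents (illustrative only)
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=42)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Keep true labels for 5% of the pool; -1 marks "no label yet"
rng = np.random.default_rng(42)
labeled = np.zeros(len(y_pool), dtype=bool)
labeled[rng.choice(len(y_pool), size=75, replace=False)] = True
y_work = np.where(labeled, y_pool, -1)

# Baseline: train on the originally labeled samples only
baseline = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
baseline_acc = baseline.score(X_test, y_test)

threshold = 0.9  # confidence required to accept a pseudo-label
for round_no in range(5):
    known = y_work != -1
    if known.all():
        break
    model = LogisticRegression(max_iter=1000).fit(X_pool[known], y_work[known])
    unknown_idx = np.where(~known)[0]
    proba = model.predict_proba(X_pool[unknown_idx])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break  # nothing new to add, so further rounds would change nothing
    # Accept the confident predictions as labels for the next round
    y_work[unknown_idx[confident]] = model.classes_[proba[confident].argmax(axis=1)]
    print(f"round {round_no + 1}: added {int(confident.sum())} pseudo-labels")

# Final fit on true labels plus every accepted pseudo-label
known = y_work != -1
final_model = LogisticRegression(max_iter=1000).fit(X_pool[known], y_work[known])
final_acc = final_model.score(X_test, y_test)
print(f"baseline={baseline_acc:.2%}, iterated pseudo-labeling={final_acc:.2%}")
```

<p>Evaluation uses a separate held-out split, so the reported accuracy is not inflated by samples the model was trained on. Whether iteration helps depends on the data: wrong pseudo-labels accepted early are reinforced in later rounds, which is why the threshold is kept high.</p>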
<p><strong>18.6 Advanced Project 4: Self-Supervised Image Feature Extraction and Classification</strong></p>
<p><strong>Task objective</strong></p>
<p>Pretrain a feature extractor on a large set of unlabeled images, then train a classifier with a small amount of labeled data.</p>
<p><strong>Data: MNIST (simulated unlabeled set plus a few labels)</strong></p>
<pre>from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target.astype(int)

# Simulation: 60,000 unlabeled images and only 1,000 labeled ones
X_unlabeled = X[:60000]
X_labeled = X[60000:61000]
y_labeled = y[60000:61000]
X_test, y_test = X[61000:], y[61000:]</pre>
<p><strong>Autoencoder pretraining + fine-tuning</strong></p>
<pre>from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Step 1: self-supervised pretraining (autoencoder)
autoencoder = MLPRegressor(
    hidden_layer_sizes=(128, 64, 128),
    activation='relu',
    solver='adam',
    max_iter=30,
    random_state=42
)
# Training target equals the input
autoencoder.fit(X_unlabeled, X_unlabeled)

# Step 2: extract encoder features (the middle layer).
# Note: the sklearn MLP does not expose intermediate layers directly; they
# must be computed by hand or with a framework such as PyTorch.
# PCA is used here as an approximation, so the autoencoder above is not
# used downstream (a deep learning framework is advised for real projects).
pca = PCA(n_components=64)
pca.fit(X_unlabeled)  # stands in for the self-supervised features
X_labeled_features = pca.transform(X_labeled)
X_test_features = pca.transform(X_test)

# Step 3: train a classifier on the small labeled set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_labeled_features, y_labeled)

# Step 4: evaluate
test_acc = clf.score(X_test_features, y_test)
print(f"Self-supervised features + linear classifier accuracy: {test_acc:.2%}")

# Comparison: train directly on raw pixels (no pretraining)
clf_raw = LogisticRegression(max_iter=1000)
clf_raw.fit(X_labeled, y_labeled)
raw_acc = clf_raw.score(X_test, y_test)
print(f"Accuracy without pretraining: {raw_acc:.2%}")</pre>
<blockquote><p><strong>In practice</strong> (recommended):</p><ul><li><p>Pretrain with <strong>SimSiam or MAE implemented in PyTorch</strong></p></li><li><p>Extract the <strong>backbone features</strong></p></li><li><p>Train an <strong>SVM or Logistic Regression in Scikit-learn</strong> for the downstream classification</p></li></ul></blockquote>
<pre># Pseudocode (a deep learning framework is required in practice)
# features = sim_siam_encoder(unlabeled_images)  # self-supervised pretraining
# clf = LogisticRegression().fit(features[labeled_idx], y_labeled)
# acc = clf.score(features[test_idx], y_test)</pre>
<blockquote><p><strong>Advantage</strong>:
With only 1,000 labels, self-supervised pretraining can raise accuracy by roughly <strong>10 to 20%</strong>, although the actual gain depends on the dataset and the pretraining method.</p></blockquote>
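<p>The workflow above notes that the scikit-learn MLP does not expose its middle layer directly. One way to obtain the bottleneck activations anyway is to replay the forward pass by hand from the fitted coefs_ and intercepts_ attributes. The sketch below is a minimal illustration under stated assumptions: it swaps MNIST for the small bundled digits dataset so that it runs quickly, uses a smaller (32, 16, 32) architecture, and the helper encode is a hypothetical name, not a scikit-learn function.</p>

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

# 8x8 digit images bundled with scikit-learn, pixels scaled to [0, 1]
X_digits = load_digits().data / 16.0

# Small autoencoder: 64 inputs -> 32 -> 16 (bottleneck) -> 32 -> 64 outputs
autoencoder = MLPRegressor(
    hidden_layer_sizes=(32, 16, 32),
    activation="relu",
    solver="adam",
    max_iter=50,        # few iterations; a convergence warning is expected
    random_state=42,
)
autoencoder.fit(X_digits, X_digits)  # target equals input


def encode(model, X, n_encoder_layers=2):
    """Replay the first n_encoder_layers of the fitted MLP (hypothetical helper)."""
    h = X
    weights = model.coefs_[:n_encoder_layers]
    biases = model.intercepts_[:n_encoder_layers]
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)  # ReLU, matching activation="relu"
    return h


codes = encode(autoencoder, X_digits)
print("bottleneck features:", codes.shape)
```

<p>The resulting codes array could replace the PCA features as input to the downstream classifier. For convolutional encoders or methods such as SimSiam and MAE, a deep learning framework remains the practical choice, as recommended above.</p>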
<p><strong>Chapter Summary: Project Selection Guide</strong></p>
<p><strong>Recommendations</strong>:</p><ul><li><p><strong>Deployment</strong>: save the model with <strong>joblib</strong> and serve it through an API built with Flask or Django</p></li><li><p><strong>Monitoring</strong>: track data drift</p></li><li><p><strong>MLOps</strong>: track experiments with MLflow</p></li></ul><h1>Resources</h1><p>WeChat official account: 咚咚王
gitee: https://gitee.com/wy18585051844/ai_learning</p><p class="image-package"><img class="uploaded-img" src="https://upload-images.jianshu.io/upload_images/30827302-da9ee364cc5b6185.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240" width="auto" height="auto"/></p>