26. Python Machine Learning

26.1 Machine Learning Overview



Machine learning application scenarios


26.2 Data Sources and Types

26.2.1 Data sources

  • Large volumes of data accumulated by enterprises (most visibly Internet companies)
  • Data held by governments
  • Experimental data from research institutions
  • … ...

Data types:

  • Discrete data: data obtained by counting the individuals in each category, also called count data. All such values are integers; they cannot be subdivided further, and their precision cannot be increased.
  • Continuous data: the variable can take any value within a range, that is, its values are continuous, for example length, time, or mass. Such values are usually non-integers with a fractional part.
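The distinction between discrete and continuous data shows up directly in the array dtype. The following is a minimal sketch; the variable names and values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example values, not taken from any real dataset.
visits_per_user = np.array([3, 0, 7, 2])        # discrete: counts, always whole numbers
heights_cm = np.array([172.4, 165.0, 180.25])   # continuous: measurements with fractions

print(visits_per_user.dtype.kind)  # 'i' stands for a signed integer dtype
print(heights_cm.dtype.kind)       # 'f' stands for a floating-point dtype

# Programmatic check of which kind of data each array holds
is_discrete = np.issubdtype(visits_per_user.dtype, np.integer)
is_continuous = np.issubdtype(heights_cm.dtype, np.floating)
print(is_discrete, is_continuous)
```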

26.2.2 Available datasets

26.2.3 Structure of common datasets

Structure: feature values + target value
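To make the "feature values + target value" structure concrete, here is a small sketch using the iris dataset bundled with Scikit-learn. The choice of iris is an assumption made for illustration; the original text does not name a specific dataset.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # X: feature values, y: target values

print(X.shape)             # (150, 4): 150 samples, 4 features each
print(y.shape)             # (150,): one target value per sample
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # names of the target classes
```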


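26.3 Feature Engineering

26.3.1 Introduction to Scikit-learn

  • A machine learning toolkit for Python
    • Scikit-learn implements many well-known machine learning algorithms
    • Scikit-learn has thorough documentation, is easy to pick up, and offers a rich API, which has made it popular in academia.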

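26.3.2 Feature Processing

  • Numeric data:
    • Standard scaling:
      • Normalization (rescale to a fixed range such as [0, 1])
      • Standardization (shift to zero mean, scale to unit variance)
    • Missing values
  • Categorical data: one-hot encoding
  • Time data: splitting timestamps into components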
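The processing steps listed above could be sketched with Scikit-learn and pandas roughly as follows. The toy table, its column names, and the choice of mean imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical toy table with one numeric, one categorical and one time column.
df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0],
    "city": ["A", "B", "A", "C"],
    "signup": pd.to_datetime(["2020-01-15", "2020-06-01", "2021-03-20", "2021-12-31"]),
})

# Missing values: replace NaN with the column mean (one possible strategy).
age = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Normalization: rescale the column to the range [0, 1].
age_minmax = MinMaxScaler().fit_transform(age)

# Standardization: zero mean and unit variance.
age_std = StandardScaler().fit_transform(age)

# One-hot encoding: one 0/1 column per category.
city_onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# Time splitting: break a timestamp into separate numeric components.
time_parts = pd.DataFrame({
    "year": df["signup"].dt.year,
    "month": df["signup"].dt.month,
    "weekday": df["signup"].dt.dayofweek,  # Monday is 0
})

print(age_minmax.ravel())
print(age_std.ravel())
print(city_onehot)
print(time_parts)
```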

26.4 Experiments

Logistic Regression
In:

import numpy as np
X = np.random.rand(1000, 4)  # random features, shape (1000, 4)
y = np.random.randint(2, size=1000)  # random 0/1 labels, drawn independently of X

In:

# Split into training and test sets
from sklearn.model_selection import train_test_split
# help(train_test_split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In:

from sklearn.linear_model import LogisticRegression

In:

# help(LogisticRegression)
logic = LogisticRegression()  # build the model
# penalty='l2': regularization penalty, helps avoid overfitting
# max_iter=100: at most 100 iterations

In:

# Train the model
logic.fit(x_train, y_train)  # x_train holds the features, y_train the target values

out:

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
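The FutureWarning above asks for an explicit solver. A minimal sketch of doing so is shown below; the fixed seeds and the choice of 'lbfgs' are assumptions made so the example is reproducible, not settings from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same kind of random data as above, but seeded for reproducibility (assumption).
rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = rng.randint(2, size=1000)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Naming the solver explicitly avoids the default-solver FutureWarning.
logic = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=100)
logic.fit(x_train, y_train)

print(logic.coef_.shape)       # one row of weights, one weight per feature
print(logic.intercept_.shape)  # a single intercept for the binary problem
```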

In:

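# Evaluate the model
from sklearn.model_selection import cross_val_score
# cross_val_score refits a clone of the model on each of the 4 folds of the data it is given
score = cross_val_score(logic, x_test, y_test, cv=4)
print(score.mean())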

out:

0.558623449379752
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

In:

# Predict
y_predict = logic.predict(x_test)  # x_test holds the test-set features

In:

y_predict

out:

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0])

In:

logic.score(x_test,y_test)

out:

0.49
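An accuracy close to 0.5 is expected here, because the labels y were generated at random and carry no information about X. One way to check this is to compare against a trivial baseline. The sketch below uses DummyClassifier, which is not part of the original notebook, with seeds fixed as an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = rng.randint(2, size=1000)  # labels are independent of the features
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(x_train, y_train)
model = LogisticRegression(solver="lbfgs").fit(x_train, y_train)

baseline_acc = baseline.score(x_test, y_test)
model_acc = model.score(x_test, y_test)
print(baseline_acc, model_acc)  # both should hover around chance level
```

If a model does not clearly beat such a baseline, its score says more about the data than about the algorithm.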

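Decision Tree
In:

from sklearn.tree import DecisionTreeClassifier

  • criterion: gini, entropy
  • max_depth: maximum depth
  • min_samples_split: minimum number of samples required to split a node
  • min_samples_leaf: minimum number of samples in a leaf node
  • max_leaf_nodes: maximum number of leaf nodes
  • max_features: maximum number of features considered per split

In:

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold
dt = DecisionTreeClassifier()  # base estimator to tune (default hyperparameters)
# Newer scikit-learn releases reject random_state when shuffle=False; pass shuffle=True there
skcv = StratifiedKFold(n_splits=4, random_state=33)
grid_params = {'max_features': [3, 4], 'criterion': ['gini', 'entropy']}
# help(GridSearchCV)
gs = GridSearchCV(dt, param_grid=grid_params, cv=skcv)
gs.fit(x_train, y_train)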

out:

GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=33, shuffle=False),
             error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_features': [3, 4]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In:

gs.best_params_  # best parameter combination

out:

{'criterion': 'gini', 'max_features': 3}

In:

gs.best_score_

out:

0.5

In:

dt_best = DecisionTreeClassifier(criterion='gini',max_features=3)
dt_best.fit(x_train,y_train)

out:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=3, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In:

dt_best.score(x_test,y_test)  # score the tuned tree that was fitted above

out:

0.48
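Rebuilding the tree by hand from best_params_ is not strictly necessary: with the default refit=True, GridSearchCV already retrains the best configuration on the whole training set and exposes it as best_estimator_. A self-contained sketch follows; the seeds and shuffle=True are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = rng.randint(2, size=1000)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# shuffle=True so that random_state actually takes effect.
skcv = StratifiedKFold(n_splits=4, shuffle=True, random_state=33)
grid_params = {"max_features": [3, 4], "criterion": ["gini", "entropy"]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid=grid_params, cv=skcv)
gs.fit(x_train, y_train)

# The refitted best model can be used directly.
best_tree = gs.best_estimator_
test_acc = best_tree.score(x_test, y_test)
print(gs.best_params_, gs.best_score_, test_acc)
```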

Random Forest
In:

from sklearn.ensemble import RandomForestClassifier

  • n_estimators: number of estimators (trees)
  • criterion: gini, entropy
  • max_depth: maximum depth
  • min_samples_split: minimum number of samples required to split a node
  • min_samples_leaf: minimum number of samples in a leaf node
  • max_leaf_nodes: maximum number of leaf nodes
  • max_features: maximum number of features considered per split

In:

rf = RandomForestClassifier()

In:

rf.fit(x_train,y_train)

out:

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In:

rf.score(x_test,y_test)

out:

0.51

In:

rf.feature_importances_

out:

array([0.23488722, 0.24854995, 0.25496364, 0.26159919])
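The four importances above are nearly equal, which is what one would expect when no feature is related to the random labels. The sketch below pairs importances with feature names and ranks them; the names f0 to f3, the seed, and n_estimators=100 are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = rng.randint(2, size=1000)

# Setting n_estimators explicitly also avoids the default-value FutureWarning.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

total_importance = rf.feature_importances_.sum()  # importances are normalized
print(total_importance)
```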

XGBoost
In:

pip install xgboost

out:

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/29/31/580e1a2cd683fa219b272bd4f52540c987a5f4be5d28ed506a87c551667f/xgboost-1.1.1-py3-none-win_amd64.whl (54.4MB)
Requirement already satisfied: scipy in d:\programdata\anaconda3\lib\site-packages (from xgboost) (1.3.1)
Requirement already satisfied: numpy in d:\programdata\anaconda3\lib\site-packages (from xgboost) (1.16.5)
Installing collected packages: xgboost
Successfully installed xgboost-1.1.1
Note: you may need to restart the kernel to use updated packages.

In:

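from xgboost import XGBClassifier

  • max_depth: maximum tree depth
  • learning_rate: learning rate
  • gamma: minimum loss reduction required to make a further split
  • reg_alpha, reg_lambda: L1 and L2 regularization weights

In:

xgb = XGBClassifier()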

In:

xgb.fit(x_train,y_train)

out:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In:

xgb.score(x_test,y_test)

out:

0.535

In:

y_pred = xgb.predict(x_test)

In:

# Feature importance
from xgboost import plot_importance
plot_importance(xgb)

out:

<matplotlib.axes._subplots.AxesSubplot at 0x18d508fbe88>

In:

# Evaluation metrics
from sklearn.metrics import classification_report,f1_score
rep = classification_report(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
print(f1)

out:

0.5753424657534247
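The F1 score is the harmonic mean of precision and recall. The sketch below checks that relationship on a tiny hand-made example; the label vectors are hypothetical and unrelated to the XGBoost predictions above.

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)

# F1 computed by hand from precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, f1_manual)

# The report lists precision, recall and F1 per class in one table.
print(classification_report(y_true, y_pred))
```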

Naive Bayes
In:

from sklearn.naive_bayes import GaussianNB

In:

# help(GaussianNB)
nb = GaussianNB()
nb.fit(x_train,y_train)

out:

GaussianNB(priors=None, var_smoothing=1e-09)

In:

nb.score(x_test,y_test)

out:

0.61

SVM

Parameters to tune

  • C: penalty factor
  • gamma
  • kernel: linear, rbf

In:

from sklearn.svm import SVC
svm = SVC()
svm.fit(x_train,y_train)

out:

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In:

svm.score(x_test,y_test)

out:

0.6
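The three parameters listed for tuning (C, gamma, kernel) could be searched with GridSearchCV in the same way as for the decision tree. This is a sketch under assumptions: the candidate values, the smaller sample size, and the seeds are illustrative choices, not the author's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(300, 4)           # smaller sample so the search runs quickly
y = rng.randint(2, size=300)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate values are illustrative; gamma is ignored by the linear kernel.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto"],
    "kernel": ["linear", "rbf"],
}
gs = GridSearchCV(SVC(), param_grid=param_grid, cv=3)
gs.fit(x_train, y_train)

print(gs.best_params_)
print(gs.best_score_)             # mean cross-validated accuracy of the best setting
print(gs.score(x_test, y_test))   # accuracy of the refitted best model on the test set
```

Passing gamma explicitly, as in the grid above, also avoids the FutureWarning about the changing default.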

Neural Network
In:

from sklearn.neural_network import MLPClassifier

Important parameters:

  • activation: activation function, relu by default
  • hidden_layer_sizes: number of neurons in each hidden layer
  • learning_rate: learning-rate schedule for weight updates (the initial step size is set by learning_rate_init)

In:

mlp = MLPClassifier()
mlp.fit(x_train,y_train)
mlp.score(x_test,y_test)

out:

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:585: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
0.495
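The ConvergenceWarning means the optimizer stopped at the default limit of 200 iterations. Two common responses are to raise max_iter and to let training stop early once a held-out validation score stops improving. The sketch below combines both with feature scaling; the layer size, seeds, and sample size are assumptions, and on random labels none of this is expected to raise accuracy above chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = rng.randint(2, size=500)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

mlp = make_pipeline(
    StandardScaler(),              # neural networks usually train better on scaled inputs
    MLPClassifier(
        hidden_layer_sizes=(16,),  # one hidden layer with 16 neurons (illustrative)
        max_iter=1000,             # allow more iterations than the default 200
        early_stopping=True,       # stop when the validation score stops improving
        random_state=0,
    ),
)
mlp.fit(x_train, y_train)

acc = mlp.score(x_test, y_test)
n_iter = mlp.named_steps["mlpclassifier"].n_iter_  # iterations actually run
print(acc, n_iter)
```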