I. Application scenarios of logistic regression
Ad click-through prediction, spam detection, disease diagnosis, financial fraud detection, fake-account detection. These are all binary classification problems.
II. How logistic regression works
1. Input
The input to logistic regression is the output of a linear regression model:
h(w) = w1*x1 + w2*x2 + ... + wn*xn + b
2. Activation function
1) The sigmoid function
The linear regression output is fed into the sigmoid function:
g(z) = 1 / (1 + e^(-z))
The output is a probability in the interval (0, 1); the default decision threshold is 0.5.
2) Note:
Logistic regression makes its final classification from the predicted probability of one class: that class is labeled 1 (the positive class) and the other 0 (the negative class). By default, the class with fewer samples is treated as the positive class.
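As a quick numerical check (a minimal sketch with made-up linear outputs), the sigmoid maps any real value into (0, 1), and the 0.5 threshold turns the probabilities into class labels:

```python
import numpy as np

def sigmoid(z):
    """Map a linear-regression output z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear outputs w^T x + b for three samples
z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)                    # roughly [0.119, 0.5, 0.953]
# Apply the default 0.5 threshold to get class labels
labels = (probs >= 0.5).astype(int)   # [0, 1, 1]
```

Note that sigmoid(0) is exactly 0.5, which is why 0.5 is the natural default threshold.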
3. Loss function
1) Log-likelihood loss
The loss used by logistic regression is called the log-likelihood loss. Let p = g(h(w)) be the predicted probability for a sample. Then, per sample:
cost = -log(p) if y = 1; cost = -log(1 - p) if y = 0
2) The complete loss over all m samples is:
cost = -Σ_{i=1}^{m} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
3) Intuition: -log(p) approaches 0 as p approaches 1 and grows without bound as p approaches 0. So reducing the loss means increasing the sigmoid output for positive examples and decreasing it for negative examples.
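This behavior can be checked numerically. The sketch below uses hypothetical predicted probabilities:

```python
import math

def log_likelihood_loss(y, p):
    """Per-sample log-likelihood loss: -log(p) if y == 1, else -log(1 - p)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# Positive sample (y = 1): a higher predicted probability gives a lower loss
loss_hi = log_likelihood_loss(1, 0.9)   # about 0.105
loss_lo = log_likelihood_loss(1, 0.3)   # about 1.204
# Negative sample (y = 0): a lower predicted probability gives a lower loss
loss_neg = log_likelihood_loss(0, 0.1)  # about 0.105
```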
4. Optimization
Gradient descent is again used to minimize the loss, updating the weight parameters of the underlying linear model. This raises the predicted probability for samples that truly belong to class 1 and lowers it for samples that truly belong to class 0.
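The update described above can be sketched from scratch. The toy implementation below (hypothetical data, batch gradient descent on the log loss) is for illustration only, not a production solver:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Minimal batch gradient descent for logistic regression (sketch).

    X: array of shape (n_samples, n_features); y: labels in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n              # gradient of the log loss w.r.t. w
        grad_b = np.mean(p - y)                 # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy data (hypothetical)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
```

After training, the learned boundary separates the two groups, so `preds` matches `y`.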
III. Logistic regression API
sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
- solver: optimization method (the default 'liblinear' is an open-source library that iteratively optimizes the loss with coordinate descent)
  - 'sag': stochastic average gradient descent, suited to large datasets
- penalty: type of regularization
- C: inverse of the regularization strength
By default, the class with fewer samples is treated as the positive class.
IV. Case study: cancer classification prediction
Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
def logisticregression():
    """Logistic regression for breast cancer prediction."""
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values (encoded as "?")
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values (drop the ID column and the target column)
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize; the test set must reuse the training-set statistics
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # Train the logistic regression model
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    print("Logistic regression weights:", lr.coef_)
    print("Logistic regression intercept:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    print(pre_result)
    # Accuracy on the test set
    score = lr.score(x_test, y_test)
    print(score)
if __name__ == '__main__':
    logisticregression()
V. Evaluation for binary classification: precision (Precision) and recall (Recall)
1. Precision:
The proportion of samples predicted positive that are truly positive (how accurate the positive predictions are):
Precision = TP / (TP + FP)
2. Recall:
The proportion of truly positive samples that are predicted positive (how completely positives are found; it measures the model's ability to identify positive samples):
Recall = TP / (TP + FN)
3. F1-score
The harmonic mean of precision and recall; it reflects the robustness of the model:
F1 = 2 * Precision * Recall / (Precision + Recall)
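Worked through on hypothetical confusion counts, the three metrics look like this:

```python
# Hypothetical confusion counts: 8 true positives, 2 false positives, 4 false negatives
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 0.8: how accurate the positive predictions are
recall = tp / (tp + fn)     # about 0.667: how many real positives are found
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)  # 8/11, about 0.727
```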
4. Model evaluation API
sklearn.metrics.classification_report(y_true, y_pred, target_names=None)
- y_true: true target values
- y_pred: target values predicted by the estimator
- target_names: names of the target classes
- return: precision and recall for each class
5. Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
def logisticregression():
    """Logistic regression for breast cancer prediction."""
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values (encoded as "?")
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values (drop the ID column and the target column)
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize; the test set must reuse the training-set statistics
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # Train the logistic regression model
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    # Model parameters learned from the training set
    # print("Logistic regression weights:", lr.coef_)
    # print("Logistic regression intercept:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    # print(pre_result)
    # Accuracy on the test set
    score = lr.score(x_test, y_test)
    print(score)
    # Precision (Precision) and recall (Recall)
    report = classification_report(y_test, pre_result, target_names=["benign", "malignant"])
    print(report)
if __name__ == '__main__':
    logisticregression()
VI. ROC curve and AUC
Question: how do we evaluate a classifier when the classes are imbalanced?
1. TPR and FPR
TPR = TP / (TP + FN)
  Of all samples whose true class is 1, the proportion predicted as class 1 (true positive rate).
FPR = FP / (FP + TN)
  Of all samples whose true class is 0, the proportion predicted as class 1 (false positive rate).
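Both rates can be computed directly from predicted and true labels; the sketch below uses hypothetical labels:

```python
import numpy as np

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives: 2
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives: 4

tpr = tp / (tp + fn)  # 2/3: fraction of real positives predicted positive
fpr = fp / (fp + tn)  # 1/5: fraction of real negatives predicted positive
```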
2. The ROC curve
The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal, the classifier predicts class 1 with the same probability regardless of whether the true class is 1 or 0; in that case the AUC is 0.5.
3. The AUC metric
The probabilistic meaning of AUC: randomly draw one positive and one negative sample; AUC is the probability that the positive sample receives the higher score. In practice AUC ranges from 0.5 to 1, and higher is better:
- AUC = 1: a perfect classifier. With this model, any threshold yields perfect predictions. In the vast majority of settings, no perfect classifier exists.
- 0.5 < AUC < 1: better than random guessing. With a well-chosen threshold, the model has predictive value.
- AUC = 0.5: no better than random guessing (e.g. a coin flip); the model has no predictive value.
- AUC < 0.5: worse than random guessing, but simply inverting every prediction makes it better than random, so in effect AUC < 0.5 does not occur.
The final AUC therefore lies in [0.5, 1], and the closer to 1 the better.
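The pairwise interpretation can be verified against sklearn's implementation; the scores below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores: a higher score should mean "more likely positive"
y_true = np.array([1, 1, 0, 0, 0])
y_score = np.array([0.9, 0.4, 0.6, 0.3, 0.1])

# AUC as the probability that a random positive outranks a random negative
pairs = [(p, n) for p in y_score[y_true == 1] for n in y_score[y_true == 0]]
manual_auc = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                      for p, n in pairs])  # 5 of 6 pairs ranked correctly
sk_auc = roc_auc_score(y_true, y_score)
```

Both computations give 5/6: of the 2 x 3 = 6 positive/negative pairs, only (0.4, 0.6) is ranked wrongly.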
4. AUC API
from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score)
- Computes the area under the ROC curve, i.e. the AUC value.
- y_true: the true class of each sample; must be labeled 0 (negative) / 1 (positive)
- y_score: the predicted score or probability of each sample
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,roc_auc_score
def logisticregression():
    """Logistic regression for breast cancer prediction."""
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values (encoded as "?")
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values (drop the ID column and the target column)
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize; the test set must reuse the training-set statistics
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # Train the logistic regression model
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    # Model parameters learned from the training set
    # print("Logistic regression weights:", lr.coef_)
    # print("Logistic regression intercept:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    # print(pre_result)
    # Accuracy on the test set
    score = lr.score(x_test, y_test)
    print(score)
    # Precision (Precision) and recall (Recall)
    report = classification_report(y_test, pre_result, target_names=["benign", "malignant"])
    print(report)
    # AUC: map the labels (2 = benign, 4 = malignant) to 0/1 and score with
    # the predicted probability of the positive class
    y_test = np.where(y_test > 2.5, 1, 0)
    y_score = lr.predict_proba(x_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_score)
    print(auc_score)
if __name__ == '__main__':
    logisticregression()
5. Summary
- AUC can only be used to evaluate binary classifiers.
- AUC is well suited to evaluating classifier performance on imbalanced samples.
- AUC compares the predicted scores/probabilities, not just the predicted labels.
- An AUC above 0.7 generally indicates a reasonably good classifier.
VII. Notes on scikit-learn's gradient-descent estimators
scikit-learn puts the solvers based on gradient descent into separate estimators, SGDClassifier and SGDRegressor. Their losses are the usual classification and regression losses: for classification, log loss and hinge loss (SVM); for regression, e.g. mean squared error. The other estimators optimize the same losses but not by gradient descent, so for large-scale data the SGD-based estimators are generally the ones to use.