Logistic Regression
Pros:
Low computational cost; easy to implement and understand.
Cons:
Prone to underfitting; classification accuracy may be low.
Applicable data types:
Numeric and nominal data.
Binary Classification
Separating data in a two-dimensional plane (http://pan.baidu.com/s/1skSMVXr password: q9av):

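As the figure shows, the points are the sample data, and a straight line can roughly separate them. Finding such a line is the goal. The line is given by:
f(X) = W*X + b
X is the feature vector and W is the weight vector; we need to find the values of W and b.
Generalizing the formula to N dimensions:
f(x1,x2,x3...xn) = w1*x1 + w2*x2 + w3*x3 +...+ wn*xn + b*1
==> f(x) = W*X
# W = [w1,w2,w3...wn,b]
# X = [x1,x2,x3...xn,1]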
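The step of folding b into W can be checked numerically. The following is a minimal sketch; the function names and the sample values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def linear_score(x, w, b):
    """Compute w1*x1 + ... + wn*xn + b with an explicit bias term."""
    return float(np.dot(w, x) + b)

def augmented_score(x, w, b):
    """Compute the same score with b folded into W and a constant 1 appended to X."""
    W = np.append(w, b)      # W = [w1, ..., wn, b]
    X = np.append(x, 1.0)    # X = [x1, ..., xn, 1]
    return float(np.dot(W, X))

# Hypothetical values, chosen only for illustration.
x = np.array([2.0, -1.0, 0.5])
w = np.array([0.4, 0.3, -2.0])
b = 0.7

print(linear_score(x, w, b), augmented_score(x, w, b))
```

Both calls give the same score, so the bias can be treated as one more weight. The position of the constant 1 does not matter as long as W and X agree: loadDataset further down puts 1.0 first, so there weights[0] plays the role of b.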
Sigmoid
The sigmoid function:
sigmoid(x) = 1 / (1 + e^(-x))

Graph of the sigmoid function:

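As the figure shows, the function is bounded between 0 and 1: it approaches 1 for large positive inputs and 0 for large negative inputs, without reaching either. At x = 0 its value is 0.5, and the closer the input is to 0, the faster the function value changes.
Substituting f(x) into the sigmoid: if f(x) > 0, then sigmoid(f(x)) > 0.5.
If f(x) < 0, then sigmoid(f(x)) < 0.5. We look for a W that puts the sample points of both classes on their correct side of the line, with the points nearest the line as far from it as possible. The sigmoid value can also be read as the probability that the sample belongs to class 1.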
Code implementation:
import numpy as np

def sigmoid(x):
    # Map any real-valued score (scalar or array) into the interval (0, 1).
    return 1 / (1 + np.exp(-x))
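A quick check of the properties described above may help. This sketch repeats the sigmoid definition so that it runs on its own; the input scores are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Arbitrary scores f(x), from clearly negative to clearly positive.
scores = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
probs = sigmoid(scores)

# A score above 0 maps to a probability above 0.5, i.e. class 1.
predicted = (probs > 0.5).astype(int)

for s, p, c in zip(scores, probs, predicted):
    print(f"f(x) = {s:+.1f}  sigmoid = {p:.4f}  class = {c}")
```

The output values rise monotonically with the score, and sigmoid(-s) = 1 - sigmoid(s), which is why 0.5 is the natural threshold between the two classes.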
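Gradient Ascent
Gradient ascent is used to find the best value of W.
The gradient is written ∇. The gradient of f(x, y) is:
∇f(x, y) = (∂f/∂x, ∂f/∂y)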



An example:

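Gradient ascent re-estimates the direction of movement at every point it reaches. Starting from p0, it computes the gradient at that point and moves along it to the next point, p1. At p1 the gradient is computed again, and the algorithm moves along the new direction to p2. The iteration repeats until a stopping condition is met. At each step the gradient gives the direction of steepest increase at the current point. This only fixes the direction, not how far to move. That distance is called the step size, written α. In vector form, the iteration is:
W := W + α * ∇f(W)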
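The iteration can be tried on a small function whose maximum is known. The sketch below assumes f(x, y) = -(x - 1)^2 - (y + 2)^2, which peaks at (1, -2); the function, starting point, step size, and step count are all illustrative choices, not taken from the original notes.

```python
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = -(x - 1)^2 - (y + 2)^2 at point p = (x, y)."""
    x, y = p
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

def gradient_ascent(p0, alpha=0.1, steps=200):
    """Repeat p := p + alpha * grad_f(p) for a fixed number of steps."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p + alpha * grad_f(p)   # the gradient is re-evaluated at every new point
    return p

p_final = gradient_ascent([4.0, 3.0])
print(p_final)   # ends up very close to the maximum at (1, -2)
```

A fixed step count is used here as the stopping condition. Stopping once the gradient norm drops below a tolerance would be an equally reasonable choice.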

Gradient ascent implementation:
def gradAscent(dataset, labels):
    # Accept lists or arrays. Labels are always reshaped to an (m, 1) column,
    # otherwise `labels - h` would broadcast to an (m, m) matrix.
    dataset = np.asarray(dataset, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64).reshape(-1, 1)
    m, n = dataset.shape
    alpha = 0.001       # step size
    maxCycles = 500     # number of full passes over the data
    weights = np.ones((n, 1))
    x = [[1.0] for _ in range(n)]   # history of each weight, starting at its initial value 1
    y = 0                           # number of updates performed
    for i in range(maxCycles):
        h = sigmoid(np.dot(dataset, weights))   # predicted probabilities, shape (m, 1)
        error = labels - h
        weights = weights + alpha * np.dot(dataset.transpose(), error)
        for k in range(n):
            x[k].append(weights[k, 0])
        y += 1
    return weights, x, y
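The notes do not state which function gradAscent is climbing. Under the usual assumption that it is the log-likelihood of the labels, the update `weights + alpha * dataset.T.dot(labels - h)` follows from the derivation below. Here X denotes the m×n data matrix (`dataset` in the code), x_i its i-th row, y the label vector, and h the vector of sigmoid outputs.

```latex
\begin{aligned}
h_i &= \sigma(W^\top x_i) = \frac{1}{1 + e^{-W^\top x_i}} \\
\ell(W) &= \sum_{i=1}^{m} \bigl[\, y_i \log h_i + (1 - y_i) \log (1 - h_i) \,\bigr] \\
\nabla_W \ell(W) &= \sum_{i=1}^{m} (y_i - h_i)\, x_i = X^\top (y - h) \\
W &\leftarrow W + \alpha\, X^\top (y - h)
\end{aligned}
```

The third line uses the identity σ'(z) = σ(z)(1 - σ(z)), which cancels the denominators coming from the two logarithms.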
Decision boundary

How W changes during training

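Stochastic Gradient Ascent
Gradient ascent has to traverse the whole dataset every time it updates the regression coefficients. With billions of samples and thousands of features, that is too expensive. One improvement is to update the coefficients using only one sample at a time; this is called stochastic gradient ascent. Because the classifier can be updated incrementally as new samples arrive, stochastic gradient ascent is an online learning algorithm.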
Stochastic gradient ascent, implementation V1
# Stochastic gradient ascent, version 1
def stocGradAscent0(dataset, labels):
    dataset = np.asarray(dataset, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    m, n = dataset.shape
    alpha = 0.001       # fixed step size
    weights = np.ones(n)
    x = [[1.0] for _ in range(n)]   # history of each weight
    y = 0                           # number of updates performed
    for i in range(m):              # a single pass, one sample per update
        h = sigmoid(np.dot(dataset[i], weights))   # scalar prediction for sample i
        error = labels[i] - h
        weights = weights + alpha * error * dataset[i]
        for k in range(n):
            x[k].append(weights[k])
        y += 1
    return weights, x, y
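Plotting the decision boundary, V1

With only 100 samples, a single pass gives just 100 weight updates, so the weights have not yet converged and the classification is rather poor.
How W changes during training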

Stochastic gradient ascent, implementation V2
# Stochastic gradient ascent, version 2
def stocGradAscent1(dataset, labels, numIter=150):
    from random import sample
    dataset = np.asarray(dataset, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    m, n = dataset.shape
    weights = np.ones(n)
    x = [[1.0] for _ in range(n)]   # history of each weight
    y = 0                           # number of updates performed
    for j in range(numIter):
        index = sample(range(m), m)   # visit the samples in a fresh random order each pass
        for i in range(m):
            # The step size shrinks as training proceeds but never drops below 0.01.
            alpha = 4 / (1.0 + i + j) + 0.01
            h = sigmoid(np.dot(dataset[index[i]], weights))
            error = labels[index[i]] - h
            weights = weights + alpha * error * dataset[index[i]]
            for k in range(n):
                x[k].append(weights[k])
            y += 1
    return weights, x, y
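Plotting the decision boundary

The result is comparable to that of gradient ascent, while using fewer computational resources (150 passes over the data by default, versus 500 for gradAscent).
How W changes during training

Stochastic gradient ascent with random sample selection and a dynamically decreasing alpha converges faster than the version with a fixed alpha.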
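To see what the decreasing alpha looks like, the expression from stocGradAscent1 can be evaluated on its own. The helper name alpha_schedule is an assumption introduced for this sketch; the formula is the one used in the code.

```python
def alpha_schedule(i, j):
    """Step size used in stocGradAscent1 for sample position i within pass j."""
    return 4 / (1.0 + i + j) + 0.01

# Step size at the start and end of a few passes over m = 100 samples.
m = 100
for j in (0, 1, 10, 149):
    start = alpha_schedule(0, j)
    end = alpha_schedule(m - 1, j)
    print(f"pass {j:3d}: alpha goes from {start:.4f} to {end:.4f}")
```

Two things stand out. The constant 0.01 keeps alpha from ever reaching 0, so late samples still influence the weights. Alpha is also not strictly decreasing over the whole run: i resets to 0 at the start of each pass, so alpha jumps back up before shrinking again, with smaller jumps as j grows.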
Plotting functions
def loadDataset():
    # Each line of testSet.txt holds: x1 x2 label
    dataset = []
    labels = []
    with open('testSet.txt') as f:
        for line in f:
            line = line.strip().split()
            # Prepend a constant 1.0 so that weights[0] acts as the bias term.
            dataset.append([1.0, float(line[0]), float(line[1])])
            labels.append(int(line[2]))
    return dataset, labels

def plotBestFit(dataset=None, labels=None):
    import matplotlib.pyplot as plt
    if dataset is None:
        dataset, labels = loadDataset()
    dataset = np.array(dataset, dtype=np.float32)
    num = dataset.shape[0]
    # Pick one of the three training functions:
    # weights, wx, wy = gradAscent(dataset, labels)
    # weights, wx, wy = stocGradAscent0(dataset, labels)
    weights, wx, wy = stocGradAscent1(dataset, labels)
    weights = np.ravel(weights)   # gradAscent returns shape (n, 1); flatten for plotting
    xcord1 = []
    ycord1 = []
    xcord0 = []
    ycord0 = []
    for i in range(num):
        if labels[i] == 1:
            xcord1.append(dataset[i][1])
            ycord1.append(dataset[i][2])
        else:
            xcord0.append(dataset[i][1])
            ycord0.append(dataset[i][2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord0, ycord0, s=30, c='red', marker='s')
    ax.scatter(xcord1, ycord1, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    # Decision boundary: w0 + w1*x1 + w2*x2 = 0, solved for x2.
    y = (-weights[0] - weights[1]*x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()
    plotW(wx, wy)

def plotW(x, y):
    # x holds the history of each of the three weights; y is the number of updates.
    import matplotlib.pyplot as plt
    plt.figure(1)
    ax1 = plt.subplot(311)
    ax2 = plt.subplot(312)
    ax3 = plt.subplot(313)
    ax1.plot(range(y+1), x[0])
    ax2.plot(range(y+1), x[1])
    ax3.plot(range(y+1), x[2])
    plt.show()
Classification function
def classify(inx, weights):
    # inx is one sample in the same layout as the training rows: [1.0, x1, x2].
    prob = sigmoid(np.dot(inx, weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
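One way to use classify is to measure the error rate on labelled samples. The sketch below is self-contained, so it repeats sigmoid and classify. The error_rate helper, the weights, and the four samples are hypothetical values made up for illustration, not results from testSet.txt.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def classify(inx, weights):
    """Return 1.0 if the predicted probability of class 1 exceeds 0.5, else 0.0."""
    return 1.0 if sigmoid(np.dot(inx, weights)) > 0.5 else 0.0

def error_rate(dataset, labels, weights):
    """Fraction of samples whose predicted class differs from the label."""
    dataset = np.asarray(dataset, dtype=float)
    predictions = [classify(row, weights) for row in dataset]
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Hypothetical weights and samples; each row is [1.0, x1, x2].
weights = np.array([0.0, 1.0, -1.0])
samples = [[1.0, 2.0, 0.5], [1.0, -1.0, 1.0], [1.0, 0.5, 3.0], [1.0, 3.0, 1.0]]
labels = [1, 0, 0, 0]

print(error_rate(samples, labels, weights))   # one of the four samples is misclassified
```

Measuring the error on the training samples themselves tends to be optimistic; holding out part of the data for testing gives a fairer estimate.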