I. Algorithm Overview

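K-nearest neighbors (kNN, k-NearestNeighbor) is one of the simplest classification methods in data mining. The name reflects the idea that each sample can be represented by its k closest neighbors.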

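The core idea of kNN is that if the majority of a sample's k nearest neighbors in feature space belong to a certain class, then the sample also belongs to that class and shares the characteristics of the samples in that class. In other words, kNN classifies by measuring the distance between feature values.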

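kNN is an instance-based learning method and a form of lazy learning: it has no explicit learning process, that is, no training phase. The data set already carries class labels and feature values, and a new sample is processed directly when it arrives.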

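II. Implementation Steps

1) Compute the distance between the test sample and every training sample;

2) Sort the training samples by increasing distance;

3) Select the K points with the smallest distances;

4) Count how often each class occurs among these K points;

5) Return the most frequent class among these K points as the predicted class of the test sample.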
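
The five steps can be sketched on a tiny data set as follows. This is only an illustrative sketch: the function name `knn_predict` and the toy points are assumptions made for this example, and it ranks neighbors by squared Euclidean distance, which gives the same ordering as the plain Euclidean distance.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by following the five steps listed above."""
    # 1) squared Euclidean distance from the query to every training sample
    dists = np.sum((train_x - query) ** 2, axis=1)
    # 2) indices of the training samples, sorted by increasing distance
    order = np.argsort(dists)
    # 3) keep the k closest points
    nearest = order[:k]
    # 4) count how often each class appears among them
    counts = Counter(train_y[nearest].tolist())
    # 5) return the most frequent class
    return counts.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labelled 0 and 1
train_x = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(train_x, train_y, np.array([5.1, 5.0]), k=3))  # 1
```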

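III. Practical Considerations

1. Choosing K

The choice of K matters a great deal. When K is too small, any noise in the data has a large effect on the prediction; when K is too large, the prediction is made from training instances in a larger neighborhood, so the approximation error of learning increases.

A common way to choose k is to start from k=1 and estimate the classifier's error rate on a validation set. Repeat the process, increasing K by 1 each time so that one more neighbor is admitted, and pick the K that yields the lowest error rate. In general k does not exceed 20, and its upper bound is the square root of n (n being the number of training samples); as the data set grows, K should grow as well. In addition, K is usually chosen to be odd to reduce ties.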
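
A minimal sketch of this selection procedure is shown below. The data, the 70/30 split, and the helper `knn_predict_all` are assumptions made for illustration. The sketch combines the rules of thumb from the paragraph above by trying only odd values of k up to min(20, sqrt(n)).

```python
import numpy as np


def knn_predict_all(train_x, train_y, queries, k):
    """Predict a 0/1 label for every query row by majority vote of k neighbors."""
    # Pairwise squared distances, shape (n_queries, n_train)
    d2 = ((queries[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    # With 0/1 labels, the mean of the neighbors' labels is the share of class 1
    return (train_y[nearest].mean(axis=1) > 0.5).astype(int)


rng = np.random.default_rng(0)
# Hypothetical two-class data: two Gaussian blobs in 2-D
x = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out 30% of the shuffled data as a validation set
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:140], idx[140:]

n_train = len(train_idx)
k_max = min(20, int(np.sqrt(n_train)))  # rule of thumb from the text
errors = {}
for k in range(1, k_max + 1, 2):  # odd k only, so a two-class vote cannot tie
    pred = knn_predict_all(x[train_idx], y[train_idx], x[val_idx], k)
    errors[k] = float(np.mean(pred != y[val_idx]))

best_k = min(errors, key=errors.get)  # k with the lowest validation error
print(errors)
print("best k:", best_k)
```

The validation error for each k depends on the random data, so the chosen k will differ from one data set to another.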

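2. Choosing the Distance

Euclidean distance is the usual choice. The squared Euclidean distance between two sample points is the sum of the squared differences over all dimensions.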
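
As a small worked example of this definition, the snippet below computes the squared distance and the distance between two made-up three-dimensional points (the vectors `a` and `b` are chosen only for illustration).

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Squared Euclidean distance: sum of squared per-dimension differences
sq_dist = np.sum((a - b) ** 2)  # (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0
dist = np.sqrt(sq_dist)

print(sq_dist)  # 25.0
print(dist)     # 5.0

# np.linalg.norm computes the same Euclidean distance directly
assert np.isclose(dist, np.linalg.norm(a - b))
```

Because the square root is monotonic, ranking neighbors by squared distance gives the same order as ranking by distance, so an implementation can skip the square root.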

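IV. Evaluation of the Algorithm

Advantages

1. Simple, easy to understand, and easy to implement; no parameters to estimate and no training required;

2. Suitable for classifying rare events;

3. Particularly suited to multi-class problems (multi-modal, where objects carry multiple class labels), where kNN can perform better than SVM.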

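Disadvantages

1. The main weakness in classification shows up when the classes are imbalanced: if one class has many samples and the others have few, the K neighbors of a new sample may be dominated by the large class. Strictly speaking, the algorithm only looks at the "nearest" neighbors, so a large class affects the result only to the extent that its samples actually lie close to the target sample; the class size alone does not determine the prediction.

2. Another weakness is the heavy computation: for every sample to be classified, the distance to all known samples must be computed before its K nearest neighbors can be found.

3. Poor interpretability: it cannot produce explicit rules the way a decision tree does.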
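
To make the computational cost in point 2 concrete, the sketch below finds the k nearest neighbors of one query by brute force. The helper name `k_nearest_indices` and the random data are assumptions for illustration. Vectorizing the distance computation and using `np.argpartition` instead of a full sort lowers the constant factors, but every query still touches every training sample.

```python
import numpy as np


def k_nearest_indices(train_x, query, k):
    """Indices of the k training rows closest to query (in no particular order)."""
    # Still one distance per training sample: O(n * d) work for every query
    d2 = np.sum((train_x - query) ** 2, axis=1)
    # argpartition places the k smallest distances first without a full sort
    return np.argpartition(d2, k - 1)[:k]


rng = np.random.default_rng(1)
train_x = rng.normal(size=(10_000, 2))  # hypothetical training set
query = np.array([0.0, 0.0])

fast = k_nearest_indices(train_x, query, k=5)
# Reference answer from a full sort, for comparison
slow = np.argsort(np.sum((train_x - query) ** 2, axis=1))[:5]
print(sorted(fast.tolist()) == sorted(slow.tolist()))
```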

V. Implementing kNN in Python

import operator

import numpy as np


# Wrap the classifier in a class
class KNN(object):
    def __init__(self, k=3):
        self.k = k

    # kNN is lazy: "fitting" only stores the training data
    def fit(self, x, y):
        self.x = np.asarray(x)
        self.y = np.asarray(y)

    # Squared Euclidean distance between two vectors
    def _square_distance(self, v1, v2):
        return np.sum(np.square(v1 - v2))

    # Majority vote over the labels of the nearest neighbors
    def _vote(self, ys):
        vote_dict = {}
        for y in ys:
            if y not in vote_dict:
                vote_dict[y] = 1
            else:
                vote_dict[y] += 1
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)
        return sorted_vote_dict[0][0]

    # Predict a label for every row of x
    def predict(self, x):
        y_pred = []
        for i in range(len(x)):
            # Distance from sample i to every stored training sample
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]
            sorted_index = np.argsort(dist_arr)
            top_k_index = sorted_index[:self.k]
            y_pred.append(self._vote(ys=self.y[top_k_index]))
        return np.array(y_pred)

    # Accuracy; when called without arguments, score on the stored training data
    def score(self, y_true=None, y_pred=None):
        if y_true is None or y_pred is None:
            y_pred = self.predict(self.x)
            y_true = self.y
        score = 0
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                score += 1
        score /= len(y_true)
        return score

# Generate data
np.random.seed(666)

data_size_1 = 300  # two groups are generated; the first has 300 sample points
x1_1 = np.random.normal(loc=5, scale=1, size=data_size_1)  # first feature dimension
x2_1 = np.random.normal(loc=4, scale=1, size=data_size_1)  # second feature dimension
y_1 = [0 for i in range(data_size_1)]

data_size_2 = 400  # the second group has 400 sample points
x1_2 = np.random.normal(loc=10, scale=2, size=data_size_2)
x2_2 = np.random.normal(loc=8, scale=2, size=data_size_2)
y_2 = [1 for j in range(data_size_2)]

# Concatenate the two groups
x1 = np.concatenate((x1_1, x1_2), axis=0)
x2 = np.concatenate((x2_1, x2_2), axis=0)
x = np.hstack((x1.reshape(-1, 1), x2.reshape(-1, 1)))
y = np.concatenate((y_1, y_2), axis=0)

# Shuffle the data
data_size_all = data_size_2 + data_size_1
shuffled_index = np.random.permutation(data_size_all)
x = x[shuffled_index]
y = y[shuffled_index]

# Split into training and test sets (70% / 30%)
split_index = int(data_size_all * 0.7)
x_train = x[:split_index]
y_train = y[:split_index]
x_test = x[split_index:]
y_test = y[split_index:]

# Min-max normalization
# Note: the test set is scaled with its own min/max here; reusing the
# training set's min/max is the more common practice.
x_train = (x_train - np.min(x_train, axis=0)) / (np.max(x_train, axis=0) - np.min(x_train, axis=0))
x_test = (x_test - np.min(x_test, axis=0)) / (np.max(x_test, axis=0) - np.min(x_test, axis=0))

clf = KNN(k=3)
clf.fit(x_train, y_train)

score_train = clf.score()
print('Train Accuracy: {:.3}'.format(score_train))

y_test_pred = clf.predict(x_test)
print('Test Accuracy:{:.3}'.format(clf.score(y_test, y_test_pred)))

The output is:

Train Accuracy: 0.988

Test Accuracy:0.991

# The code is mainly based on the work of the uploader (up主) rocktsunami

Machine Learning Introduction, Week 1: KNN Algorithm Principles and Implementation
