2024-03-27 k-means Clustering

Introduction

Clustering algorithms are unsupervised machine learning models: they learn an optimal partition of the data, or a set of discrete labels, directly from the intrinsic properties of the data itself.
The simplest clustering algorithm, k-means, rests on two assumptions:

  • The cluster center is the arithmetic mean of all the points belonging to that cluster.
  • Each point is closer to its own cluster center than to any other cluster center.
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs

# Generate four well-separated blobs of points to cluster
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Plot the points colored by cluster assignment, plus the fitted cluster centers
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
[Figure: the four blobs colored by k-means assignment, with cluster centers marked as large translucent black dots]

k-means identifies the four clusters automatically. It does so using the expectation-maximization (EM) algorithm:

  1. Guess some initial cluster centers
  2. Repeat until converged:
    • E-step (expectation): assign each point to its nearest cluster center
    • M-step (maximization): set each cluster center to the mean of the points assigned to it
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        
        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');
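
The outcome of this EM loop depends on the initial random guess. As a quick illustration (an addition, not in the original text), rerunning find_clusters with a different seed can converge to a noticeably poorer local optimum; which seeds do so depends on the data, so rseed=0 below is only an arbitrary choice for demonstration:

# A different initial guess may land in a worse local optimum.
# rseed=0 is an arbitrary value chosen for illustration; try several.
centers_bad, labels_bad = find_clusters(X, 4, rseed=0)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels_bad, s=50, cmap='viridis')
plt.scatter(centers_bad[:, 0], centers_bad[:, 1], c='black', s=200, alpha=0.5);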

Drawbacks of k-means:

  • The result is not guaranteed to be the global optimum; it depends on the initial guess (illustrated just above).
  • The number of clusters must be specified in advance (a silhouette-score sketch for choosing it follows the spectral-clustering example below).
  • It can only draw linear cluster boundaries.
  • It can be slow for large datasets (the color-compression example below uses MiniBatchKMeans to mitigate this).

For nonlinear boundaries, a kernel transformation can project the data into a higher-dimensional space where a linear separation becomes possible. SpectralClustering does this by using a graph of nearest neighbors to compute a higher-dimensional representation of the data, then assigning labels with k-means:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)

# k-means can only draw linear boundaries, so it splits the two moons incorrectly
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');

# Spectral clustering with a nearest-neighbor affinity separates the moons correctly
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis');
[Figure: clustering results on the two-moons data]
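
Because k must be fixed in advance, a common heuristic (an addition, not covered in the original text) is to score several candidate values, for example with the silhouette coefficient from sklearn.metrics. Below is a minimal sketch on the blob data from the beginning, regenerated here as X_blobs so it does not clash with the two-moons X above; the candidate range 2–6 is an arbitrary assumption:

from sklearn.metrics import silhouette_score

# Regenerate the four blobs used earlier
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=0).fit_predict(X_blobs)
    # Silhouette score: higher means tighter, better-separated clusters
    print(k, silhouette_score(X_blobs, labels_k))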

Example: handwritten digits

Cluster the 1797 samples of 64-dimensional data into 10 clusters, then show the cluster centers, the accuracy, and the confusion matrix.

from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)               # (1797, 64)

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)   # (10, 64)

# Show each cluster center as an 8x8 image
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

from scipy.stats import mode

# k-means labels are arbitrary cluster indices, so match each cluster to the
# most common true digit it contains before scoring the accuracy.
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

from sklearn.metrics import accuracy_score
print(accuracy_score(digits.target, labels))

import seaborn as sns
from sklearn.metrics import confusion_matrix

# Confusion matrix between the true digit labels and the cluster-derived labels
plt.figure()
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
[Figure: confusion matrix of true digit labels versus k-means-derived labels]

Preprocessing with the t-distributed stochastic neighbor embedding (t-SNE) algorithm to project the 64 dimensions down to 2 before clustering improves the accuracy:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print(accuracy_score(digits.target, labels))

Example: image color compression

The image is stored as a three-dimensional array of shape (height, width, RGB), where each element is an integer from 0 to 255 encoding the red, green, and blue channels; the concrete shape here is (427, 640, 3).
Running k-means on the pixel space (the feature matrix) reduces the 256^3 ≈ 16.7 million possible colors to just 16. The MiniBatchKMeans algorithm is used here; it operates on subsets of the dataset and is therefore faster.

from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
print(china.shape)        # (427, 640, 3)

# Rescale pixel values to [0, 1] and flatten to a (n_pixels, 3) feature matrix
data = china / 255
data = data.reshape(427 * 640, 3)
print(data.shape)         # (273280, 3)

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
# Replace each pixel's color with the center of the cluster it belongs to
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);
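
As a quick sanity check (an addition, not part of the original), counting the distinct RGB triplets before and after recoloring shows the effect of the quantization directly:

# Number of distinct colors in the original pixels vs. the recolored pixels.
# new_colors contains at most 16 distinct rows by construction.
print(len(np.unique(data, axis=0)))
print(len(np.unique(new_colors, axis=0)))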

References:
[1] VanderPlas, Jake. Python Data Science Handbook (Python數(shù)據(jù)科學(xué)手冊(cè)) [M]. Posts & Telecom Press (人民郵電出版社), 2018.
