Introduction
Clustering algorithms are unsupervised machine learning models: they learn an optimal partition of the data, or a set of discrete labels, directly from the intrinsic properties of the data itself.
The simplest k-means clustering algorithm makes two assumptions:
- the cluster center is the arithmetic mean of all the points in the cluster
- each point is closer to its own cluster center than to any other cluster center
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 300 points in 4 well-separated blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Color the points by cluster and overlay the learned centers
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
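As a quick side check (my own addition, not in the original text), the two assumptions listed above can be verified numerically on a fitted model:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
y_kmeans = kmeans.predict(X)

# Distance from every point to every center: shape (300, 4)
dists = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)

# Assumption 1: each center is (to numerical tolerance) the mean of its points
for j in range(4):
    assert np.allclose(kmeans.cluster_centers_[j],
                       X[y_kmeans == j].mean(axis=0), atol=1e-6)

# Assumption 2: each point is assigned to its nearest center
assert np.array_equal(dists.argmin(axis=1), y_kmeans)
```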

(Figure: the four clusters colored by label, with the learned cluster centers marked in black)
k-means identifies the 4 clusters automatically. It uses the expectation-maximization (E-M) algorithm:
- Guess some cluster centers
- Repeat until convergence:
  - E-step (expectation): assign each point to its nearest cluster center
  - M-step (maximization): set each cluster center to the mean of its assigned points
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose initial cluster centers
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. Assign labels based on the closest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. Find new centers as the means of the assigned points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
Limitations of k-means:
- the result is not guaranteed to be the global optimum
- the number of clusters must be specified in advance
- it can only draw linear cluster boundaries
- it is slow on large datasets
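Since the number of clusters must be chosen up front, a common heuristic (a sketch of my own, not from the original text) is the elbow method: record the inertia (within-cluster sum of squared distances) for several values of k and look for where the decrease levels off.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Inertia always shrinks as k grows, but the gains level off past the
# true cluster count (here around k=4), forming an "elbow" in the curve
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
for k, v in zip(range(1, 9), inertias):
    print(k, round(v, 1))
```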
Nonlinear boundaries can be handled with a kernel transform that projects the data into a higher-dimensional space where a linear separation becomes possible. SpectralClustering does this by using a nearest-neighbor graph to compute a higher-dimensional representation of the data:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=0.05, random_state=0)

# k-means can only draw a linear boundary, so it splits the moons badly
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')

from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
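To quantify the difference (a hedged side check, not part of the original text), the adjusted Rand index can compare each clustering against the true moon labels while ignoring label permutations:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(200, noise=0.05, random_state=0)

# k-means draws a straight boundary, so it cuts across both moons
km_ari = adjusted_rand_score(
    y, KMeans(2, n_init=10, random_state=0).fit_predict(X))

# the nearest-neighbor graph lets spectral clustering follow the curved shape
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        assign_labels='kmeans', random_state=0)
sc_ari = adjusted_rand_score(y, sc.fit_predict(X))
print(round(km_ari, 2), round(sc_ari, 2))
```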

(Figure: k-means splits the two moons with a straight line, while spectral clustering with a nearest-neighbor graph separates them correctly)
Example: handwritten digits
Cluster 1,797 64-dimensional samples into 10 clusters, then show the cluster centers, the accuracy, and the confusion matrix.
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)  # (1797, 64)

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)  # (10, 64)

# Show the 10 cluster centers as 8x8 images
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
from scipy.stats import mode

# Cluster ids are arbitrary: relabel each cluster with the most common
# true digit among its members, then score the accuracy
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

from sklearn.metrics import accuracy_score
print(accuracy_score(digits.target, labels))
import seaborn as sns
from sklearn.metrics import confusion_matrix

plt.figure()
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
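Besides mode-based relabeling, scikit-learn also offers permutation-invariant clustering metrics; as a side note (not in the original text), the adjusted Rand index scores the raw cluster assignment directly:

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
clusters = KMeans(n_clusters=10, n_init=10,
                  random_state=0).fit_predict(digits.data)

# ARI is 1.0 for a perfect match and ~0.0 for random labels,
# with no manual mapping from cluster ids to digit classes needed
ari = adjusted_rand_score(digits.target, clusters)
print(round(ari, 3))
```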

(Figure: confusion matrix heatmap of true digit labels versus predicted cluster labels)
Preprocessing with the t-distributed stochastic neighbor embedding (t-SNE) algorithm (projecting the 64 dimensions down to 2) improves the accuracy:
from sklearn.manifold import TSNE

# Project the 64-dimensional data down to 2 dimensions
tsne = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Cluster the projected data and relabel as before
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print(accuracy_score(digits.target, labels))
Example: image color compression
The image is stored as a three-dimensional (height, width, RGB) array, each element an integer from 0 to 255 encoding the red, green, and blue channels; here its shape is (427, 640, 3).
Running k-means on the pixel space (the feature matrix) reduces the large number of distinct colors to just 16. MiniBatchKMeans is used because it computes on subsets of the data, which is much faster than standard k-means.
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
print(china.shape)  # (427, 640, 3)

# Rescale colors to [0, 1] and flatten to a (n_pixels, 3) feature matrix
data = china / 255
data = data.reshape(427 * 640, 3)
print(data.shape)  # (273280, 3)

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)

# Replace each pixel with the nearest of the 16 cluster-center colors
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16)
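As a rough check of the compression (my own addition, not from the book), one can count the distinct colors before and after quantization:

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.cluster import MiniBatchKMeans

china = load_sample_image("china.jpg")
pixels = (china / 255).reshape(-1, 3)

kmeans = MiniBatchKMeans(n_clusters=16, n_init=3, random_state=0).fit(pixels)
new_colors = kmeans.cluster_centers_[kmeans.predict(pixels)]

# Every recolored pixel is one of at most 16 center colors
n_before = len(np.unique(pixels, axis=0))
n_after = len(np.unique(new_colors, axis=0))
print(n_before, "->", n_after)
```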
References:
[1] VanderPlas, Jake. Python Data Science Handbook [M]. 人民郵電出版社, 2018.