開(kāi)始

本文使用
■Python2.7
■numpy 1.11
■scipy 0.17
■scikit-learn 0.18
■matplotlib 1.5
■seaborn 0.7
■pandas 0.17。

本文所用代碼都在jupyter notebook上確認(rèn)沒(méi)有問(wèn)題。（用jupyter notebook的時(shí)候請(qǐng)指定%matplotlib inline）。
本文參考了scikit-learn的《流形學(xué)習(xí)》的內(nèi)容。

我們用scikit-learn的手寫(xiě)數(shù)字識(shí)別示例來(lái)說(shuō)明所謂流形學(xué)習(xí)的方法。特別是可以用來(lái)做高維數(shù)據(jù)可視化的方法，比如t-SNE方法在Kaggle競(jìng)賽中有時(shí)就會(huì)用到。但是這些方法并不是只用在可視化方面，當(dāng)這些方法結(jié)合了原始數(shù)據(jù)和壓縮后的數(shù)據(jù)，可以提高單純的分類問(wèn)題的精度。

from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30

n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

image.png

準(zhǔn)備Digit數(shù)據(jù)映射用的函數(shù)。
下面的代碼跟文章的主題關(guān)系不大，如果理解起來(lái)困難的話，可以跳過(guò)。

def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

2. 著眼于線性成分的降維

這里介紹的方法，因?yàn)橛?jì)算成本較低，通用性高，經(jīng)常要用到。
例如，用PCA算法抽出數(shù)據(jù)之間的相關(guān)性非常方便。這里的方法很多資料有詳細(xì)的說(shuō)明，本文中就不在展開(kāi)介紹了。

2.1. Random Projection
在第1節(jié)中已經(jīng)取得64維的手寫(xiě)數(shù)字?jǐn)?shù)據(jù)。
這樣高維數(shù)據(jù)進(jìn)行映射的最基本方法是Random Projection，中文叫隨機(jī)投影。
非常簡(jiǎn)單的方法，用隨機(jī)數(shù)來(lái)設(shè)定將M維的數(shù)據(jù)映射到N維的矩陣R中的元素。這樣做的好處是計(jì)算量非常小。

print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits")
#plt.scatter(X_projected[:, 0], X_projected[:, 1])

image.png

2.2. PCA
用PCA映射，是一般的維度壓縮方法?？梢猿槌鲎兞块g的相關(guān)成分。
這里我們使用scikit-learn中提供的函數(shù)TruncatedSVD。這個(gè)函數(shù)跟PCA不同的地方好像只是輸入數(shù)據(jù)是否經(jīng)過(guò)正規(guī)化。

print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time %.2fs)" %(time() - t0))

image.png

2.3 Linear Discriminant Analysis
可以通過(guò)Linear Discriminant Analysis（線性判別分析）來(lái)進(jìn)行降維。前提是各個(gè)變量都服從多元正態(tài)分布，同一組變量具有相同的協(xié)方差矩陣。使用的距離是馬氏距離（Mahalanobis distance）。
跟PCA相似，這個(gè)算法也屬于監(jiān)督學(xué)習(xí)。

print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0))

image.png

3. 考慮非線性成分的降維

剛剛介紹的3種方法，對(duì)于具有層級(jí)構(gòu)造的數(shù)據(jù)和含有非線性成分的數(shù)據(jù)進(jìn)行降維是不適用的。比如這里介紹一下像swiss roll一樣的線性無(wú)關(guān)數(shù)據(jù)的2維圖像。

image.png

3.1 Isomap
Isomap是考慮非線性數(shù)據(jù)進(jìn)行降維和聚類的方法之一。
沿著流形的形狀計(jì)算叫做測(cè)地距離的距離，根據(jù)這個(gè)距離來(lái)進(jìn)行多尺度縮放。
這里可以用Scikit-learn的Isomap函數(shù)。計(jì)算距離的階段，可以使用BallTree函數(shù)。

print("Computing Isomap embedding")
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.2 Locally Linear Embedding（LLE）
即使從全局來(lái)看，包含非線性的流形，從局部著眼來(lái)看的話也有基于直覺(jué)的線性降維方法。下面的例子中，可以看到標(biāo)簽0被突出的顯示出來(lái)，其他的標(biāo)簽還是混亂的。

print("Computing LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_lle,
               "Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.3. Modified Locally Linear Embedding
LLE本身存在問(wèn)題，所以就有了用正規(guī)化來(lái)來(lái)改良的算法。
從改良的算法結(jié)果可以看出標(biāo)簽0被清楚地分類，標(biāo)簽4,1,5也被清晰地映射出來(lái)。

print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_mlle,
               "Modified Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.4 Hessian LLE Embedding
LLE本身存在問(wèn)題，所以就有了用正規(guī)化來(lái)來(lái)改良的算法。
這個(gè)是第二個(gè)修改方案。

print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_hlle,
               "Hessian Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.5 Spectral Embedding
這是也可以稱為拉普拉斯特征映射（Laplacian Eigenmaps）的一種壓縮方法。
這里沒(méi)有對(duì)專門的內(nèi)容進(jìn)行調(diào)查，好像其中使用了光譜圖理論。
映射的形式與LLE，MLLE，HLLE都不同，標(biāo)簽組之間的距離以及密集度似乎表現(xiàn)上類似。

print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0,
                                      eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)

plot_embedding(X_se,
               "Spectral embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.6 Local Tangent Space Alignment
這個(gè)算法的結(jié)果很像MLLE，HLLE的反轉(zhuǎn)。分類結(jié)果很相似的感覺(jué)。

print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_ltsa,
               "Local Tangent Space Alignment of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.7 Multi-dimensional Scaling (MDS)
這是一種叫做多維縮放分析的降維方法。
MDS本身是多種方法的總稱，這里不做詳細(xì)說(shuō)明。

print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
               "MDS embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.8 t-distributed Stochastic Neighbor Embedding (t-SNE)
將各個(gè)點(diǎn)之間的歐幾里得距離變換成條件概率而不是相似度，并將其映射到低維。
有一種叫Barnes-Hut t-SNE的方法，犧牲精度換取降低計(jì)算成本。在Sklearn中，可以選擇exact（重視精度）和Barnes-Hut兩種選擇。默認(rèn)是選擇Barnes-Hut方法?？梢酝ㄟ^(guò)設(shè)定參數(shù)angle選項(xiàng)的值來(lái)調(diào)參。
Kaggle比賽中頻繁使用的一種方法。

print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)

plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

3.9 Random Forest Embedding

print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0,
                                       max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)

plot_embedding(X_reduced,
               "Random forest embedding of the digits (time %.2fs)" %
               (time() - t0))

image.png

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

高維數(shù)據(jù)的降維和2維可視化方法

高維數(shù)據(jù)的降維和2維可視化方法

開(kāi)始

目錄

1. 生成數(shù)據(jù)

2. 關(guān)注線性屬性的降維

3. 考慮非線性成分的降維

1. 生成數(shù)據(jù)

2. 著眼于線性成分的降維

3. 考慮非線性成分的降維

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

高維數(shù)據(jù)的降維和2維可視化方法

開(kāi)始

目錄

1. 生成數(shù)據(jù)

2. 關(guān)注線性屬性的降維

3. 考慮非線性成分的降維

1. 生成數(shù)據(jù)

2. 著眼于線性成分的降維

3. 考慮非線性成分的降維

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av