Starting with t-SNE, the Word2vec visualization algorithm

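I have been reading a lot of word2vec articles lately, and near the end the authors often mention visualizing the results with t-SNE, that is, reducing the high-dimensional data and plotting it. I wondered why they did not use PCA or LDA, and digging into that question led me to a whole area of visualization algorithms I had not known about.

Dimensionality reduction, as everyone knows, means lowering the number of feature dimensions while trying to keep the useful information.

Mention dimensionality reduction and most people think of PCA (Principal Components Analysis).

Some people also think of LDA (Linear Discriminant Analysis).

Fewer know that the methods are usually divided into linear and non-linear dimensionality reduction: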

1. Linear dimensionality reduction:

PCA (Principal Components Analysis)

LDA (Linear Discriminant Analysis)

MDS (Classical Multidimensional Scaling)

2. Non-linear dimensionality reduction:

Isomap (Isometric Mapping)

LLE (Locally Linear Embedding)

LE (Laplacian Eigenmaps)

t-SNE (t-Distributed Stochastic Neighbor Embedding)

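Most readers are probably familiar with some of the linear methods but not with the non-linear ones. Most non-linear methods belong to manifold learning, and this article uses t-SNE as a way into that topic.

t-SNE is already available as a module in Scikit-learn, where it is mainly used to visualize and understand high-dimensional data. This article covers the core idea of the algorithm and shows, with a few examples, how to call t-SNE from Scikit-learn.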

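t-SNE implementation in sklearn

1) What is t-SNE?

Taken literally, t-SNE is SNE combined with the t-distribution, so the questions become what SNE is and why it is combined with the t-distribution.

SNE: stochastic neighbor embedding, a visualization algorithm proposed by Hinton in 2002. Points that are similar in the high-dimensional space are mapped to points that are also close in the low-dimensional space.

t-SNE: in 2008 van der Maaten and Hinton improved SNE using the t-distribution, addressing one of SNE's thorniest problems, the crowding problem, in which the boundaries between classes are not clear. t-SNE has drawbacks of its own; for example, it struggles when visualizing large datasets.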
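One way to see why the t-distribution helps with crowding is to compare how fast the two low-dimensional similarity kernels decay with distance. The sketch below is only an illustration: it evaluates the unnormalized Gaussian kernel used by SNE in the map space and the Student-t kernel (one degree of freedom) used by t-SNE. The function names are made up for this example and do not come from any library.

```python
import numpy as np


def gaussian_kernel(d):
    """Unnormalized low-dimensional similarity used by SNE: exp(-d^2)."""
    return np.exp(-d ** 2)


def student_t_kernel(d):
    """Unnormalized low-dimensional similarity used by t-SNE: 1 / (1 + d^2)."""
    return 1.0 / (1.0 + d ** 2)


# Compare both kernels at a few distances in the low-dimensional map.
distances = np.array([0.5, 1.0, 2.0, 4.0])
for d, g, t in zip(distances, gaussian_kernel(distances), student_t_kernel(distances)):
    print(f"d={d:.1f}  gaussian={g:.6f}  student_t={t:.6f}")
```

The Student-t kernel stays much larger at big distances, so moderately dissimilar points can sit far apart in the map without a large penalty, which leaves room between clusters.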

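2) First, the Iris dataset is used to illustrate linear (PCA) versus non-linear (t-SNE) visualization:

As the figure below shows, with only three classes both t-SNE and PCA separate the classes well in the plot.

Three-class visualization comparison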

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
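The article does not list the code behind the Iris comparison, so here is a minimal sketch of how the two embeddings might be computed with the imports above. The choice of perplexity=30 and random_state=0 is an assumption made for reproducibility, not a setting taken from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 classes

# Linear projection onto the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear embedding; random_state fixes the otherwise random result.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```

Each result has one row per flower and two columns, so either one can be passed to plt.scatter with c=y in the same way as the 20 newsgroups plots later in this article.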

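Note that in sklearn PCA lives in the decomposition module while t-SNE lives in the manifold module. So what does the manifold module (manifold learning) do? First we need to understand what manifold learning is.

Manifold learning combines topology with machine learning and can be used to reduce the dimensionality of high-dimensional data. If the data is reduced to 2 or 3 dimensions, the original data can be plotted, giving an intuitive view of its distribution and revealing patterns that may exist.

Manifold learning assumes that some high-dimensional data is really a low-dimensional manifold embedded in a high-dimensional space. The goal is to map it back to the low-dimensional space and reveal its essential structure. In the figure below, the right panel shows the left panel after dimensionality reduction, which is much easier to read. (Image source: Zhihu)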

[Figure: a manifold in the original space (left) and after dimensionality reduction (right)]

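Manifold learning is an approach to non-linear dimensionality reduction. Traditional linear methods (PCA, LDA) often fail to capture important non-linear features of the data. In the words of the official documentation, it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications. The main manifold learning algorithms are shown in the figure below:

Main manifold learning algorithms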
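To make the manifold assumption concrete, the sketch below uses the synthetic swiss roll, a 2-D sheet rolled up in 3-D. Neither the dataset nor Isomap is used in the original article; they are chosen here only because the position along the sheet is known, so we can check which method recovers it. The helper best_abs_corr is a name made up for this example.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# X is 3-D; t is each point's position along the rolled-up sheet.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Linear projection versus a manifold learner that follows the sheet.
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)


def best_abs_corr(embedding, t):
    """Largest absolute correlation between any embedding axis and t."""
    return max(
        abs(np.corrcoef(embedding[:, k], t)[0, 1])
        for k in range(embedding.shape[1])
    )


print("PCA   :", best_abs_corr(X_pca, t))
print("Isomap:", best_abs_corr(X_iso, t))
```

A value close to 1 suggests that one axis of the embedding tracks the position along the sheet, which is what "unrolling" the manifold means here.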

3) Comparing the PCA and t-SNE visualizations when the dimensionality is high and the data is sparse

We use the 20newsgroups dataset available through sklearn. As the official documentation describes it:

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation).

The subset parameter selects the training set, the test set, or all of the data:
subset='train'
subset='test'
subset='all'

Select all the data, restrict it to articles from four topics, and use TfidfVectorizer to turn the text into feature vectors.

categories = ['alt.atheism','talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(subset="all", categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups.data)

Looking at the experimental data, we find that it is very sparse: only 0.42% of the entries are nonzero.

Number of samples: 2588

print(newsgroups.filenames.shape)

Vector dimensionality: 38137

print(vectors.shape)

Number of nonzero entries: 406180

print(vectors.nnz)

Average number of nonzero features per sample: 156

print(vectors.nnz / float(vectors.shape[0]))

Fraction of nonzero features, used to judge how sparse the data is: 0.42%, which is very sparse

print(vectors.nnz / float(vectors.shape[0]) / vectors.shape[1])
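The same check can be tried without downloading 20newsgroups. The sketch below wraps the calculation in a small helper and runs it on a tiny made-up corpus; both the helper name density and the three sentences are assumptions for illustration, not part of the original experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def density(matrix):
    """Fraction of entries of a SciPy sparse matrix that are stored as nonzero."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Tiny made-up corpus, only to keep the example self-contained.
docs = [
    "t-SNE embeds high dimensional data",
    "PCA is a linear projection",
    "sparse text vectors have many zeros",
]
vectors = TfidfVectorizer().fit_transform(docs)

print(vectors.shape)                   # (documents, vocabulary size)
print(vectors.nnz / vectors.shape[0])  # average nonzero features per document
print(f"{density(vectors):.2%}")       # share of nonzero entries
```

Reading the vocabulary size from vectors.shape[1] instead of typing it in keeps the calculation correct if the categories or the vectorizer settings change.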

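The output of help(TSNE) recommends first reducing the data to about 50 dimensions, using PCA for dense data or TruncatedSVD for sparse data. Since the data here is very sparse, we choose TruncatedSVD to reduce to 50 dimensions first. The resulting t-SNE visualization is clearly better than the PCA one.

t-SNE visualization: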

reduced = TruncatedSVD(n_components=50).fit_transform(vectors)
embedded = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

fig = plt.figure(figsize=(15, 5))
plt.subplot(131)
plt.title('The 20 newsgroups dataset  with TSNE')
plt.scatter(embedded[:, 0], embedded[:, 1],c=newsgroups.target, marker="x")
![image.png](https://upload-images.jianshu.io/upload_images/1676597-8216faad09544183.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

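PCA visualization (the sparse matrix must first be converted to a dense array; vectors.toarray() is used here because recent scikit-learn versions no longer accept the np.matrix returned by vectors.todense())

reduced_with_PCA = PCA(n_components=2).fit_transform(vectors.toarray())
plt.subplot(133)
plt.title('The 20 newsgroups dataset with PCA')
plt.scatter(reduced_with_PCA[:, 0], reduced_with_PCA[:, 1],
            c=newsgroups.target, marker="x")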
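Judging the two plots by eye can be backed up with a number. The sketch below uses sklearn.manifold.trustworthiness, which measures how well local neighbourhoods are preserved in the embedding. It is not the author's evaluation: the bundled digits dataset stands in for 20 newsgroups so that nothing has to be downloaded, and n_neighbors=5 and random_state=0 are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# 64-dimensional digit images; a subset keeps the run short.
X = load_digits().data[:500]

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Trustworthiness lies in [0, 1]; higher means neighbours in the 2-D map
# were also neighbours in the original space.
score_pca = trustworthiness(X, X_pca, n_neighbors=5)
score_tsne = trustworthiness(X, X_tsne, n_neighbors=5)
print(f"PCA:   {score_pca:.3f}")
print(f"t-SNE: {score_tsne:.3f}")
```

The same two calls can be applied to the 20 newsgroups embeddings by passing the 50-dimensional reduced matrix as the first argument.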



import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Select all the data, restricted to articles from four topics
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = fetch_20newsgroups(subset="all", categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups.data)
# Number of samples: 2588
print(newsgroups.filenames.shape)
# Vector dimensionality: 38137
print(vectors.shape)
# Number of nonzero entries: 406180
print(vectors.nnz)
# Average number of nonzero features per sample: 156
print(vectors.nnz / float(vectors.shape[0]))
# Fraction of nonzero features: 0.42%, which is very sparse
print(vectors.nnz / float(vectors.shape[0]) / vectors.shape[1])

# The data is very sparse, so following help(TSNE), reduce to 50 dimensions with TruncatedSVD first
reduced = TruncatedSVD(n_components=50).fit_transform(vectors)
# Dense alternative: reduced = PCA(n_components=50).fit_transform(vectors.toarray())

# Feed the reduced data to t-SNE
embedded = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

# Visualize the three embeddings side by side
fig = plt.figure(figsize=(15, 5))

plt.subplot(131)
plt.title('The 20 newsgroups dataset with TSNE')
plt.scatter(embedded[:, 0], embedded[:, 1], c=newsgroups.target, marker="x")

reduced_with_TruncatedSVD = TruncatedSVD(n_components=2).fit_transform(vectors)
plt.subplot(132)
plt.title('The 20 newsgroups dataset with TruncatedSVD')
plt.scatter(reduced_with_TruncatedSVD[:, 0], reduced_with_TruncatedSVD[:, 1],
            c=newsgroups.target, marker="x")

# PCA needs a dense array; toarray() returns an ndarray rather than np.matrix
reduced_with_PCA = PCA(n_components=2).fit_transform(vectors.toarray())
plt.subplot(133)
plt.title('The 20 newsgroups dataset with PCA')
plt.scatter(reduced_with_PCA[:, 0], reduced_with_PCA[:, 1],
            c=newsgroups.target, marker="x")

plt.show()

References:
From SNE to t-SNE to LargeVis
http://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis?utm_source=tuicool&utm_medium=referral
http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/AlexanderFabisch/1a0c648de22eff4a2a3e/raw/59d5bc5ed8f8bfd9ff1f7faa749d1b095aa97d5a/t-SNE.ipynb
