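On the t-SNE Dimensionality Reduction Method

The paper Differential Dynamics of the Maternal Immune System in Healthy Pregnancy and Preeclampsia uses t-SNE dimensionality reduction for visualization.

The original figure from the paper looks like this: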

image.png

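1. What is t-SNE

The full name is t-distributed Stochastic Neighbor Embedding (t-SNE), that is, stochastic neighbor embedding based on the Student's t-distribution.

t-SNE converts the similarities between data points into probabilities. Similarities in the original space are represented by Gaussian joint probabilities, and similarities in the embedded space are represented by a Student's t-distribution. Among dimensionality-reduction methods used for visualization, t-SNE performs comparatively well because it focuses mainly on the local structure of the data.

The quality of the visualization is measured by the Kullback-Leibler (KL) divergence between the joint probabilities of the original space and those of the embedded space. In other words, the KL divergence serves as the loss function, and gradient descent minimizes it until the embedding converges.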
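To make the loss concrete, here is a minimal sketch of the KL divergence between two discrete distributions. The toy values in `p`, `q_good` and `q_bad` are made up for illustration; in t-SNE the entries would be the pairwise similarities of the original and embedded spaces.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_k, q_k in zip(p, q):
        if p_k > 0.0:  # terms with p_k == 0 contribute nothing
            total += p_k * math.log(p_k / max(q_k, eps))
    return total

# Toy distributions (illustrative values, not taken from any dataset).
p = [0.5, 0.3, 0.2]        # similarities in the original space
q_good = [0.5, 0.3, 0.2]   # an embedding that reproduces P exactly
q_bad = [0.2, 0.3, 0.5]    # an embedding that swaps near and far pairs

loss_good = kl_divergence(p, q_good)
loss_bad = kl_divergence(p, q_bad)
print(loss_good, loss_bad)
```

The loss is zero only when the embedded similarities match the original ones, and it grows as the two distributions drift apart, which is what gradient descent exploits.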

More formally:

Given a set of $N$ points $x_1, \cdots, x_N \in \mathbb{R}^d$, t-SNE first computes the similarity $p_{ij}$ between $x_i$ and $x_j$. This similarity is defined as:

$p_{ij} = (p_{i \mid j} + p_{j \mid i})/(2N)$

For each $i$, $p_{j \mid i} \propto \exp(-\|x_i-x_j\|^2/(2\sigma_i^2))$. This is a Gaussian kernel with a single parameter $\sigma_i$.

Intuitively, the value $p_{ij}$ measures the similarity between the points $x_i$ and $x_j$. t-SNE then learns lower-dimensional points $y_1, \cdots, y_N \in \mathbb{R}^2$ whose similarities are defined as $q_{ij} \propto (1+\|y_i-y_j \|_2^2)^{-1}$, and chooses them so that the Kullback-Leibler divergence of the distribution $\{q_{ij}\}$ from the distribution $\{p_{ij}\}$ is minimized.
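The definitions above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: one shared `sigma` is used for every point (t-SNE tunes $\sigma_i$ per point from the perplexity), and the `points` and `embedding` coordinates are made-up toy values.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_p(points, sigma):
    """cond[i][j] = p_{j|i}: Gaussian kernel around x_i, normalized over j != i."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [0.0 if j == i
                   else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
                   for j in range(n)]
        total = sum(weights)
        cond.append([w / total for w in weights])
    return cond

def joint_p(cond):
    """p_ij = (p_{i|j} + p_{j|i}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

def joint_q(embedding):
    """q_ij proportional to (1 + ||y_i - y_j||^2)^-1, normalized over all pairs."""
    n = len(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

# Toy data: two tight pairs of points that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
embedding = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]

P = joint_p(conditional_p(points, sigma=1.0))
Q = joint_q(embedding)
print(P[0][1], P[0][2])  # the close pair gets far more probability mass
```

Both `P` and `Q` sum to 1 over all pairs, so they can be compared directly with the KL divergence.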

Here is a brief outline of the t-SNE algorithm:

image.png

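For more on t-SNE, see the paper Visualizing Data using t-SNE.

2. Problems with t-SNE

t-SNE is computationally expensive: on a dataset with millions of samples it can take several hours, while PCA finishes in seconds or minutes.
In practice it is limited to two- or three-dimensional embeddings.
The algorithm is stochastic, so runs with different seeds can produce different results. Choosing the run with the lowest loss is fine, but several runs may be needed to select the hyperparameters.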

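3. t-SNE parameters

This section lists the t-SNE parameters exposed by TensorFlow (through the TensorBoard projector). There are actually many parameters, but TensorFlow handles most of them automatically, so only the following need attention:

Dimension: whether the output is a two-dimensional or a three-dimensional space.
Perplexity: this says how to balance attention between the local and global aspects of the data. In effect it is a guess about how many close neighbors each point has. This guess can strongly affect how complex the final picture looks, and adjusting the perplexity produces quite different plots. The original paper (Visualizing Data using t-SNE) states: "The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50". In other words, the method is fairly robust, and tuning within 5 to 50 is usually enough.
Learning Rate: the step size of the gradient updates while the t-SNE algorithm runs. It can be adjusted freely, but it should follow the sample size: with fewer samples, use a smaller learning rate.
Supervise: this controls how much weight the labels get. The input data comes in (data, label) pairs, and the parameter ranges from 0 (labels are ignored) to 10 (labels are fully used). For details, see the IBM article Interactive supervision with TensorBoard.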
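To illustrate what the perplexity setting means, the sketch below computes the perplexity $2^{H}$ of one point's conditional distribution and searches for the $\sigma_i$ that reaches a target value. The function names, the search bounds, and the toy distances are assumptions made for this example; real implementations differ in detail.

```python
import math

def perplexity(cond_probs):
    """Perplexity 2^H of a conditional distribution p_{j|i}."""
    entropy = -sum(p * math.log2(p) for p in cond_probs if p > 0.0)
    return 2.0 ** entropy

def cond_probs_for_sigma(sq_dists, sigma):
    """Gaussian p_{j|i} for the squared distances from point i to its neighbors."""
    d_min = min(sq_dists)  # shifting by the minimum avoids underflow to all zeros
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisect on a log scale; perplexity grows as sigma grows."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(cond_probs_for_sigma(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

# A uniform distribution over 4 neighbors has perplexity 4.
uniform_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])

# Toy squared distances from one point to its 8 neighbors (illustrative values).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
sigma = find_sigma(sq_dists, target=3.0)
achieved = perplexity(cond_probs_for_sigma(sq_dists, sigma))
print(uniform_perplexity, sigma, achieved)
```

A perplexity of k behaves like "each point has roughly k effective neighbors", which is why the value acts as a neighborhood-size guess.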

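4. Experiment tool: TensorBoard

TensorFlow is an open-source software library and one of the most widely used basic tools in machine learning, deep learning, and reinforcement learning. One of its advantages is automatic differentiation, which makes it easy to build and train neural networks.

TensorBoard is a tool for visualizing the training process of TensorFlow models (the flow of tensors). It is installed together with TensorFlow.

It has many components and features; see the structure below: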

image.png

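We mainly use only the EMBEDDINGS feature.

5. Installing TensorFlow

Install Python 3.7.2
Install tensorflow 1.13.1 with pip (pip install tensorflow==1.13.1)

6. Experiment setup

This experiment only covers the basics: it uses MNIST to look at the result of t-SNE, with just 1,000 handwritten digit images.

MNIST is a simple computer vision dataset. It consists of images of handwritten digits like the ones below: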

image.png

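The MNIST data is hosted on Yann LeCun's website.

The downloaded data has two parts: 60,000 training examples and 10,000 test examples (mnist.test). The input_data loader used here additionally sets aside 5,000 of the training examples for validation by default, so mnist.train holds 55,000 examples. This split is important: in machine learning we hold out data that is never used for training, so that we can check that what was learned actually generalizes.

Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We call the images "xs" and the labels "ys". Both the training set and the test set contain xs and ys. Taking the training set as an example, the training images are mnist.train.images and the training labels are mnist.train.labels.

Each image is 28x28 pixels, and we can interpret it as a large array of numbers.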

image.png

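This array can be flattened into a vector of 28x28 = 784 numbers. How the array is flattened does not matter, as long as it is done the same way for all images. From this point of view, each MNIST image is just a point in a 784-dimensional space.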
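A small sketch of why the flattening order does not matter: reordering the pixels the same way for every image is just a permutation of coordinates, so distances between images are unchanged. The two random "images" below are stand-ins generated for the example, not MNIST data.

```python
import random

def flatten_row_major(img):
    """Read the pixels row by row."""
    return [px for row in img for px in row]

def flatten_col_major(img):
    """Read the pixels column by column."""
    return [img[r][c] for c in range(len(img[0])) for r in range(len(img))]

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened images."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

rng = random.Random(0)
# Two random 28x28 arrays with values in [0, 1) standing in for images.
img_a = [[rng.random() for _ in range(28)] for _ in range(28)]
img_b = [[rng.random() for _ in range(28)] for _ in range(28)]

vec_a = flatten_row_major(img_a)
dist_row = sq_dist(flatten_row_major(img_a), flatten_row_major(img_b))
dist_col = sq_dist(flatten_col_major(img_a), flatten_col_major(img_b))
print(len(vec_a), dist_row, dist_col)
```

Since t-SNE only looks at pairwise distances, either ordering leads to the same similarities, provided all images use the same one.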

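We can view the full set of MNIST training images as a tensor (an n-dimensional array) of shape [60000, 784]; mnist.train.images as loaded here has shape [55000, 784], because the loader holds out 5,000 images for validation by default. The first dimension indexes the images and the second dimension indexes the pixels of each image. Each entry is the intensity of one pixel in one image, a number between 0 and 1.

The MNIST labels are the digits 0 to 9 and describe which digit a given image shows. For this tutorial we want the labels as "one-hot vectors". A one-hot vector is 0 in all dimensions except one, where it is 1. In our case, the digit n is represented by a vector whose n-th position is 1; for example, 0 is represented as [1,0,0,0,0,0,0,0,0,0]. Accordingly, mnist.train.labels is a float array of shape [55000, 10] (one row per training image kept by the loader after its default 5,000-image validation split).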
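As a small illustration of the label format, the sketch below converts between an integer digit and its one-hot vector. The helper names `one_hot` and `from_one_hot` are made up for this example.

```python
def one_hot(label, num_classes=10):
    """Return a vector that is 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def from_one_hot(vec):
    """Recover the integer label: the index of the largest entry."""
    return max(range(len(vec)), key=vec.__getitem__)

zero = one_hot(0)
three = one_hot(3)
print(zero)
print(three, from_one_hot(three))
```

The script in section 7 loads the data twice for this reason: with one_hot=True for training against the softmax output, and with one_hot=False to write plain integer labels into the projector metadata.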

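7. Experiment code

The code for the experiment is given below.

The network architecture is as follows. The input image enters a first convolutional layer with 32 filters of shape [5,5], followed by max pooling with stride 2. It then enters a second convolutional layer with 64 filters of shape [5,5], again followed by max pooling with stride 2. The resulting tensor is flattened into a one-dimensional vector and passed through a fully connected layer with 1024 units (these activations are what is later visualized as the embedding). A final softmax layer outputs the probabilities of the digits 0-9.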
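The tensor shapes through this architecture can be traced with simple arithmetic. The sketch below assumes 'SAME' padding for the convolutions (as in the script) and 'VALID' padding for the 2x2 pooling, which is my understanding of the slim.max_pool2d default; the helper names are made up for this example.

```python
import math

def conv2d_same(shape, num_filters, stride=1):
    """'SAME' padding keeps height/width at ceil(size / stride)."""
    h, w, _ = shape
    return (math.ceil(h / stride), math.ceil(w / stride), num_filters)

def max_pool_valid(shape, kernel=2, stride=2):
    """'VALID' pooling: floor((size - kernel) / stride) + 1 per spatial axis."""
    h, w, c = shape
    return ((h - kernel) // stride + 1, (w - kernel) // stride + 1, c)

def trace_shapes(input_shape=(28, 28, 1)):
    """Per-example shapes after each layer of the network described above."""
    shapes = [("input", input_shape)]
    s = conv2d_same(input_shape, 32)
    shapes.append(("conv1", s))
    s = max_pool_valid(s)
    shapes.append(("maxpool1", s))
    s = conv2d_same(s, 64)
    shapes.append(("conv2", s))
    s = max_pool_valid(s)
    shapes.append(("maxpool2", s))
    shapes.append(("flatten", (s[0] * s[1] * s[2],)))
    shapes.append(("fc1", (1024,)))   # the layer visualized as the embedding
    shapes.append(("fc2", (10,)))     # one logit per digit
    return shapes

layer_shapes = trace_shapes()
for name, shape in layer_shapes:
    print(name, shape)
```

So each image is mapped from 784 pixels to a 1024-dimensional fc1 vector, and it is this 1024-dimensional space that t-SNE reduces to two or three dimensions.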

Cross-entropy is used as the loss function, and Adam is used as the optimization algorithm.
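For reference, this is what the softmax cross-entropy loss computes for a single example, written in plain Python. It is a sketch of the formula only, not of TensorFlow's implementation, and the logits are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """-sum_k t_k * log(softmax(logits)_k) for a one-hot target t."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0.0)

label = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # the digit 3

untrained = [0.0] * 10                                        # all classes equally likely
confident = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss_untrained = cross_entropy(untrained, label)  # equals ln(10) for uniform predictions
loss_confident = cross_entropy(confident, label)
print(loss_untrained, loss_confident)
```

A network that guesses uniformly over the 10 digits starts at a loss of about ln(10), so a printed loss near that value early in training is expected.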

# Import the required libraries
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import os
import time
from tensorflow.contrib.tensorboard.plugins import projector
import matplotlib.pyplot as plt
import numpy as np
# Use the slim API to build the convolutional network
slim = tf.contrib.slim

# Define the convolutional neural network model
# Architecture: conv -> max-pool -> conv -> max-pool -> flatten -> fully connected (fc1) -> dropout -> fully connected (logits)
def model(inputs, is_training, dropout_rate, num_classes, scope='Net'):
    inputs = tf.reshape(inputs, [-1, 28, 28, 1])
    with tf.variable_scope(scope):
        with slim.arg_scope([slim.conv2d, slim.fully_connected],
                            normalizer_fn=slim.batch_norm):
            net = slim.conv2d(inputs, 32, [5, 5], padding='SAME', scope='conv1')
            net = slim.max_pool2d(net, 2, stride=2, scope='maxpool1')
            tf.summary.histogram("conv1", net)

            net = slim.conv2d(net, 64, [5, 5], padding='SAME', scope='conv2')
            net = slim.max_pool2d(net, 2, stride=2, scope='maxpool2')
            tf.summary.histogram("conv2", net)

            net = slim.flatten(net, scope='flatten')
            fc1 = slim.fully_connected(net, 1024, scope='fc1')
            tf.summary.histogram("fc1", fc1)

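            # The second argument of slim.dropout is the keep probability
            net = slim.dropout(fc1, dropout_rate, is_training=is_training, scope='fc1-dropout')
            # Output layer produces raw logits: no ReLU and no batch norm,
            # because the loss applies softmax itself
            net = slim.fully_connected(net, num_classes, activation_fn=None,
                                       normalizer_fn=None, scope='fc2')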

            return net, fc1


def create_sprite_image(images):
    """Tile a batch of images (N, H, W) into one square sprite image."""
    if isinstance(images, list):
        images = np.array(images)
    img_h = images.shape[1]
    img_w = images.shape[2]
    n_plots = int(np.ceil(np.sqrt(images.shape[0])))

    sprite_image = np.ones((img_h * n_plots, img_w * n_plots))

    for i in range(n_plots):
        for j in range(n_plots):
            this_filter = i * n_plots + j
            if this_filter < images.shape[0]:
                this_img = images[this_filter]
                sprite_image[i * img_h:(i + 1) * img_h,
                j * img_w:(j + 1) * img_w] = this_img

    return sprite_image


def vector_to_matrix_mnist(mnist_digits):
    """Reshape flattened MNIST digits (batch, 28*28) into (batch, 28, 28)."""
    return np.reshape(mnist_digits, (-1, 28, 28))


def invert_grayscale(mnist_digits):
    """Invert the grayscale: black becomes white and white becomes black."""
    return 1 - mnist_digits


if __name__ == "__main__":
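    # Parameters
    # Learning rate
    learning_rate = 1e-4
    # Number of training iterations (each one processes a single mini-batch)
    total_epoch = 500
    # Mini-batch size
    batch_size = 200
    # How often (in iterations) to print progress
    display_step = 20
    # How often (in iterations) to save results
    save_step = 100
    load_checkpoint = False
    checkpoint_dir = "checkpoint"
    checkpoint_name = 'model.ckpt'
    # Directory where the results are stored
    logs_path = "logs"
    # Number of test images to embed
    test_size = 1000
    # Sub-directory (inside logs_path) for the projector files
    projector_path = 'projector'

    # Network parameters
    n_input = 28 * 28   # each image has 28*28 pixels, i.e. 784 features
    n_classes = 10  # MNIST has 10 classes (digits 0-9)
    dropout_rate = 0.5  # passed to slim.dropout as the keep probability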

    mnist = input_data.read_data_sets('MNIST-data', one_hot=True)

    # Define the computation graph
    x = tf.placeholder(tf.float32, [None, n_input], name='InputData')
    y = tf.placeholder(tf.float32, [None, n_classes], name='LabelData')
    is_training = tf.placeholder(tf.bool, name='IsTraining')
    keep_prob = dropout_rate

    logits, fc1 = model(x, is_training, keep_prob, n_classes)

    with tf.name_scope('Loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
    tf.summary.scalar("loss", loss)

    with tf.name_scope('Accuracy'):
        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar("accuracy", accuracy)

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

    projector_dir = os.path.join(logs_path, projector_path)
    path_metadata = os.path.join(projector_dir,'metadata.tsv')
    path_sprites = os.path.join(projector_dir, 'mnistdigits.png')
    # Create the output directory if it does not exist
    if not os.path.exists(projector_dir):
        os.makedirs(projector_dir)

    # Set up the embedding: fc1 activations of the first test_size test images
    mnist_test = input_data.read_data_sets('MNIST-data', one_hot=False)
    batch_x_test = mnist_test.test.images[:test_size]
    batch_y_test = mnist_test.test.labels[:test_size]

    embedding_var = tf.Variable(tf.zeros([test_size, 1024]), name='embedding')
    assignment = embedding_var.assign(fc1)

    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embedding_var.name
    embedding.metadata_path = os.path.join(projector_path,'metadata.tsv')
    embedding.sprite.image_path = os.path.join(projector_path, 'mnistdigits.png')
    embedding.sprite.single_image_dim.extend([28,28])

    # Initialize the variables
    init = tf.global_variables_initializer()

    # 'Saver' op to save and restore all the variables
    saver = tf.train.Saver()
    merged_summary_op = tf.summary.merge_all()

    # Run the computation graph
    with tf.Session() as sess:
        sess.run(init)
        # Restore model weights from previously saved model
        prev_model = tf.train.get_checkpoint_state(checkpoint_dir)
        if load_checkpoint:
            if prev_model:
                saver.restore(sess, prev_model.model_checkpoint_path)
                print('Checkpoint found, {}'.format(prev_model))
            else:
                print('No checkpoint found')

        summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
        projector.visualize_embeddings(summary_writer, config)
        start_time = time.time()
        # Start training
        for epoch in range(total_epoch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # reshapeX = np.reshape(batch_x, [-1, 28, 28, 1])
            # Run one optimization step (forward and backward pass)
            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y,
                                           is_training: True})
            if epoch % display_step == 0:
                # Compute loss and accuracy on the current batch
                cost, acc, summary = sess.run([loss, accuracy, merged_summary_op],
                                              feed_dict={x: batch_x,
                                                         y: batch_y,
                                                         is_training: False})
                elapsed_time = time.time() - start_time
                start_time = time.time()
                print('epoch {}, training accuracy: {:.4f}, loss: {:.5f}, time: {}'
                      .format(epoch, acc, cost, elapsed_time))
                summary_writer.add_summary(summary, epoch)
            if epoch % save_step == 0:
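                # Store the fc1 activations of the test images in the embedding variable
                sess.run(assignment, feed_dict={x: mnist.test.images[:test_size],
                                                y: mnist.test.labels[:test_size], is_training: False})
                # Make sure the checkpoint directory exists before saving
                os.makedirs(checkpoint_dir, exist_ok=True)
                checkpoint_path = os.path.join(checkpoint_dir, checkpoint_name)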
                save_path = saver.save(sess, checkpoint_path)
                print("Model saved in file: {}".format(save_path))

        # Save the final checkpoint, which holds the embedding variable
        saver.save(sess, os.path.join(logs_path, "model.ckpt"), 1)
        # Build the sprite image shown as thumbnails in the projector
        to_visualise = batch_x_test
        to_visualise = vector_to_matrix_mnist(to_visualise)
        to_visualise = invert_grayscale(to_visualise)
        sprite_image = create_sprite_image(to_visualise)
        # Save the sprite image
        plt.imsave(path_sprites, sprite_image, cmap='gray')
        # Write the metadata file (index and label of each embedded image)
        with open(path_metadata, 'w') as f:
            f.write("Index\tLabel\n")
            for index, label in enumerate(batch_y_test):
                f.write("%d\t%d\n" % (index, label))

        print("Training finished")
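The sprite image built by create_sprite_image is a square grid of thumbnails. The sketch below reproduces only the layout arithmetic in plain Python, to show where each image ends up; the helper names `sprite_layout` and `tile_origin` are made up for this example.

```python
import math

def sprite_layout(num_images, img_h=28, img_w=28):
    """Grid size (tiles per side) and pixel size of the square sprite image."""
    n_plots = int(math.ceil(math.sqrt(num_images)))
    return n_plots, (img_h * n_plots, img_w * n_plots)

def tile_origin(index, n_plots, img_h=28, img_w=28):
    """Top-left pixel (row, col) of the tile that holds image `index`."""
    row, col = divmod(index, n_plots)
    return row * img_h, col * img_w

n_plots, sprite_size = sprite_layout(1000)
unused_tiles = n_plots * n_plots - 1000

print(n_plots, sprite_size)        # 32 (896, 896)
print(tile_origin(0, n_plots))     # (0, 0)
print(tile_origin(33, n_plots))    # (28, 28): second row, second column
print(unused_tiles)                # 24 tiles stay blank
```

The tiles are filled row by row in the same order as the rows of metadata.tsv, so the i-th thumbnail and the i-th label both describe the i-th embedded point.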

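8. Experimental results

After the training code above finishes, the results are saved in the logs directory defined in the code. Go to the parent directory of logs and start TensorBoard from the command prompt:

$ tensorboard --logdir logs

The startup looks like this: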

image.png

Then open TensorBoard in the Chrome browser:

http://localhost:6006

The t-SNE result can be seen under PROJECTOR:

image.png

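The result shows that the 1,000 images are indeed separated by digit in three-dimensional space. Perplexity, learning rate, and supervise were all set arbitrarily and could be tuned more carefully. Only 184 iterations were run here, because the result already looked good at that point; the more iterations are run, the farther apart the clusters drift. The separation is very good, but the plot becomes less pleasant to look at.

Here is the visualization in two-dimensional space:

image.png

With the same parameters, only 182 iterations were run, and the result is also good.

Finally, here is the result of PCA dimensionality reduction. It looks quite different from t-SNE. I did not investigate this in depth; a likely reason is that t-SNE is an iterative, nonlinear optimization that keeps updating the layout, whereas PCA is a fixed linear projection computed once. Note that in the three-dimensional PCA view, the components to display have to be selected manually.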

image.png
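To make the contrast with t-SNE concrete, here is a minimal sketch of PCA as a one-shot linear projection: it finds the direction of largest variance with power iteration and projects the centered data onto it. The toy data and the function names are made up for this example, and power iteration is just one simple way to obtain the top component.

```python
import math

def top_principal_component(data, iters=200):
    """Return (unit direction of largest variance, centered data)."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector for power iteration
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered

def project(centered, direction):
    """Coordinates of each centered point along the given direction."""
    return [sum(x * c for x, c in zip(row, direction)) for row in centered]

# Toy 2-D points lying on the line y = 2x (illustrative data).
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, centered = top_principal_component(data)
coords = project(centered, direction)
print(direction)
print(coords)
```

The projection is the same linear map for every point and does not depend on iteration counts or random seeds, which is one reason a PCA view looks static next to the evolving t-SNE layout.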
