Summary: Training Deep Learning Models on GPUs with 极客云

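Graphics processing units (GPUs) can significantly speed up training for many deep learning models. Training models for tasks such as image classification, video analysis, and natural language processing involves compute-intensive matrix multiplications and other operations that can exploit the massively parallel architecture of a GPU.

A deep learning model that performs intensive computation over a very large dataset may take days to train on a single processor. Offloading that work to one or more GPUs can cut training time from days to hours.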

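Good GPUs are expensive, though, and an ordinary student (like me) can rarely afford one. After a lot of searching online I found a cloud server that is easy to use and good value for money: 极客云. Compared with cloud providers such as Alibaba Cloud (阿里云) and Tencent Cloud (腾讯云), it is cheaper and more specialized. The slogan on the 极客云 website is "more than 3 times cheaper for the same computing power", which gives you a sense of its pricing. The biggest convenience of 极客云 is that its servers come with many computing frameworks preinstalled. You can focus on deep learning itself without installing any environment and get started with zero setup (a huge relief for someone like me, who always runs into all sorts of baffling problems when installing software and configuring environments). A few simple steps are enough to start testing and training deep learning models.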

[Screenshot: some of the GPU options offered by 极客云]

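How to use it:

1. Register (the usual process).

2. Create a GPU instance. You can choose which GPU model you want for the host. I created a GTX 1080Ti virtual host, which was available as a limited-time free trial with no queue. You also need to pick the preinstalled framework (tensorflow, caffe-gpu, fastai, etc.), the framework version, and the Python version to suit your needs.

3. Click the "创建" (Create) button and wait a moment. You are returned to the "我的云主机" (My Cloud Hosts) page, where the new host now appears in the list. The list entry includes a Jupyter Notebook link. Click it to open Jupyter Notebook, which is very convenient to use.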
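
Once the notebook is open, it can be worth confirming that the host really sees a GPU before starting a long run. The following is a minimal sketch of such a check, not something 极客云 documents: it assumes the NVIDIA driver's nvidia-smi tool is on the PATH, and the helper name list_gpus is my own.

```python
import shutil
import subprocess
import sys


def list_gpus():
    """Return one line per visible NVIDIA GPU, or [] if none can be queried."""
    # Assumption: the host exposes the NVIDIA driver's nvidia-smi tool on PATH.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


# Report the interpreter version chosen at instance creation, then the GPUs.
print("Python:", sys.version.split()[0])
gpus = list_gpus()
print("GPUs:", gpus if gpus else "none detected")
```

An empty list only means nvidia-smi could not be queried from this process; it does not by itself prove that the framework cannot use the GPU.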

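Each 极客云 host ships with a Python script for handwritten digit recognition named mnist_deep.py, located in the root folder of Jupyter Notebook. You can copy and paste it into a new Python file to get a feel for the GPU's speed. I reproduce the provided mnist_deep.py here so the code is easy to inspect: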

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""A deep MNIST classifier using convolutional layers.

See extensive documentation at
https://www.tensorflow.org/get_started/mnist/pros
"""
# Disable linter warnings to maintain consistency with tutorial.
# pylint: disable=invalid-name
# pylint: disable=g-bad-import-order

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

from tensorflow.examples.tutorials.mnist import input_data

import tensorflow as tf

FLAGS = None


def deepnn(x):
  """deepnn builds the graph for a deep net for classifying digits.

  Args:
    x: an input tensor with the dimensions (N_examples, 784), where 784 is the
    number of pixels in a standard MNIST image.

  Returns:
    A tuple (y, keep_prob). y is a tensor of shape (N_examples, 10), with values
    equal to the logits of classifying the digit into one of 10 classes (the
    digits 0-9). keep_prob is a scalar placeholder for the probability of
    dropout.
  """
  # Reshape to use within a convolutional neural net.
  # Last dimension is for "features" - there is only one here, since images are
  # grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
  with tf.name_scope('reshape'):
    x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer - maps one grayscale image to 32 feature maps.
  with tf.name_scope('conv1'):
    W_conv1 = weight_variable([5, 5, 1, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # Pooling layer - downsamples by 2X.
  with tf.name_scope('pool1'):
    h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer -- maps 32 feature maps to 64.
  with tf.name_scope('conv2'):
    W_conv2 = weight_variable([5, 5, 32, 64])
    b_conv2 = bias_variable([64])
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer.
  with tf.name_scope('pool2'):
    h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
  # is down to 7x7x64 feature maps -- maps this to 1024 features.
  with tf.name_scope('fc1'):
    W_fc1 = weight_variable([7 * 7 * 64, 1024])
    b_fc1 = bias_variable([1024])

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout - controls the complexity of the model, prevents co-adaptation of
  # features.
  with tf.name_scope('dropout'):
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Map the 1024 features to 10 classes, one for each digit
  with tf.name_scope('fc2'):
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob


def conv2d(x, W):
  """conv2d returns a 2d convolution layer with full stride."""
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')


def max_pool_2x2(x):
  """max_pool_2x2 downsamples a feature map by 2X."""
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')


def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)


def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)


def main(_):
  # Import data
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

  # Create the model
  x = tf.placeholder(tf.float32, [None, 784])

  # Define loss and optimizer
  y_ = tf.placeholder(tf.float32, [None, 10])

  # Build the graph for the deep net
  y_conv, keep_prob = deepnn(x)

  with tf.name_scope('loss'):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_,
                                                            logits=y_conv)
  cross_entropy = tf.reduce_mean(cross_entropy)

  with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

  with tf.name_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    correct_prediction = tf.cast(correct_prediction, tf.float32)
  accuracy = tf.reduce_mean(correct_prediction)

  graph_location = tempfile.mkdtemp()
  print('Saving graph to: %s' % graph_location)
  train_writer = tf.summary.FileWriter(graph_location)
  train_writer.add_graph(tf.get_default_graph())

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
      batch = mnist.train.next_batch(50)
      if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
      train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str,
                      default='/tmp/tensorflow/mnist/input_data',
                      help='Directory for storing input data')
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
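
The comments in deepnn() say that two rounds of downsampling take the 28x28 image down to 7x7x64 before the first fully connected layer. As a quick sanity check, the sketch below redoes that arithmetic in plain Python (no TensorFlow needed) and also counts the trainable parameters implied by the shapes passed to weight_variable and bias_variable. The helper name same_padding_output is my own. It relies on the rule that with padding='SAME' the spatial output size is ceil(input / stride).

```python
import math


def same_padding_output(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)


# (name, kernel_h, kernel_w, in_channels, out_channels) as in deepnn().
conv_layers = [("conv1", 5, 5, 1, 32), ("conv2", 5, 5, 32, 64)]

size = 28       # MNIST images are 28x28
channels = 1    # grayscale
param_counts = {}
for name, kh, kw, c_in, c_out in conv_layers:
    size = same_padding_output(size, 1)  # conv2d: stride 1 keeps the size
    size = same_padding_output(size, 2)  # max_pool_2x2: stride 2 halves it
    channels = c_out
    param_counts[name] = kh * kw * c_in * c_out + c_out  # weights + biases
    print("%s + pool -> %dx%dx%d" % (name, size, size, channels))

flat = size * size * channels                # input width of fc1
param_counts["fc1"] = flat * 1024 + 1024     # W_fc1 + b_fc1
param_counts["fc2"] = 1024 * 10 + 10         # W_fc2 + b_fc2
total_params = sum(param_counts.values())
print("flattened features:", flat)
print("trainable parameters:", total_params)
```

Nearly all of the parameters sit in fc1, so most of the weights are applied through the kind of dense matrix multiplication described at the top of this post.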

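About your own datasets:

You can upload your own datasets from the data page, reached through the menu 我的 -> 数据 (My -> Data).

Data uploaded on the 我的数据 (My Data) page is accessible from every cloud host you create.

When the dataset contains only a small number of files, the directory upload feature is recommended. If it contains a large number of files, pack them into a compressed archive before uploading, which saves a lot of upload time.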
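
Packing a folder of many small files into one archive can be done with the Python standard library before uploading. This is only a sketch: the folder and file names are hypothetical, and I am assuming a zip file is acceptable (check the site for which archive formats it actually handles).

```python
import os
import shutil
import tempfile


def pack_dataset(dataset_dir, archive_base):
    """Pack dataset_dir into archive_base + '.zip' and return the archive path."""
    parent = os.path.dirname(os.path.abspath(dataset_dir))
    folder = os.path.basename(os.path.abspath(dataset_dir))
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)


# Demo with a throwaway dataset of many small files (hypothetical names).
work = tempfile.mkdtemp()
dataset = os.path.join(work, "my_dataset")
os.makedirs(dataset)
for i in range(100):
    with open(os.path.join(dataset, "sample_%03d.txt" % i), "w") as f:
        f.write("label %d\n" % (i % 10))

archive_path = pack_dataset(dataset, os.path.join(work, "my_dataset"))
print("upload this single file:", archive_path)
```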

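Once the upload finishes, all uploaded files appear under the /data directory on the cloud host.

Because /data is network storage, its read and write speed is limited by the network, and training directly from /data is very slow. It is therefore recommended to copy the data from /data to /input or /root first and then train. See the detailed instructions on the website for the exact procedure.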
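
The copy step can be done from a notebook cell. Here is a minimal sketch, not the site's official procedure: stage_dataset is a hypothetical helper, and the demo uses temporary directories standing in for /data and /input so that it runs anywhere.

```python
import os
import shutil
import tempfile


def stage_dataset(src, dst):
    """Copy src (file or directory) to dst once; return the local path."""
    if os.path.exists(dst):
        return dst  # already staged by an earlier run
    if os.path.isdir(src):
        shutil.copytree(src, dst)
    else:
        os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
        shutil.copy2(src, dst)
    return dst


# On a cloud host the call might look like (hypothetical dataset name):
#   stage_dataset("/data/my_dataset", "/input/my_dataset")
network_dir = tempfile.mkdtemp()  # stands in for /data
local_dir = tempfile.mkdtemp()    # stands in for /input
src = os.path.join(network_dir, "my_dataset")
os.makedirs(src)
with open(os.path.join(src, "train.csv"), "w") as f:
    f.write("pixel,label\n")

staged = stage_dataset(src, os.path.join(local_dir, "my_dataset"))
print("train from:", staged)
```

The existence check is deliberately simple. If a copy is interrupted halfway, the partial destination has to be deleted by hand before staging again.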

Link to the 极客云 website: 极客云网站.

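One more note: some users online report that 极客云 GPU sessions disconnect on their own when a job runs for a long time. So far I have only run two small projects and have not encountered this, so weigh it against your own situation. The official guidance at the moment is: if a training job needs to run for a long time (more than a day), you are strongly advised to save checkpoints periodically. Even if the host is shut down for an overdue balance, you can resume from the last saved progress the next time it starts, and no data is lost.

For how to save models in the tensorflow framework, see: 如何使用Tensorflow加载预训练模型和保存模型 (How to Load Pretrained Models and Save Models with Tensorflow).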
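
To make the "save periodically, resume later" idea concrete, here is a framework-agnostic sketch of the pattern in plain Python. It is a toy, not the TensorFlow API: in TensorFlow 1.x the same role is played by tf.train.Saver. The function names, the JSON file, and the fake "weights" are all my own placeholders.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interrupted save never corrupts the file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weights": [0.0, 0.0]}


def train(path, total_steps, save_every=100, stop_at=None):
    """Toy loop: resume from the checkpoint and save every save_every steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulate the host being shut down mid-run
        # Placeholder for one real training step.
        state["weights"] = [w + 0.001 for w in state["weights"]]
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.json")
train(ckpt, total_steps=1000, stop_at=250)  # interrupted before step 251
resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=1000)       # picks up where the file left off
print("resumed from step", resumed_from, "and finished at step", final["step"])
```

Only the work done since the last save is lost on an interruption, so the save interval is a trade-off between disk writes and how much training you are willing to repeat.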

Note: shut down your cloud host (instance) whenever you are not running a project, otherwise you keep getting charged.
