Introduction
The rapid rise of deep learning has rekindled the public's hopes for a future in which large-scale use of advanced artificial intelligence brings a huge leap in productivity.
The reality, though, is that in the pursuit of accuracy we CNN researchers have become ever more obsessed with building networks that are deeper, more heavily parameterized, and more structurally complex (a bit like builders who chase headline GDP numbers with ever grander construction projects, never mind whether anyone will actually live in them... but I digress).
Think about it: since AlexNet we have had VGG, GoogLeNet, and ResNet in quick succession, with layer counts and parameter counts marching ever upward; the ResNet family at one point pushed CNN depth past a thousand layers. But once these lab-grown networks are deployed in real production, all sorts of unexpected difficulties appear. First, inference is simply too slow, especially on mobile devices (phones, tablets) whose compute resources are limited. Imagine taking a photo and asking an AI-powered app to recognize what is in it, only to have it ponder for a dozen seconds before slowly returning a result that is 60% accurate, draining half your battery along the way. Would you use such an app? Second, these complex CNNs carry enormous numbers of trained parameters that must all be loaded into memory at deployment time; on a mobile device with limited RAM, running out of memory becomes a routine event. Imagine an app that crashes and reboots your phone three or four times out of every ten uses, and makes it feel unbearably sluggish the rest of the time. Would you want to use that?
In short, slimming down CNNs is the more practical idea: marry the power of CNNs to real production needs and make them genuinely usable. There are currently two broad families of methods. One takes the trained weights of a full-size CNN and, at deployment time, trims the model structure (pruning) or the weight parameters (compression) so that inference can run within realistic budgets. The other directly trains a network with lower computational complexity and fewer parameters, designed from the start to meet the demands of production deployment.
MobileNet belongs to the second category.
The MobileNet architecture
A new convolution block built from depthwise and pointwise convolutions
Let us first review a standard convolution. Suppose the input is a D_F x D_F x M feature map, where D_F is the spatial width and height (assumed equal for simplicity) and M is the number of input channels, and the output is a D_G x D_G x N feature map, where D_G is the output spatial size and N is the number of output channels. The kernel of such a standard conv layer then has shape D_K x D_K x M x N.
The output is computed from the input as:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$$
Its computational cost is D_K * D_K * M * N * D_F * D_F.
Now consider the new block formed by a depthwise conv followed by a pointwise conv. Both are convolution operations; the pointwise conv in particular is just a standard 1x1 conv. The depthwise conv applies one filter per input channel, so the number of output channels it produces equals the number of input channels M. The pointwise conv then runs after it, using 1x1 convolutions to fuse those M feature maps into the final N output feature maps.
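To make the two operations concrete, here is a minimal TensorFlow 1.x sketch of one depthwise separable block (the shapes are made-up illustration values, not from the paper; batch norm and ReLU are omitted for brevity):

import tensorflow as tf

# Input: batch of 1, 32x32 spatial size, M=16 input channels (illustrative values).
inputs = tf.placeholder(tf.float32, [1, 32, 32, 16])

# Depthwise conv: one 3x3 filter per input channel (channel multiplier 1),
# so the output keeps 16 channels.
dw_filter = tf.get_variable('dw_filter', [3, 3, 16, 1])
net = tf.nn.depthwise_conv2d(inputs, dw_filter, strides=[1, 1, 1, 1], padding='SAME')

# Pointwise conv: an ordinary 1x1 conv that mixes the 16 channels into N=64.
pw_filter = tf.get_variable('pw_filter', [1, 1, 16, 64])
net = tf.nn.conv2d(net, pw_filter, strides=[1, 1, 1, 1], padding='SAME')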
Formally, the depthwise computation is:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$$

Its computational cost is D_K * D_K * M * D_F * D_F.
The pointwise cost is D_F * D_F * M * N.
So the total cost of the combined depthwise + pointwise block is D_K * D_K * M * D_F * D_F + D_F * D_F * M * N.
Dividing by the cost of the standard convolution shows how much the new block saves:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

With the 3x3 kernels MobileNet uses, this works out to roughly 8 to 9 times less computation than a standard convolution, at only a small cost in accuracy.
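A quick sanity check of these formulas in plain Python (the layer shape below is an illustrative mid-network MobileNet configuration):

def standard_conv_macs(dk, df, m, n):
    return dk * dk * m * n * df * df

def depthwise_separable_macs(dk, df, m, n):
    return dk * dk * m * df * df + df * df * m * n

# Example: 3x3 kernel, 14x14 feature map, M=512, N=512.
std = standard_conv_macs(3, 14, 512, 512)
sep = depthwise_separable_macs(3, 14, 512, 512)
print(sep / std)  # ~0.113, i.e. 1/N + 1/Dk^2 = 1/512 + 1/9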
The MobileNet network
The figure below shows the layer composition of the depthwise + pointwise block; in MobileNet each of the two convolutions is followed by batch normalization and a ReLU.
The figure below shows the full MobileNet architecture. About 95% of its computation time is spent in the 1x1 convs, and the 1x1 conv weights also account for about 75% of all trainable parameters. This is convenient, because 1x1 convolutions map directly onto highly optimized GEMM routines without the im2col memory reordering that a general convolution needs.
Training MobileNet
The Google authors trained the network with RMSprop and asynchronous gradient updates. They found that a model as small as MobileNet does not call for much regularization (with so few trainable parameters it is unlikely to overfit), so unlike Inception v3 they used no auxiliary side heads or label smoothing, and applied far less aggressive image data augmentation.
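A minimal sketch of this training setup: the choice of RMSprop matches the paper, but the hyperparameter values below are illustrative assumptions, and total_loss is assumed to have been built elsewhere.

import tensorflow as tf
import tensorflow.contrib.slim as slim

# RMSprop, as described in the paper; learning rate and decay are assumed values.
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.045, decay=0.9, momentum=0.9)
train_op = slim.learning.create_train_op(total_loss, optimizer)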
Width Multiplier: Thinner Models
To balance computation against accuracy, the authors introduce a parameter alpha that thins every layer uniformly: it scales the number of input channels M and output channels N, so the actual computation uses alpha x M and alpha x N. Alpha is therefore also called a shrinking parameter.
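In the slim implementation this corresponds to the depth_multiplier argument; channel counts are scaled by essentially this one-line helper from mobilenet_v1.py (min_depth is a floor that keeps very thin layers from collapsing to too few channels):

depth = lambda d: max(int(d * depth_multiplier), min_depth)

# e.g. depth_multiplier=0.75 turns a 512-channel layer into a 384-channel one.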
Resolution Multiplier: Reduced Representation
Likewise to save computation and memory, the authors use a parameter beta (the resolution multiplier, written ρ in the paper) to shrink the feature maps: a feature map with original spatial size D becomes beta x D after scaling.
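With both multipliers applied, the cost of one depthwise separable layer becomes (writing beta as ρ, as the paper does):

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$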
The table below shows the computation and memory that the alpha and beta multipliers can save.
The next two tables show how applying these shrinking parameters affects the final model's classification accuracy as well as its computation and memory cost.
Experimental results
The table below compares MobileNet with other popular models such as VGG and Inception. MobileNet holds its own on classification accuracy while cutting computation and memory cost enormously.
Code analysis
Below are the basic configuration flags used for training, from the training script in the slim nets repository.
import tensorflow as tf

flags = tf.app.flags

flags.DEFINE_string('master', '', 'Session master')
flags.DEFINE_integer('task', 0, 'Task')
flags.DEFINE_integer('ps_tasks', 0, 'Number of ps')
flags.DEFINE_integer('batch_size', 64, 'Batch size')
flags.DEFINE_integer('num_classes', 1001, 'Number of classes to distinguish')
flags.DEFINE_integer('number_of_steps', None,
                     'Number of training steps to perform before stopping')
flags.DEFINE_integer('image_size', 224, 'Input image resolution')
flags.DEFINE_float('depth_multiplier', 1.0, 'Depth multiplier for mobilenet')
flags.DEFINE_bool('quantize', False, 'Quantize training')
flags.DEFINE_string('fine_tune_checkpoint', '',
                    'Checkpoint from which to start finetuning.')
flags.DEFINE_string('checkpoint_dir', '',
                    'Directory for writing training checkpoints and logs')
flags.DEFINE_string('dataset_dir', '', 'Location of dataset')
flags.DEFINE_integer('log_every_n_steps', 100, 'Number of steps per log')
flags.DEFINE_integer('save_summaries_secs', 100,
                     'How often to save summaries, secs')
flags.DEFINE_integer('save_interval_secs', 100,
                     'How often to save checkpoints, secs')
Below is the per-layer parameter and multiply-accumulate (MAC) breakdown, taken from a docstring in mobilenet_v1.py.
"""
75% Mobilenet V1 (base) with input size 128x128:
See mobilenet_v1_075()
Layer params macs
--------------------------------------------------------------------------------
MobilenetV1/Conv2d_0/Conv2D: 648 2,654,208
MobilenetV1/Conv2d_1_depthwise/depthwise: 216 884,736
MobilenetV1/Conv2d_1_pointwise/Conv2D: 1,152 4,718,592
MobilenetV1/Conv2d_2_depthwise/depthwise: 432 442,368
MobilenetV1/Conv2d_2_pointwise/Conv2D: 4,608 4,718,592
MobilenetV1/Conv2d_3_depthwise/depthwise: 864 884,736
MobilenetV1/Conv2d_3_pointwise/Conv2D: 9,216 9,437,184
MobilenetV1/Conv2d_4_depthwise/depthwise: 864 221,184
MobilenetV1/Conv2d_4_pointwise/Conv2D: 18,432 4,718,592
MobilenetV1/Conv2d_5_depthwise/depthwise: 1,728 442,368
MobilenetV1/Conv2d_5_pointwise/Conv2D: 36,864 9,437,184
MobilenetV1/Conv2d_6_depthwise/depthwise: 1,728 110,592
MobilenetV1/Conv2d_6_pointwise/Conv2D: 73,728 4,718,592
MobilenetV1/Conv2d_7_depthwise/depthwise: 3,456 221,184
MobilenetV1/Conv2d_7_pointwise/Conv2D: 147,456 9,437,184
MobilenetV1/Conv2d_8_depthwise/depthwise: 3,456 221,184
MobilenetV1/Conv2d_8_pointwise/Conv2D: 147,456 9,437,184
MobilenetV1/Conv2d_9_depthwise/depthwise: 3,456 221,184
MobilenetV1/Conv2d_9_pointwise/Conv2D: 147,456 9,437,184
MobilenetV1/Conv2d_10_depthwise/depthwise: 3,456 221,184
MobilenetV1/Conv2d_10_pointwise/Conv2D: 147,456 9,437,184
MobilenetV1/Conv2d_11_depthwise/depthwise: 3,456 221,184
MobilenetV1/Conv2d_11_pointwise/Conv2D: 147,456 9,437,184
MobilenetV1/Conv2d_12_depthwise/depthwise: 3,456 55,296
MobilenetV1/Conv2d_12_pointwise/Conv2D: 294,912 4,718,592
MobilenetV1/Conv2d_13_depthwise/depthwise: 6,912 110,592
MobilenetV1/Conv2d_13_pointwise/Conv2D: 589,824 9,437,184
--------------------------------------------------------------------------------
Total: 1,800,144 106,002,432
"""
Below is the core of the graph-construction code. Note that it only builds the graph; the actual conv and depthwise conv computations run in the underlying C++ operator implementations.
from collections import namedtuple

# Conv describes a standard convolutional layer; DepthSepConv a depthwise
# separable block. 'depth' is the number of output channels.
Conv = namedtuple('Conv', ['kernel', 'stride', 'depth'])
DepthSepConv = namedtuple('DepthSepConv', ['kernel', 'stride', 'depth'])

_CONV_DEFS = [
    Conv(kernel=[3, 3], stride=2, depth=32),
    DepthSepConv(kernel=[3, 3], stride=1, depth=64),
    DepthSepConv(kernel=[3, 3], stride=2, depth=128),
    DepthSepConv(kernel=[3, 3], stride=1, depth=128),
    DepthSepConv(kernel=[3, 3], stride=2, depth=256),
    DepthSepConv(kernel=[3, 3], stride=1, depth=256),
    DepthSepConv(kernel=[3, 3], stride=2, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=2, depth=1024),
    DepthSepConv(kernel=[3, 3], stride=1, depth=1024)
]
with tf.variable_scope(scope, 'MobilenetV1', [inputs]):
  with slim.arg_scope([slim.conv2d, slim.separable_conv2d], padding=padding):
    # The current_stride variable keeps track of the output stride of the
    # activations, i.e., the running product of convolution strides up to the
    # current network layer. This allows us to invoke atrous convolution
    # whenever applying the next convolution would result in the activations
    # having output stride larger than the target output_stride.
    current_stride = 1

    # The atrous convolution rate parameter.
    rate = 1

    net = inputs
    for i, conv_def in enumerate(conv_defs):
      end_point_base = 'Conv2d_%d' % i

      if output_stride is not None and current_stride == output_stride:
        # If we have reached the target output_stride, then we need to employ
        # atrous convolution with stride=1 and multiply the atrous rate by the
        # current unit's stride for use in subsequent layers.
        layer_stride = 1
        layer_rate = rate
        rate *= conv_def.stride
      else:
        layer_stride = conv_def.stride
        layer_rate = 1
        current_stride *= conv_def.stride

      if isinstance(conv_def, Conv):
        end_point = end_point_base
        if use_explicit_padding:
          net = _fixed_padding(net, conv_def.kernel)
        net = slim.conv2d(net, depth(conv_def.depth), conv_def.kernel,
                          stride=conv_def.stride,
                          normalizer_fn=slim.batch_norm,
                          scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
          return net, end_points

      elif isinstance(conv_def, DepthSepConv):
        end_point = end_point_base + '_depthwise'

        # By passing filters=None separable_conv2d produces only a
        # depthwise convolution layer.
        if use_explicit_padding:
          net = _fixed_padding(net, conv_def.kernel, layer_rate)
        net = slim.separable_conv2d(net, None, conv_def.kernel,
                                    depth_multiplier=1,
                                    stride=layer_stride,
                                    rate=layer_rate,
                                    normalizer_fn=slim.batch_norm,
                                    scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
          return net, end_points

        end_point = end_point_base + '_pointwise'
        net = slim.conv2d(net, depth(conv_def.depth), [1, 1],
                          stride=1,
                          normalizer_fn=slim.batch_norm,
                          scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
          return net, end_points
      else:
        raise ValueError('Unknown convolution type %s for layer %d'
                         % (conv_def.ltype, i))
raise ValueError('Unknown final endpoint %s' % final_endpoint)
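Putting it together, a minimal usage sketch (assuming the slim nets package from the tensorflow/models repository is on the Python path):

import tensorflow as tf
from nets import mobilenet_v1

# Build a 75%-width MobileNet classifier graph for inference.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
logits, end_points = mobilenet_v1.mobilenet_v1(
    images, num_classes=1001, depth_multiplier=0.75, is_training=False)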
References
- Andrew G. Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, 2017
- https://github.com/tensorflow/models/tree/master/research/slim/nets
- https://github.com/Zehaos/MobileNet