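I'm a little embarrassed: I started reading this code because a classmate said, "You don't even know RetinaNet and you're still doing deep learning?" Hopefully they never see this post.
Paper link
TensorFlow code (the implementation I walk through)
The tf-RetinaNet project has since been completed, and the author also provides Chinese usage documentation. I hope I haven't led you astray.
The main contribution of the RetinaNet paper is a new loss function.
Hence the title, Focal Loss for Dense Object Detection: focal loss is designed specifically to reduce the class imbalance problem in object detection.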
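Network overview
The backbone is a feature pyramid structure similar to FPN. The backbone is not the focus of the paper; the loss is.
- Figure: the network used in the experiments
Loss
Object detectors currently fall into two camps: two-stage detectors and one-stage detectors. The former, such as Faster R-CNN and R-FCN, rely on region proposals; they reach high accuracy but are slow. Reducing the number of proposals or lowering the input resolution makes them faster, but not by a qualitative margin. The latter, such as YOLO and SSD, skip region proposals and regress boxes directly; they are fast but less accurate. The motivation for focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without giving up its speed. In other words, focal loss is designed specifically for one-stage detectors, not for two-stage ones.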
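Let's look at the loss functions:
- Binary cross-entropy (CE) loss:
$$CE(p, y) = -\log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
For binary classification, y is +1 or -1 and p lies between 0 and 1. When the true label is 1 (y = 1) and a sample x is predicted to belong to class 1 with probability p = 0.6, the loss is -log(0.6); note that this loss is always greater than or equal to 0. With p = 0.9 the loss is -log(0.9), so the loss for p = 0.6 is larger than the loss for p = 0.9.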
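The two numbers above can be checked with a few lines of plain Python. This is only an illustrative sketch; the helper name `binary_cross_entropy` is made up for this example and is not part of the project.

```python
import math

def binary_cross_entropy(p, y):
    """Cross entropy for one sample; p is the predicted probability of class 1, y is +1 or -1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

loss_06 = binary_cross_entropy(0.6, 1)  # -log(0.6)
loss_09 = binary_cross_entropy(0.9, 1)  # -log(0.9)
print(loss_06 > loss_09)  # the less confident prediction is penalized more
```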
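- Weighted (balanced) CE loss:
$$CE(p_t) = -\alpha_t \log(p_t)$$
This adds a coefficient $\alpha_t$, defined analogously to $p_t$: it equals a when the label is 1 and 1-a when the label is -1, with a between 0 and 1. Setting a therefore controls how much positive and negative samples contribute to the total loss (for example, if class 1 has far more samples than class -1, a is chosen between 0 and 0.5 to raise the weight of class -1 samples). This weighting balances positives against negatives, but it cannot distinguish easy samples from hard ones.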
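A minimal sketch of the weighted loss, assuming the same +1/-1 label convention as in the text. The function name `alpha_balanced_ce` and the sample probabilities are illustrative only.

```python
import math

def alpha_balanced_ce(p, y, a=0.25):
    """Weighted CE: the weight is a for label +1 and 1 - a for label -1."""
    p_t = p if y == 1 else 1.0 - p
    a_t = a if y == 1 else 1.0 - a
    return -a_t * math.log(p_t)

# Both samples have p_t = 0.6, so only the class weight differs.
pos_loss = alpha_balanced_ce(0.6, 1, a=0.25)
neg_loss = alpha_balanced_ce(0.4, -1, a=0.25)
print(neg_loss / pos_loss)  # ratio of the two class weights, (1 - a) / a
```

Both samples are equally hard, yet they receive different weights, which is exactly the limitation described above: the weighting knows about classes, not about difficulty.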
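- Focal Loss:
$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
Here $\gamma \ge 0$ is called the focusing parameter and $(1 - p_t)^{\gamma}$ is the modulating factor ($p_t = p$ when y = 1 and $1 - p$ otherwise).
- 1. When a sample is misclassified, $p_t$ is small (for example, when y = 1, p must be below 0.5 for the sample to be misclassified, so $p_t$ is small, and vice versa). The modulating factor is then close to 1, so the loss is almost unchanged. When $p_t$ approaches 1 (a correctly classified, easy sample), the modulating factor approaches 0 and the sample contributes very little to the total loss.
- 2. When $\gamma = 0$, focal loss is just the binary cross-entropy loss. As $\gamma$ increases, the effect of the modulating factor increases, so easy samples are down-weighted more strongly.
- These two properties are the core of focal loss: it uses a suitable function to measure how much hard and easy samples contribute to the total loss.
In the experiments the authors use the α-balanced form $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, which adjusts the weight of positive versus negative samples and also the weight of easy versus hard samples.
When $\gamma$ increases, a should be decreased slightly (in the experiments $\gamma = 2$, a = 0.25 works best). With $\gamma = 0$ and a = 0.5 the loss reduces to the ordinary cross entropy, up to a constant factor of 0.5.
This part still deserves closer study.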
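The two properties can be seen numerically. Below is a small plain-Python sketch of the α-balanced focal loss for a single sample; the function names and the sample probabilities 0.9 and 0.1 are chosen for illustration only.

```python
import math

def modulating_factor(p_t, gamma):
    """(1 - p_t)^gamma: near 1 for hard samples, near 0 for easy ones."""
    return (1.0 - p_t) ** gamma

def focal_loss_single(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss for one sample with label y in {+1, -1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * modulating_factor(p_t, gamma) * math.log(p_t)

easy_ratio = modulating_factor(0.9, 2.0)  # easy sample: loss scaled down ~100x
hard_ratio = modulating_factor(0.1, 2.0)  # hard sample: loss almost unchanged
print(easy_ratio, hard_ratio)
```

With gamma set to 0 the modulating factor is always 1, so only the α weighting remains.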
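On to the code:
The project seemed unfinished when I read it, so I can only analyze the code at the level of its main flow. It is the most-starred TensorFlow implementation on GitHub. It resembles the Google Object Detection API, and the tfrecord generation appears to reuse code from there. I have not debugged this code myself.

- data: directory that holds the data
- object_detection: the main components
- model_main.py: the entry point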
model_main.py
def main(unused_argv):
    flags.mark_flag_as_required('model_dir')
    flags.mark_flag_as_required('label_map_path')
    flags.mark_flag_as_required('train_file_pattern')
    flags.mark_flag_as_required('eval_file_pattern')
    config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
    run_config = {"label_map_path": FLAGS.label_map_path,
                  "num_classes": FLAGS.num_classes}
    if FLAGS.finetune_ckpt:
        run_config["finetune_ckpt"] = FLAGS.finetune_ckpt
    # Build the model_fn for the detection model
    model_fn = create_model_fn(run_config)
    # Wrap the model in an Estimator
    estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
    # Create the training and evaluation input functions, which load and preprocess the data
    train_input_fn = create_input_fn(FLAGS.train_file_pattern, True, FLAGS.image_size, FLAGS.batch_size)
    eval_input_fn = create_input_fn(FLAGS.eval_file_pattern, False, FLAGS.image_size)
    # Create the prediction (serving) input function
    prediction_fn = create_prediction_input_fn()
    # Create the TrainSpec and EvalSpec
    train_spec, eval_spec = create_train_and_eval_specs(train_input_fn, eval_input_fn, prediction_fn, FLAGS.num_train_steps)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
The main flow is as shown above:
- 1. Build the detection model
- 2. Read the data
- 3. Hand training and evaluation over to the Estimator
Two important functions appear here:
- create_model_fn
- create_train_and_eval_specs
Both are imported from model.
model.py
The first function, create_model_fn, builds the detection model. Only the main flow is shown here.
def create_model_fn(run_config, default_params=DEFAULT_PARAMS):
    def model_fn(features, labels, mode, params):
        # Build the RetinaNetModel (is_training, num_classes and label_offset
        # are set in parts of the function omitted from this excerpt)
        model = RetinaNetModel(is_training=is_training, num_classes=num_classes)
        if mode == tf.estimator.ModeKeys.TRAIN:
            # load pretrained model for checkpoint
            ckpt_file = run_config.get("finetune_ckpt")
            if ckpt_file:
                # Collect the variables to initialize from the pretrained checkpoint
                asg_map = model.restore_map()
                available_var_map = (_get_variables_available_in_ckpt(asg_map, ckpt_file))
                tf.train.init_from_checkpoint(ckpt_file, available_var_map)
        # predict
        images = features["image"]
        keys = features["key"]
        # Run the model to get the raw predictions
        predictions_dict = model.predict(images)
        # postprocess
        if mode in (tf.estimator.ModeKeys.EVAL, tf.estimator.ModeKeys.PREDICT):
            # Post-process the predictions: sigmoid scores, box decoding, NMS
            detections = model.postprocess(predictions_dict, score_thres=default_params.get("score_thres"))
        # unstack gt info
        if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
            # Training and evaluation need the ground-truth annotations
            unstacked_labels = unstack_batch(labels)
            gt_boxes_list = unstacked_labels["gt_boxes"]
            gt_labels_list = unstacked_labels["gt_labels"]
            # -1 due to label offset
            # Convert the labels to one-hot vectors
            gt_labels_onehot_list = [tf.one_hot(tf.squeeze(tf.cast(gt_labels-label_offset, tf.int32), 1), num_classes)
                                     for gt_labels in gt_labels_list]
            # Compute the losses
            reg_loss, cls_loss, box_weights, cls_weights = model.loss(predictions_dict, gt_boxes_list, gt_labels_onehot_list)
            # Scale the localization loss by box_loss_weight
            losses = [reg_loss * default_params.get("box_loss_weight"), cls_loss]
            total_loss_dict = {"Loss/classification_loss": cls_loss, "Loss/localization_loss": reg_loss}
            # add regularization loss
            regularization_loss = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
            if regularization_loss:
                regularization_loss = tf.add_n(regularization_loss, name='regularization_loss')
                losses.append(regularization_loss)
                total_loss_dict["Loss/regularization_loss"] = regularization_loss
            total_loss = tf.add_n(losses, name='total_loss')
            total_loss_dict["Loss/total_loss"] = total_loss
        # optimizer
        # Build the optimizer
        if mode == tf.estimator.ModeKeys.TRAIN:
            lr = learning_rate_schedule(default_params.get("total_train_steps"))
            optimizer = tf.train.MomentumOptimizer(lr, momentum=default_params.get("momentum"))
            # batch norm need update_ops
            update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
            with tf.control_dependencies(update_ops):
                train_op = optimizer.minimize(total_loss, tf.train.get_global_step())
        else:
            train_op = None
        # predict mode
        # Outputs used when exporting the prediction model
        if mode == tf.estimator.ModeKeys.PREDICT:
            export_outputs = {tf.saved_model.signature_constants.PREDICT_METHOD_NAME: detections}
        eval_metric_ops = {}
        # just for debugging
        # Log some tensors during training
        logging_hook = [tf.train.LoggingTensorHook({"gt_labels": gt_labels_list[0], "gt_boxes": gt_boxes_list[0],
                                                    'norm_box_loss': reg_loss, 'norm_cls_loss': cls_loss,
                                                    "pred_box": predictions_dict["box_pred"],
                                                    "pred_cls": predictions_dict["cls_pred"]},
                                                   every_n_iter=50)]
        if mode == tf.estimator.ModeKeys.EVAL:
            logging_hook = [tf.train.LoggingTensorHook({"gt_labels": gt_labels_list[0], "gt_boxes": gt_boxes_list[0],
                                                        "detection_boxes": detections["detection_boxes"],
                                                        "detection_classes": detections["detection_classes"],
                                                        "scores": detections["detection_scores"],
                                                        "num_detections": detections["num_detections"]},
                                                       every_n_iter=50)]
            eval_dict = _result_dict_for_single_example(images[0:1], keys[0], detections,
                                                        gt_boxes_list[0], tf.reshape(gt_labels_list[0], [-1]))
            if run_config["label_map_path"] is None:
                raise RuntimeError("label map file must be defined first!")
            else:
                category_index = create_categories_from_labelmap(run_config["label_map_path"])
                coco_evaluator = CocoDetectionEvaluator(categories=category_index)
            eval_metric_ops = coco_evaluator.get_estimator_eval_metric_ops(eval_dict)
            eval_metric_ops["classification_loss"] = tf.metrics.mean(cls_loss)
            eval_metric_ops["localization_loss"] = tf.metrics.mean(reg_loss)
        # Hand everything over to the Estimator
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=detections,
                                          loss=total_loss,
                                          train_op=train_op,
                                          eval_metric_ops=eval_metric_ops,
                                          training_hooks=logging_hook,
                                          export_outputs=export_outputs,
                                          evaluation_hooks=logging_hook)
    return model_fn
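This is the main training function. Its flow is:
- 1. Build the RetinaNetModel
- 2. Load the pretrained initial values
- 3. Run the model to get predictions
- 4. Post-process the predictions (the scores come from a sigmoid, not a softmax)
- 5. For training and evaluation, fetch the ground-truth annotations
- 6. Compute the loss
- 7. Build the optimizer
The model is built by RetinaNetModel, which is the focus of what follows.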
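Step 5 converts the ground-truth labels to one-hot vectors after subtracting the label offset (the code comment says "-1 due to label offset"). A plain-Python sketch of that conversion follows; `num_classes = 3` and the label values are made-up illustration data, and `one_hot` here only mimics the behaviour of `tf.one_hot` for in-range and out-of-range indices.

```python
def one_hot(index, depth):
    """Like tf.one_hot for a single index: an out-of-range index gives all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

label_offset = 1       # labels in the dataset start at 1, class slots start at 0
num_classes = 3        # hypothetical value for illustration
gt_labels = [1, 3]     # hypothetical ground-truth labels for one image

gt_labels_onehot = [one_hot(label - label_offset, num_classes) for label in gt_labels]
print(gt_labels_onehot)
```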
create_train_and_eval_specs
def create_train_and_eval_specs(train_input_fn,
                                eval_input_fn,
                                predict_fn,
                                train_steps):
    """
    Create a TrainSpec and EvalSpec
    """
    # Training spec
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                        max_steps=train_steps)
    eval_spec_name = "0"
    exported_name = "{}_{}".format('Servo', eval_spec_name)
    # Evaluation spec, with an exporter that saves the final model
    exporter = tf.estimator.FinalExporter(name=exported_name, serving_input_receiver_fn=predict_fn)
    eval_spec = tf.estimator.EvalSpec(name=eval_spec_name, input_fn=eval_input_fn, steps=None, exporters=exporter)
    return train_spec, eval_spec
Next comes RetinaNetModel.
In create_model_fn, the model is used in this order:
- 1.model = RetinaNetModel(is_training=is_training, num_classes=num_classes)
- 2.model.restore_map()
- 3.model.predict(images)
- 4.model.postprocess(predictions_dict, score_thres=default_params.get("score_thres"))
- 5.model.loss(predictions_dict, gt_boxes_list, gt_labels_onehot_list)
Below is the initializer of RetinaNetModel:
class RetinaNetModel():
    """RetinaNet mode constructor"""
    def __init__(self, is_training, num_classes, params=DEFAULT_PARAMS):
        """
        Args:
            is_training: indicate training or not
            num_classes: number of classes for prediction
            params: parameters for model definition
            resnet_arch: name of which resnet architecture used
        """
        self._is_training = is_training
        self._num_classes = num_classes
        self._nms_fn = post_processing.batch_multiclass_non_max_suppression
        self._score_convert_fn = tf.sigmoid
        self._params = params
        # self._unmatched_class_label = tf.constant([1] + (self._num_classes) * [0], tf.float32)
        self._unmatched_class_label = tf.constant((self._num_classes + 1) * [0], tf.float32)
        # Create the target assigner that matches anchors to ground-truth boxes
        self._target_assigner = create_target_assigner(unmatched_cls_target=self._unmatched_class_label)
        self._anchors = None
        self._anchor_generator = None
        # Box coordinate encoding from Faster R-CNN
        self._box_coder = faster_rcnn_box_coder.FasterRcnnBoxCoder()
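The initializer does not do much: it stores the main parameters and two helper functions (the NMS function and the sigmoid score converter), and creates the target assigner and the box coder.
Following the order above,
next is model.restore_map()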
def restore_map(self):
    variables_to_restore = {}
    # Go through all global variables in the graph
    for variable in tf.global_variables():
        var_name = variable.op.name
        if var_name.startswith("retinanet"):
            variables_to_restore[var_name] = variable
    # Return the name -> variable dict
    return variables_to_restore
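This logic confused me: at this point the model, including the base CNN, has not been built yet, so would there be any variable names to collect? tf.global_variables() only returns variables that already exist in the graph, so if restore_map() really ran before predict() the map would be empty. The excerpt above is condensed, so the call order in the actual project may differ. Pointers are welcome.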
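The filtering logic itself is easy to mimic without TensorFlow. In the sketch below the variable names are hypothetical and a plain list stands in for `tf.global_variables()`; it only illustrates why the result depends on whether the network has been built before the call.

```python
def select_by_prefix(var_names, prefix="retinanet"):
    """Keep only the names that start with the given scope prefix."""
    return {name: name for name in var_names if name.startswith(prefix)}

# Before the network is built there are no variables to collect.
before_build = select_by_prefix([])

# After the network is built (hypothetical variable names).
after_build = select_by_prefix([
    "retinanet/retinanet_fpn/conv1/kernel",
    "retinanet/class_net/class_pred/kernel",
    "global_step",
])
print(before_build, sorted(after_build))
```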
model.predict(images)
def predict(self, inputs):
    """
    Perform predict from batched input tensor.
    During this time, anchors must be constructed before post-process or loss function called
    Args:
        inputs: a [batch_size, height, width, channels] image tensor
    Returns:
        prediction_dict: dict with items:
            inputs: [batch_size, height, width, channels] image tensor
            box_pred: [batch_size, num_anchors, 4] tensor containing predicted boxes
            cls_pred: [batch_size, num_anchors, num_classes+1] tensor containing class predictions
            feature_maps: a list of feature map tensor
            anchors: [num_anchors, 4] tensor containing anchors in normalized coordinates
    """
    # Anchors per location = number of scales * number of aspect ratios
    num_anchors_per_loc = self._params.get("num_scales") * len(self._params.get("aspect_ratios"))
    # Build the base retinanet model, which returns a dict:
    # dict(box_pred=tf.concat(box_pred, axis=1),
    #      cls_pred=tf.concat(class_pred, axis=1),
    #      feature_map_list=feature_map_list)
    prediction_dict = retinanet(inputs, self._num_classes, num_anchors_per_loc, is_training=self._is_training)
    # generate anchors
    # Get the shape of every feature map in feature_map_list; no trainable parameters are involved
    feature_map_shape_list = self._get_feature_map_shape(prediction_dict["feature_map_list"])
    image_shape = shape_utils.combined_static_and_dynamic_shape(inputs)
    # initialize anchor generator
    if self._anchor_generator is None:
        # Generate fixed anchor boxes for the shape of each feature map
        self._anchor_generator = Anchor(feature_map_shape_list=feature_map_shape_list,
                                        img_size=(image_shape[1], image_shape[2]),
                                        anchor_scale=self._params.get("anchor_scale"),
                                        aspect_ratios=self._params.get("aspect_ratios"),
                                        scales_per_octave=self._params.get("num_scales"))
    self._anchors = self._anchor_generator.boxes
    prediction_dict["inputs"] = inputs
    prediction_dict["anchors"] = self._anchors
    return prediction_dict
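Here an FPN-style network makes the predictions:
- 1. Build the network. This fully convolutional network is built by retinanet and outputs box predictions and class scores (21 outputs per anchor here, i.e. num_classes + 1). retinanet resembles FPN.
- 2. Generate fixed anchor boxes on each feature map
- 3. Return the outputs
Next comes post-processing: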
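To get a feel for how many anchors step 2 produces, here is a rough count in plain Python. The input size of 512 and the choice of 3 scales and 3 aspect ratios are assumptions for illustration (the post does not show DEFAULT_PARAMS); the levels 3 to 7 and the stride of 2^level follow from the FPN code, where each level halves the resolution with SAME padding.

```python
import math

def feature_map_sizes(image_size, levels=(3, 4, 5, 6, 7)):
    """Side length of each pyramid level for a square input (stride 2**level, SAME padding)."""
    return {level: math.ceil(image_size / 2 ** level) for level in levels}

def total_anchors(image_size, num_scales, aspect_ratios):
    """Total anchor count = sum over levels of (h * w) * anchors per location."""
    anchors_per_loc = num_scales * len(aspect_ratios)
    sizes = feature_map_sizes(image_size)
    return sum(side * side for side in sizes.values()) * anchors_per_loc

sizes = feature_map_sizes(512)                                   # assumed input size
count = total_anchors(512, num_scales=3, aspect_ratios=(0.5, 1.0, 2.0))
print(sizes, count)
```

Tens of thousands of anchors per image, almost all of them background, is the class imbalance that focal loss is meant to handle.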
model.postprocess(predictions_dict, score_thres=default_params.get("score_thres"))
def postprocess(self, prediction_dict, score_thres=1e-8):
    """
    Convert prediction tensors to final detection by slicing the bg class, decoding box predictions,
    applying nms and clipping to image window
    Args:
        prediction_dict: dict returned by self.predict function
        score_thres: threshold for score to remove low confident boxes
    Returns:
        detections: a dict with these items:
            detection_boxes: [batch_size, max_detection, 4]
            detection_scores: [batch_size, max_detections]
            detection_classes: [batch_size, max_detections]
    """
    with tf.name_scope('Postprocessor'):
        box_pred = prediction_dict["box_pred"]
        cls_pred = prediction_dict["cls_pred"]
        # decode box
        # Decode the predicted box encodings back to coordinates
        detection_boxes = self._batch_decode(box_pred)
        detection_boxes = tf.expand_dims(detection_boxes, axis=2)
        # sigmoid function to calculate score from feature
        # Turn the class logits into per-class probabilities
        detection_scores_with_bg = tf.sigmoid(cls_pred, name="converted_scores")
        # slice detection scores without background
        detection_scores = tf.slice(detection_scores_with_bg, [0, 0, 1], [-1, -1, -1])
        clip_window = tf.constant([0, 0, 1, 1], dtype=tf.float32)
        # Apply non-maximum suppression
        (nms_boxes, nms_scores, nms_classes,
         num_detections) = post_processing.batch_multiclass_non_max_suppression(detection_boxes,
                                                                                detection_scores,
                                                                                score_thresh=score_thres,
                                                                                iou_thresh=self._params.get("iou_thres"),
                                                                                max_size_per_class=self._params.get("max_detections_per_class"),
                                                                                max_total_size=self._params.get("max_detections_total"),
                                                                                clip_window=clip_window)
        return dict(detection_boxes=nms_boxes,
                    detection_scores=nms_scores,
                    detection_classes=nms_classes,
                    num_detections=num_detections)
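This function is fairly simple:
- 1. Decode all the boxes
- 2. Convert the class predictions to scores with a sigmoid (not a softmax) and drop the background column
- 3. Apply non-maximum suppression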
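The last two steps can be sketched in plain Python for a single class. This is a simplified greedy NMS, not the `batch_multiclass_non_max_suppression` used by the project; the boxes, logits and the IoU threshold of 0.5 are made-up illustration values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def iou(a, b):
    """IoU of two boxes given as [ymin, xmin, ymax, xmax]."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thresh=0.5, score_thresh=1e-8):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [[0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.5, 0.55], [0.5, 0.5, 1.0, 1.0]]
logits = [[-5.0, 2.0], [-5.0, 1.0], [-5.0, 0.5]]   # column 0 is the background slot
scores = [sigmoid(row[1]) for row in logits]       # sigmoid, then drop the background column
keep = greedy_nms(boxes, scores)
print(keep)
```

The second box overlaps the first almost completely and has a lower score, so it is suppressed; the third box does not overlap and survives.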
Next is the loss computation, which is also the main contribution of the paper. Let's go through it step by step.
model.loss(predictions_dict, gt_boxes_list, gt_labels_onehot_list)
def loss(self, prediction_dict, gt_boxes_list, gt_labels_list):
    """
    Compute loss between prediction tensor and gt
    Args:
        prediction_dict: dict of following items
            box_encodings: a [batch_size, num_anchors, 4] containing predicted boxes
            cls_pred_with_bg: a [batch_size, num_anchors, num_classes+1] containing predicted classes
        gt_boxes_list: a list of 2D gt box tensor with shape [num_boxes, 4]
        gt_labels_list: a list of 2-D gt one-hot class tensor with shape [num_boxes, num_classes]
    Returns:
        a dictionary with localization_loss and classification_loss
    """
    with tf.name_scope(None, 'Loss', prediction_dict.values()):
        # Assign classification and regression targets to the anchors
        (batch_cls_targets, batch_cls_weights, batch_reg_targets, batch_reg_weights,
         match_list) = self._assign_targets(gt_boxes_list, gt_labels_list)
        # num_positives = [tf.reduce_sum(tf.cast(tf.not_equal(matches.match_results, -1), tf.float32))
        #                  for matches in match_list]
        # Summarize gt_boxes_list and match_list for TensorBoard
        self._summarize_target_assignment(gt_boxes_list, match_list)
        # Compute the localization loss
        reg_loss = regression_loss(prediction_dict["box_pred"], batch_reg_targets, batch_reg_weights)
        # Compute the focal loss proposed in the paper
        cls_loss = focal_loss(prediction_dict["cls_pred"], batch_cls_targets, batch_cls_weights)
        # normalize loss by num of matches
        # num_pos_anchors = [tf.reduce_sum(tf.cast(tf.not_equal(match.match_results, -1), tf.float32))
        #                    for match in match_list]
        normalizer = tf.maximum(tf.to_float(tf.reduce_sum(batch_reg_weights)), 1.0)
        # normalize reg loss by box codesize (here is 4)
        # Normalize the losses (this is normalization, not regularization)
        reg_normalizer = normalizer * 4
        normalized_reg_loss = tf.multiply(reg_loss, 1.0/reg_normalizer, name="regression_loss")
        normalized_cls_loss = tf.multiply(cls_loss, 1.0/normalizer, name="classification_loss")
        return normalized_reg_loss, normalized_cls_loss, batch_reg_weights, batch_cls_weights
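The loss computation here is fairly simple:
- 1. Get the ground-truth targets
- 2. Compute the two losses
- 3. Normalize the losses by the number of matched anchors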
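The normalization in step 3 can be written out in plain Python as a sketch. Nested lists stand in for the `[batch_size, num_anchors]` weight tensor, and the loss values are made-up numbers.

```python
def normalize_losses(reg_loss, cls_loss, reg_weights, box_code_size=4):
    """Divide both losses by the number of matched anchors (at least 1);
    the regression loss is additionally divided by the box code size."""
    num_matched = sum(sum(row) for row in reg_weights)
    normalizer = max(float(num_matched), 1.0)
    return reg_loss / (normalizer * box_code_size), cls_loss / normalizer

# Two images with three anchors each; two anchors are matched in the first image.
reg_weights = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
norm_reg, norm_cls = normalize_losses(8.0, 3.0, reg_weights)
print(norm_reg, norm_cls)

# With no matched anchors the normalizer is clamped to 1 to avoid dividing by zero.
empty_reg, empty_cls = normalize_losses(8.0, 3.0, [[0.0, 0.0, 0.0]])
```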
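The above covers the main flow of RetinaNetModel. Several important pieces that it relies on live in other files:
- retinanet
- focal_loss
- regression_loss
- faster_rcnn_box_coder
- Anchor
I won't explain every one of them in detail, but they do matter and are still on my TODO list.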
retinanet.py
Only the retinanet function from this file is used.
The function itself is fairly simple.
def retinanet(images, num_classes, num_anchors_per_loc, resnet_arch='resnet50', is_training=True):
    """
    Get box prediction features and class prediction features from given images
    Args:
        images: input batch of images with shape (batch_size, h, w, 3)
        num_classes: number of classes for prediction
        num_anchors_per_loc: number of anchors at each feature map spatial location
        resnet_arch: name of which resnet architecture used
        is_training: indicate training or not
    return:
        prediciton dict: holding following items:
            box_predictions tensor from each feature map with shape (batch_size, num_anchors, 4)
            class_predictions_with_bg tensor from each feature map with shape (batch_size, num_anchors, num_class+1)
            feature_maps: list of tensor of feature map
    """
    assert resnet_arch in list(RESNET_ARCH_BLOCK.keys()), "resnet architecture not defined"
    with tf.variable_scope('retinanet'):
        batch_size = combined_static_and_dynamic_shape(images)[0]
        # features = {3: p3,
        #             4: p4,
        #             5: l5,
        #             6: p6,
        #             7: p7}
        features = retinanet_fpn(images, block_layers=RESNET_ARCH_BLOCK[resnet_arch], is_training=is_training)
        class_pred = []
        box_pred = []
        feature_map_list = []
        num_slots = num_classes + 1
        # Run the classification head on every feature level
        with tf.variable_scope('class_net', reuse=tf.AUTO_REUSE):
            for level in features.keys():
                class_outputs = share_weight_class_net(features[level], level,
                                                       num_slots,
                                                       num_anchors_per_loc,
                                                       is_training=is_training)
                # class_outputs shape before reshape: [batch_size, h, w, num_slots * num_anchors_per_loc]
                class_outputs = tf.reshape(class_outputs, shape=[batch_size, -1, num_slots])
                class_pred.append(class_outputs)
                feature_map_list.append(features[level])
        # Run the box regression head on every feature level
        with tf.variable_scope('box_net', reuse=tf.AUTO_REUSE):
            for level in features.keys():
                box_outputs = share_weight_box_net(features[level], level, num_anchors_per_loc, is_training=is_training)
                # box_outputs shape before reshape: [batch_size, h, w, 4 * num_anchors_per_loc]
                box_outputs = tf.reshape(box_outputs, shape=[batch_size, -1, 4])
                box_pred.append(box_outputs)
        return dict(box_pred=tf.concat(box_pred, axis=1),
                    cls_pred=tf.concat(class_pred, axis=1),
                    feature_map_list=feature_map_list)
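It largely follows the FPN flow:
- 1. Build the base FPN and get five feature maps
- 2. Predict box coordinates on each feature map
- 3. Predict object classes on each feature map
The functions used here are all fairly simple:
- retinanet_fpn
- share_weight_class_net
- share_weight_box_net
The retinanet_fpn function: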
def retinanet_fpn(inputs,
                  block_layers,
                  depth=256,
                  is_training=True,
                  scope=None):
    """
    Generator for RetinaNet FPN models. A small modification of initial FPN model for returning layers
    {P3, P4, P5, P6, P7}. See paper Focal Loss for Dense Object Detection. arxiv: 1708.02002
    P2 is discarded and P6 is obtained via 3x3 stride-2 conv on c5; P7 is computed by applying ReLU followed by
    3x3 stride-2 conv on P6. P7 is to improve large object detection
    Returns:
        5 feature map tensors: {P3, P4, P5, P6, P7}
    """
    with tf.variable_scope(scope, 'retinanet_fpn', [inputs]) as sc:
        net = conv2d_same(inputs, 64, kernel_size=7, strides=2, scope='conv1')
        net = _bn(net, is_training)
        net = tf.nn.relu6(net)
        net = tf.layers.max_pooling2d(net, pool_size=3, strides=2, padding='SAME', name='pool1')
        # Bottom up (p2..p5 here are the backbone stage outputs, usually called C2..C5)
        # block 1, down-sampling is done in conv3_1, conv4_1, conv5_1
        p2 = stack_bottleneck(net, layers=block_layers[0], depth=64, strides=1, is_training=is_training)
        # block 2
        p3 = stack_bottleneck(p2, layers=block_layers[1], depth=128, strides=2, is_training=is_training)
        # block 3
        p4 = stack_bottleneck(p3, layers=block_layers[2], depth=256, strides=2, is_training=is_training)
        # block 4
        p5 = stack_bottleneck(p4, layers=block_layers[3], depth=512, strides=2, is_training=is_training)
        p5 = tf.identity(p5, name="p5")
        # coarser FPN feature
        # p6
        p6 = tf.layers.conv2d(p5, filters=depth, kernel_size=3, strides=2, name='conv6', padding='SAME')
        p6 = _bn(p6, is_training)
        p6 = tf.nn.relu6(p6)
        p6 = tf.identity(p6, name="p6")
        # P7
        p7 = tf.layers.conv2d(p6, filters=depth, kernel_size=3, strides=2, name='conv7', padding='SAME')
        p7 = _bn(p7, is_training)
        p7 = tf.identity(p7, name="p7")
        # lateral layer
        l3 = tf.layers.conv2d(p3, filters=depth, kernel_size=1, strides=1, name='l3', padding='SAME')
        l4 = tf.layers.conv2d(p4, filters=depth, kernel_size=1, strides=1, name='l4', padding='SAME')
        l5 = tf.layers.conv2d(p5, filters=depth, kernel_size=1, strides=1, name='l5', padding='SAME')
        # Top down
        # Upsample l5 and fuse it with l4
        t4 = nearest_neighbor_upsampling(l5, 2) + l4
        p4 = tf.layers.conv2d(t4, filters=depth, kernel_size=3, strides=1, name='t4', padding='SAME')
        p4 = _bn(p4, is_training)
        p4 = tf.identity(p4, name="p4")
        # Upsample t4 and fuse it with l3
        t3 = nearest_neighbor_upsampling(t4, 2) + l3
        p3 = tf.layers.conv2d(t3, filters=depth, kernel_size=3, strides=1, name='t3', padding='SAME')
        p3 = _bn(p3, is_training)
        p3 = tf.identity(p3, name="p3")
        features = {3: p3,
                    4: p4,
                    5: l5,
                    6: p6,
                    7: p7}
        return features
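This function builds the base FPN. Top-down fusion is only used to produce P3 and P4: level 5 is simply the lateral output l5, while P6 and P7 come from stride-2 convolutions on the backbone output. See the paper for the details.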
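The top-down step (`nearest_neighbor_upsampling(l5, 2) + l4`) can be illustrated on tiny 2D lists. This is a single-channel, plain-Python sketch with made-up values, not the project's TensorFlow implementation.

```python
def nearest_neighbor_upsample(fmap, scale=2):
    """Repeat every value `scale` times along both the row and column axes."""
    out = []
    for row in fmap:
        wide_row = [value for value in row for _ in range(scale)]
        for _ in range(scale):
            out.append(list(wide_row))
    return out

def add_maps(a, b):
    """Element-wise sum of two feature maps with identical shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

l5 = [[1.0, 2.0]]                        # coarse 1x2 map
l4 = [[0.5] * 4 for _ in range(2)]       # finer 2x4 lateral map
t4 = add_maps(nearest_neighbor_upsample(l5), l4)
print(t4)
```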
The share_weight_box_net function:
def share_weight_box_net(inputs, level, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True):
    """
    Similar to class_net with output feature shape (batch_size, h, w, num_anchors*4)
    """
    for i in range(num_layers_before_predictor):
        inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1,
                                  kernel_initializer=tf.random_normal_initializer(stddev=0.01),
                                  padding="SAME",
                                  name='box_{}'.format(i))
        inputs = _bn(inputs, is_training, name="box_{}_bn_level_{}".format(i, level))
        inputs = tf.nn.relu6(inputs)
    # Produce 4 * num_anchors_per_loc values for every location of the feature map
    outputs = tf.layers.conv2d(inputs,
                               filters=4*num_anchors_per_loc,
                               kernel_size=3,
                               kernel_initializer=tf.random_normal_initializer(stddev=0.01),
                               padding="SAME",
                               name="box_pred")
    return outputs
This function predicts the boxes.
The share_weight_class_net function:
def share_weight_class_net(inputs, level, num_classes, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True):
    """
    net for predicting class labels
    NOTE: Share same weights when called more then once on different feature maps
    Args:
        inputs: feature map with shape (batch_size, h, w, channel)
        level: which feature map
        num_classes: number of predicted classes
        num_anchors_per_loc: number of anchors at each spatial location in feature map
        num_layers_before_predictor: number of the additional conv layers before the predictor.
        is_training: is in training or not
    returns:
        feature with shape (batch_size, h, w, num_classes*num_anchors)
    """
    for i in range(num_layers_before_predictor):
        inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1,
                                  kernel_initializer=tf.random_normal_initializer(stddev=0.01),
                                  padding="SAME",
                                  name='class_{}'.format(i))
        inputs = _bn(inputs, is_training, name="class_{}_bn_level_{}".format(i, level))
        inputs = tf.nn.relu(inputs)
    # Produce num_classes * num_anchors_per_loc values for every location
    outputs = tf.layers.conv2d(inputs,
                               filters=num_classes*num_anchors_per_loc,
                               kernel_size=3,
                               kernel_initializer=tf.random_normal_initializer(stddev=0.01),
                               padding="SAME",
                               name="class_pred")
    return outputs
This function predicts the classes.
That is the main content of retinanet.py.
Now let's look at loss.py.
This file contains only two loss functions.
focal_loss:
def focal_loss(logits, onehot_labels, weights, alpha=0.25, gamma=2.0):
    """
    Compute sigmoid focal loss between logits and onehot labels: focal loss = -(1-pt)^gamma*log(pt)
    Args:
        onehot_labels: onehot labels with shape (batch_size, num_anchors, num_classes)
        logits: last layer feature output with shape (batch_size, num_anchors, num_classes)
        weights: weight tensor returned from target assigner with shape [batch_size, num_anchors]
        alpha: The hyperparameter for adjusting biased samples, default is 0.25
        gamma: The hyperparameter for penalizing the easy labeled samples, default is 2.0
    Returns:
        a scalar of focal loss of total classification
    """
    with tf.name_scope("focal_loss"):
        logits = tf.cast(logits, tf.float32)
        onehot_labels = tf.cast(onehot_labels, tf.float32)
        # ce equals -log(p_t) for every element
        ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=onehot_labels, logits=logits)
        predictions = tf.sigmoid(logits)
        predictions_pt = tf.where(tf.equal(onehot_labels, 1), predictions, 1.-predictions)
        # per-element alpha_t: alpha for positives, 1 - alpha for negatives
        alpha_t = tf.scalar_mul(alpha, tf.ones_like(onehot_labels, dtype=tf.float32))
        alpha_t = tf.where(tf.equal(onehot_labels, 1.0), alpha_t, 1-alpha_t)
        weighted_loss = ce * tf.pow(1-predictions_pt, gamma) * alpha_t * tf.expand_dims(weights, axis=2)
        return tf.reduce_sum(weighted_loss)
Read this together with the formula given earlier; the code matches it, with γ = 2 and a = 0.25 as the default parameters.
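To check that the code and the formula agree, the per-element computation can be mirrored in plain Python. The cross-entropy expression `max(x, 0) - x*z + log(1 + exp(-|x|))` is the numerically stable form documented for `tf.nn.sigmoid_cross_entropy_with_logits`; the function names and test logits are illustrative.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_ce_with_logits(x, z):
    """Stable sigmoid cross entropy for logit x and label z in {0, 1}."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

def focal_loss_elem(logit, label, weight=1.0, alpha=0.25, gamma=2.0):
    """One element of weighted_loss in focal_loss: ce * (1 - p_t)^gamma * alpha_t * weight."""
    ce = sigmoid_ce_with_logits(logit, label)
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return ce * (1.0 - p_t) ** gamma * alpha_t * weight

def focal_loss_formula(p_t, alpha_t, gamma=2.0):
    """The formula from the paper: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

from_code = focal_loss_elem(2.0, 1)
from_formula = focal_loss_formula(sigmoid(2.0), 0.25)
print(abs(from_code - from_formula) < 1e-12)
```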
This is the box coordinate regression loss:
def regression_loss(pred_boxes, gt_boxes, weights, delta=1.0):
    """
    Regression loss (Smooth L1 loss: also known as huber loss)
    Args:
        pred_boxes: [batch_size, num_anchors, 4]
        gt_boxes: [batch_size, num_anchors, 4]
        weights: Tensor of weights multiplied by loss with shape [batch_size, num_anchors]
        delta: delta for smooth L1 loss
    Returns:
        a box regression loss scalar
    """
    loss = tf.reduce_sum(tf.losses.huber_loss(predictions=pred_boxes,
                                              labels=gt_boxes,
                                              delta=delta,
                                              weights=tf.expand_dims(weights, axis=2),
                                              scope='box_loss',
                                              reduction=tf.losses.Reduction.NONE))
    return loss
Figure: derivation of the Huber loss, borrowed from another post.
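For reference, the Huber loss on a single error value can be written in a few lines of plain Python. This follows the standard definition (quadratic inside `delta`, linear outside); with `delta = 1.0`, the default in `regression_loss`, it coincides with the Smooth L1 loss.

```python
def huber(error, delta=1.0):
    """Huber loss: 0.5 * e^2 when |e| <= delta, else delta * |e| - 0.5 * delta^2."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error
    return delta * abs_error - 0.5 * delta * delta

small = huber(0.5)   # quadratic region
large = huber(2.0)   # linear region
print(small, large)
```

The linear region keeps the gradient bounded for large errors, so badly matched boxes do not dominate the regression loss.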

The other two files, faster_rcnn_box_coder and Anchor, follow the Faster R-CNN approach with few changes.
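As a reminder of what the Faster R-CNN box encoding does, here is a minimal plain-Python sketch of the encode/decode pair relative to an anchor. It assumes boxes in [ymin, xmin, ymax, xmax] order and no scale factors; the project's box coder may apply additional scaling, and the box and anchor values are made up.

```python
import math

def to_center(box):
    """[ymin, xmin, ymax, xmax] -> (ycenter, xcenter, height, width)."""
    ymin, xmin, ymax, xmax = box
    h, w = ymax - ymin, xmax - xmin
    return ymin + h / 2.0, xmin + w / 2.0, h, w

def encode(box, anchor):
    """Offsets of the box relative to the anchor: [ty, tx, th, tw]."""
    yc, xc, h, w = to_center(box)
    ya, xa, ha, wa = to_center(anchor)
    return [(yc - ya) / ha, (xc - xa) / wa, math.log(h / ha), math.log(w / wa)]

def decode(code, anchor):
    """Inverse of encode: recover [ymin, xmin, ymax, xmax] from the offsets."""
    ty, tx, th, tw = code
    ya, xa, ha, wa = to_center(anchor)
    h, w = math.exp(th) * ha, math.exp(tw) * wa
    yc, xc = ty * ha + ya, tx * wa + xa
    return [yc - h / 2.0, xc - w / 2.0, yc + h / 2.0, xc + w / 2.0]

anchor = [0.2, 0.2, 0.6, 0.6]
gt_box = [0.25, 0.3, 0.65, 0.9]
code = encode(gt_box, anchor)
recovered = decode(code, anchor)
print(code, recovered)
```

The network regresses the four offsets in `code`; `_batch_decode` in postprocess performs the inverse step to get boxes back.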

References:
Understanding RetinaNet
Focal Loss: an analysis of the RetinaNet algorithm
Training RetinaNet on Pascal VOC 2007
Note on RetinaNet (in Chinese)



