SSD: Single Shot MultiBox Detector: Part 2 - Code and Implementation Details


Author: 心有寶寶人自圓

Notice: You are welcome to repost the images or text in this article; please credit the source.

Preface

Inspired by those who came before me, I decided it was time to start writing articles to document what I learn.

I have read some papers and written some code before; I will fill in those gaps gradually ??

For now, let me share what I have been reading recently.

I am sharing my own understanding and takeaways here; if anything is wrong or poorly explained, please point it out ??

This article is the follow-up to SSD: Single Shot Multibox Detector: Part 1 - Paper Reading; time to fill in that gap......

Paper: SSD: Single Shot MultiBox Detector

Our goal: implement SSD in PyTorch ??

I used python-3.6 + pytorch-1.3.0 + torchvision-0.4.1

Training set: VOC2007 trainval + VOC2012 trainval

Test set: VOC2007 test

The object categories are listed below: 20 classes + 1 (background)

('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 
'chair', 'cow', 'diningtable','dog', 'horse', 'motorbike', 'person',
 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
  • The images below show detect results after 45 training epochs, well short of the author's 200+ epochs, but the results look decent (training is quite time-consuming ??); a few randomly chosen test-set images are shown to give a feel for the detection quality ??


0. Review of key concepts from the paper

  • single-shot vs two-stage: a typical two-stage model (the R-CNN family) follows the pipeline the SSD paper describes: large numbers of multi-scale region proposals, a CNN to extract features, a high-quality classifier for classification, regression to predict box locations, and so on. In short, it suffers from an accuracy-speed trade-off, and its heavy computational cost makes it unsuitable for real-time detection in the real world. SSD removes the most time-consuming stages, proposal generation and resampling, and instead uses fixed anchor boxes built into the model, letting us detect objects both quickly and accurately
  • Fixed anchor boxes (fixed default boxes, priors): in my earlier paper-reading post, most of the preparation revolved around anchor boxes. Their design is crucial for training, because they are turned into ground-truth targets (offset + label). The anchors are fixed inside the SSD model in advance (hence "priors"), identified by (aspect ratio, scale). Since anchors are tied to feature maps at different levels, higher-level maps use larger scales and lower-level maps smaller ones (predictions are made per prior)
  • Multi-scale feature maps and predictors: SSD predicts on feature maps at several levels, appended after the truncated base net. Lower levels mainly detect smaller objects and higher levels larger ones; the predictor at each scale learns to detect objects at that scale. Because a single pixel's receptive field is larger in the higher-level maps, the predictors can all use small convolution kernels of a fixed size
  • Hard negative mining: training typically produces a huge number of negatives, severely unbalancing the positive/negative ratio of the training data, so instead of using all negatives we explicitly pick a fixed proportion of the negative predictions with the highest confidence loss
  • Non-maximum suppression: keep only the most confident predicted boxes and remove overlapping, redundant ones
The overall workload is considerable; I'll try to keep the comments clear ??

Remember to define a global variable:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

1. Starting from the anchor boxes (the paper's fixed default boxes; "Priors" from here on)

import matplotlib.pyplot as plt

def show_box(box, color):
    """
    Draw a bounding box with matplotlib
    :param box: bounding box, (xmin, ymin, xmax, ymax)
    :return: matplotlib.patches.Rectangle
    """
    return plt.Rectangle(xy=(box[0], box[1]), width=box[2] - box[0], height=box[3] - box[1],
                         fill=False, edgecolor=color, linewidth=2)

Generally speaking, objects (of whatever class) are scattered across an image at all kinds of positions and sizes. Probabilistically, an object can appear anywhere, so the best we can do is discretize this probability space so that we can at least assign probabilities......?? We therefore spread anchor boxes as widely as possible over the whole feature map (a discretized probability space?).

Anchor boxes are prior, fixed boxes; together they approximately represent the probability space of class possibilities and box shapes. To emphasize this prior nature, we give them the English name: Prior.

1.1 Fine, Priors then

  • These anchor boxes are chosen by hand, and their sizes and scales should match the training data; for the priors to represent the probability space, they are generated around every pixel position
  • As in the paper-reading post, lower levels use smaller scales (to detect smaller objects) and higher levels larger scales (to detect larger objects). Because scales are expressed as fractions, they map back to the original image consistently from any feature map
    The 6 priors at one location of a 10x10 feature map (the others are not drawn to avoid clutter)

(See the paper or my previous post for the full procedure; only the key steps are marked here)

def create_prior_boxes(widths: list, heights: list, scales: list, aspect_ratios: list) -> torch.Tensor:
    """
    Create prior boxes at each pixel, following the authors' method in the paper
    :param widths: widths list of all feature maps using for create priors
    :param heights: heights list of all feature maps using for create priors
    :param scales: scales list of all feature maps use for create priors.
                Note that each feature map has a specific scale
    :param aspect_ratios: aspect-ratio lists of all feature maps used to create priors.
                Note that each feature map has its own set of ratios
    :return: priors' location in center coordinates , a tensor in shape of(8732, 4)
    """
    prior_boxes = []
    for i, (width, height, scale, ratios) in enumerate(zip(widths, heights, scales, aspect_ratios)):
        for y in range(height):
            for x in range(width):
                # change cxcy to the center of pixel
                # change cxcy in range 0 to 1
                cx = (x + 0.5) / width
                cy = (y + 0.5) / height
                for ratio in ratios:
                    # all those params are proportional form(percent coordinates)
                    prior_width = scale * math.sqrt(ratio)
                    prior_height = scale / math.sqrt(ratio)
                    prior_boxes.append([cx, cy, prior_width, prior_height])

                    # For the aspect ratio of 1, we also add a default box whose scale is sqrt(s(k)*(sk+1))
                    if ratio == 1:
                        try:
                            additional_scale = math.sqrt(scales[i] * scales[i + 1])
                        # for the last feature map there is no next scale, so fall back to 1
                        except IndexError:
                            additional_scale = 1

                        # ratio of 1 means scale is width and height
                        prior_boxes.append([cx, cy, additional_scale, additional_scale])

    return torch.FloatTensor(prior_boxes).clamp_(0, 1).to(device) # (8732, 4) Note that they are percent coordinates
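As a usage sketch, the SSD300 configuration below reproduces the paper's 8732 priors. The feature-map sizes, scales, and aspect ratios are the ones commonly quoted for SSD300; treat them as an assumption on my part, since they are not listed in this post:

# SSD300 prior configuration (assumed values following the paper's setup)
fmap_dims = [38, 19, 10, 5, 3, 1]                      # conv4_3 ... conv11_2
scales = [0.1, 0.2, 0.375, 0.55, 0.725, 0.9]           # one scale per feature map
aspect_ratios = [[1., 2., 0.5],
                 [1., 2., 3., 0.5, 1. / 3.],
                 [1., 2., 3., 0.5, 1. / 3.],
                 [1., 2., 3., 0.5, 1. / 3.],
                 [1., 2., 0.5],
                 [1., 2., 0.5]]
priors_cxcy = create_prior_boxes(fmap_dims, fmap_dims, scales, aspect_ratios)
print(priors_cxcy.shape)  # torch.Size([8732, 4])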

1.2 How Priors are represented

The paper represents a Prior as (cx, cy, w, h): the center form. For coding convenience we sometimes also use the boundary form (xmin, ymin, xmax, ymax), so we need conversions between the two

def xy_to_cxcy(xy: torch.Tensor) -> torch.Tensor:
    """
    Convert boxes from the boundary form (xmin, ymin, xmax, ymax) to the center form (cx, cy, w, h)
    :param xy: boxes in (xmin, ymin, xmax, ymax) form, a tensor of size (num_boxes, 4)
    :return: boxes in (cx, cy, w, h) form, a tensor of size (num_boxes, 4)
    """
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2, xy[:, 2:] - xy[:, :2]], dim=1)

def cxcy_to_xy(cxcy: torch.Tensor) -> torch.Tensor:
    """
    Convert boxes from the center form (cx, cy, w, h) to the boundary form (xmin, ymin, xmax, ymax)
    :param cxcy: boxes in (cx, cy, w, h) form, a tensor of size (n_boxes, 4)
    :return: boxes in (xmin, ymin, xmax, ymax) form
    """
    return torch.cat([cxcy[:, :2] - (cxcy[:, 2:] / 2), cxcy[:, :2] + (cxcy[:, 2:] / 2)], 1)
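A quick round-trip check of the two conversions (a sanity test I added, not from the original post):

box_xy = torch.FloatTensor([[0.1, 0.2, 0.5, 0.6]])   # (xmin, ymin, xmax, ymax)
box_cxcy = xy_to_cxcy(box_xy)                        # tensor([[0.3, 0.4, 0.4, 0.4]])
assert torch.allclose(cxcy_to_xy(box_cxcy), box_xy)  # the two conversions are inverses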

Note: as argued in the paper-reading post, for several reasons Priors should be expressed in relative lengths (relative coordinates, i.e., normalized to [0, 1])

1.3 Prior to ground truth

Obviously the priors are not actual ground truth (they deviate from the true boundaries, carry no class label, and each prior's ground-truth assignment is uncertain; we need to quantify all of this). We must convert each prior's information into ground-truth targets to compute the loss (and we must also understand what exactly we predict and how the predictions convert back into actual bounding boxes)

1.3.1 Offsets

The offsets are (\Delta cx, \Delta cy, \Delta w, \Delta h); as noted in the paper-reading post, they are encoded as:

\hat{cx}=\frac{cx-cx_{anchor}}{w_{anchor}},\quad \hat{cy}=\frac{cy-cy_{anchor}}{h_{anchor}},\quad \hat{w}=\log\left(\frac{w}{w_{anchor}}\right),\quad \hat{h}=\log\left(\frac{h}{h_{anchor}}\right) \qquad (1)

where (cx, cy, w, h) is the ground-truth box and (cx_{anchor}, cy_{anchor}, w_{anchor}, h_{anchor}) is the prior

In practice the encoding is usually standardized further with empirical parameters:

\hat{cx}=\frac{(cx-cx_{anchor})/w_{anchor}-\mu_x}{\sigma_x},\quad \hat{cy}=\frac{(cy-cy_{anchor})/h_{anchor}-\mu_y}{\sigma_y},\quad \hat{w}=\frac{\log(w/w_{anchor})-\mu_w}{\sigma_w},\quad \hat{h}=\frac{\log(h/h_{anchor})-\mu_h}{\sigma_h} \qquad (2)

where the empirical parameters are \mu_x=\mu_y=\mu_w=\mu_h=0, \sigma_x=\sigma_y=0.1, and \sigma_w=\sigma_h=0.2 (matching the factors of 10 and 5 in the code below)

def cxcy_to_gcxgcy(cxcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
    """
    Encode center-form boxes as offsets w.r.t. their priors, following Eq. (2)
    The center-form target boxes and the priors correspond one to one
    :param cxcy: boxes in center form, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center form, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes, a tensor of size (n_priors, 4)
    """
    return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / priors_cxcy[:, 2:] * 10,  # sigma = 0.1
                      torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)  # sigma = 0.2
    

To recover the actual predicted bounding boxes we must invert (decode) the encoding above (note: what the predictors actually output are these encoded offsets)

def gcxgcy_to_cxcy(gcxgcy: torch.Tensor, priors_cxcy: torch.Tensor) -> torch.Tensor:
    """
    Decode predicted offsets (one per prior) back into center-form bounding boxes
    :param gcxgcy: encoded boxes (i.e. offsets), e.g. the model's output, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes in center form, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
    """
    return torch.cat([gcxgcy[:, :2] / 10 * priors_cxcy[:, 2:] + priors_cxcy[:, :2],
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], dim=1)
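A round-trip sanity check (my addition): encoding a box against a prior and then decoding it should recover the original box.

boxes_cxcy = torch.FloatTensor([[0.50, 0.50, 0.20, 0.30]])  # a center-form box
prior = torch.FloatTensor([[0.48, 0.52, 0.25, 0.25]])       # a center-form prior
offsets = cxcy_to_gcxgcy(boxes_cxcy, prior)
assert torch.allclose(gcxgcy_to_cxcy(offsets, prior), boxes_cxcy)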

For the ground-truth offsets we only need cxcy to be the ground-truth boxes, but each cxcy must correspond to a prior one to one; establishing this correspondence is what we discuss next

1.3.2 Object class

Class 0 is the background; classes 1..n_classes are the object categories. Images differ in how many objects they contain and of which classes, so we first assign each prior an object, and that object's class determines the prior's class

1.3.3 Criterion

To assign classes to priors we need a metric for how well a prior matches a ground-truth box

The paper uses the Jaccard overlap (intersection over union, IoU)


IoU

The IoU functions are defined below; note that the inputs are boxes in boundary form

def find_intersection(set_1, set_2):
    """
    Find the intersection of every box combination between two sets of boxes that are in boundary coordinates.
    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
    """

    # PyTorch auto-broadcasts singleton dimensions
    lower_bound = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # (n1,n2,2)
    upper_bound = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # (n1,n2,2)
    intersection_dims = torch.clamp(upper_bound - lower_bound, 0)  # (n1, n2, 2)
    return intersection_dims[:, :, 0] * intersection_dims[:, :, 1]  # (n1, n2)


def find_jaccard_overlap(set_1, set_2):
    """
    Find the Jaccard Overlap (IoU) of every box combination between two sets of boxes that are in boundary coordinates.
    :param set_1: set 1, a tensor of dimensions (n1, 4)
    :param set_2: set 2, a tensor of dimensions (n2, 4)
    :return: Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, a tensor of dimensions (n1, n2)
    """
    # Find intersections
    intersection = find_intersection(set_1, set_2)

    # Find areas of each box in both sets
    areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)

    # Find the union
    # PyTorch auto-broadcasts singleton dimensions
    union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection  # (n1, n2)
    return intersection / union  # (n1, n2)

If set_1 is the priors (8732, 4) and set_2 the ground-truth boxes (n_objects_per_image, 4), we end up with a (8732, n_objects_per_image) tensor: the IoU between every prior and every object box in the image
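For example (my addition; this assumes priors_cxcy holds the (8732, 4) priors from 1.1):

priors_xy = cxcy_to_xy(priors_cxcy)                       # priors in boundary form
objects_xy = torch.FloatTensor([[0.1, 0.1, 0.4, 0.5],
                                [0.6, 0.2, 0.9, 0.8]])    # two toy ground-truth boxes
print(find_jaccard_overlap(priors_xy, objects_xy).shape)  # torch.Size([8732, 2])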

1.3.4 Priors to ground truth

def label_prior(priors_cxcy, boxes, classes):
    """
    Assign a ground-truth label to every prior. Note that we do this for each image in a batch.
    The priors are fixed beforehand; boxes and classes come from the dataloader.
    :param priors_cxcy: the priors we created, in shape (8732, 4); note they are center-form, fractional coordinates
    :param boxes: a tensor of the true objects' bounding boxes in the image; note they are fractional coordinates
    :param classes: a tensor of the true objects' class labels in the image
    :return: per-prior class labels and encoded offsets
    """

    n_objects = boxes.size(0)
    # center form to boundary form, for the IoU computation
    priors_xy = cxcy_to_xy(priors_cxcy)
    overlaps = find_jaccard_overlap(boxes, priors_xy)  # (n_objects, 8732)

    # for each prior, find the object with the largest overlap and assign that object (not yet its class)
    overlap_per_prior, object_per_prior = overlaps.max(dim=0)  # (8732)

    # assigning classes purely by maximum IoU creates two problems:
    # 1. if an object is not the best match of any prior, it never gets assigned to a prior at all
    # 2. priors whose best overlap is below the threshold (0.5) must be assigned to the background (class 0)

    # fix the first problem:
    _, prior_per_object = overlaps.max(dim=1)  # (n_objects), each value is an index in [0, 8731]

    object_per_prior[prior_per_object] = torch.LongTensor(range(n_objects)).to(device)  # force each object onto the prior it overlaps most
    overlap_per_prior[prior_per_object] = 1.  # and make sure those priors pass the threshold below

    # fix the second problem:
    class_per_prior = classes[object_per_prior]  # look up each prior's class via its assigned object
    class_per_prior[overlap_per_prior < 0.5] = 0  # (8732)

    # encode, for every prior, the offset to its assigned object's box
    offset_per_prior = cxcy_to_gcxgcy(xy_to_cxcy(boxes[object_per_prior]), priors_cxcy)  # (8732, 4)

    return class_per_prior, offset_per_prior

Note that every prior now corresponds to its own ground truth; together they detect objects at different scales and positions

label_prior() handles one image of a batch together with its object boxes and class labels (from the xml annotations, via the dataloader); a simple for loop over the batch gives the priors-to-ground-truth assignment for every image, used in the loss computation (see 5.1).

2. Network architecture

SSD takes VGG-16 truncated before its FC layers as the base net, modifies some of the base net's details, appends Conv6 and Conv7, and then stacks extra convolutional layers on top

(Note: for code readability, the SSD network is split into BaseNet and AuxiliaryConvolutions)

vgg-16
The author's detail changes + the auxiliary structure

Because of its fully connected layers, the full VGG-16 expects inputs of size (3, 224, 224); the author reworks the network to accept 300x300 inputs (the SSD300 model)

2.0 Conv4_3:

Following the original VGG-16 forward pass, the 300 x 300 input would reach Conv4_3 downsampled to 37 x 37, yet the figure shows 38 x 38. In VGG-16 only the pooling layers downsample, so the change comes from modifying maxpool3: its output-size computation switches from rounding down (floor) to rounding up (ceiling)

self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)

2.1 Maxpool5

Instead of keeping the same layer as the original VGG-16, we use a maxpool with size=(3, 3), stride=1, padding=1

self.pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

2.2 Conv6 and Conv7: I hope I can explain this clearly enough ??

fc6-fc7: the image goes (512, 7, 7).flatten() \Rightarrow (fc6) \Rightarrow 4096 \Rightarrow (fc7) \Rightarrow 4096; the author wants to build the Conv6 and Conv7 kernels directly from the weights of fc6 and fc7

2.2.1 First, let's sort out how convolutional and fully connected layers convert into each other
  • Convolutional layer -> fully connected layer:


    Conv to FC
From the figure above it is easy to see that the weights of the converted fc layer form a sparse matrix built from the kernel weights. Each output-channel pixel of a feature map is the sum, over all in_channels, of convolutions at the same input position (i.e., the shaded red box is the sum of the results of convolving the stacked shaded blue boxes (assuming several input channels) with the corresponding kernels), so out_channels controls the number of feature maps while in_channels and out_channels determine the height and width of the fc weight matrix
  • Fully connected layer -> convolutional layer: consider flattening the (512, 7, 7) input pixels into 4096 outputs; the fc weight matrix is then (512*7*7, 4096)

    If the kernel is made as large as the image, i.e. (4096, 512, 7, 7), then by the definition of convolution each output value (within one output channel) is every pixel of every channel multiplied by its kernel weight and summed, exactly the fully connected computation; the channel dimension now plays the role of the old feature dimension

  • So conv6's kernel should be (4096, 512, 7, 7) and conv7's (4096, 4096, 1, 1); a toy-size numeric check follows below
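The promised toy-size check of this fc-to-conv equivalence (my addition, with sizes shrunk from 512/4096 so it runs instantly):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 7, 7)                 # a small stand-in for the (512, 7, 7) input
fc = nn.Linear(8 * 7 * 7, 16)               # stand-in for fc6
conv = nn.Conv2d(8, 16, kernel_size=7)      # kernel as large as the image
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(16, 8, 7, 7))  # reshape fc weights into the kernel
    conv.bias.copy_(fc.bias)
out_fc = fc(x.flatten(1))                   # (1, 16)
out_conv = conv(x).flatten(1)               # (1, 16, 1, 1) -> (1, 16)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True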

However, this still won't do ??: these filters are numerous, huge, and computationally expensive, so the author downsamples (decimates) the kernels

2.2.2 Kernel decimation

The process is actually very simple: just subsample the kernel parameters along the out_channels, height, and width dims.......

from collections.abc import Iterable  # `from collections import Iterable` is deprecated

def decimate(tensor: torch.Tensor, m: Iterable) -> torch.Tensor:
    """
    Decimate (subsample) some dimensions of a tensor; m lists the sampling step for each dimension
    :param tensor: the tensor to decimate
    :param m: list of per-dimension sampling steps; None means that dimension is left untouched
    :return: the decimated tensor
    """
    assert tensor.dim() == len(m)
    for d in range(tensor.dim()):
        if m[d] is not None:
            tensor = tensor.index_select(dim=d, index=torch.arange(start=0, end=tensor.size(d), step=m[d]))
    return tensor
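A tiny demo of decimate (my addition): keep every 2nd value along dim 1 and every 3rd along dim 2.

t = torch.arange(24.).view(2, 3, 4)
print(decimate(t, m=[None, 2, 3]).shape)  # torch.Size([2, 2, 2])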

The author sets the sampling step along both the height and width dims to 3 (keep every third value) and the step along out_channels to 4, keeping \frac{1}{4} of the original kernels

So we finally obtain the Conv6 and Conv7 kernels: (1024, 512, 3, 3) and (1024, 1024, 1, 1) respectively
2.2.3 Atrous convolution

Atrous convolution (dilated convolution, also known as convolution with holes......) really targets neighboring pixels (neighboring pixels usually carry highly redundant information). To enlarge the receptive field without pooling (pooling means losing image information), we insert holes into the convolution's input window. Atrous convolution loses no image information: one output pixel simply skips its immediate neighbors in the input, while those skipped pixels are still convolved with the kernel when computing other output pixels......enough words, the picture below says it better \downarrow??

Image from vdumoulin/conv_arithmetic (you have probably seen this series of figures; the shaded area is where the convolution is applied ??)

DILATED CONVOLUTIONS with kernel size 3x3, dilation=2

It is easy to see that every input pixel is indeed used (nothing is discarded as in pooling) while the receptive field grows

2.2.4 Atrous convolution and kernel decimation

In the paper, conv6's output stays 19x19 and uses atrous convolution.

After the decimation described above, a feature map that should have met a 7x7 kernel now meets a 3x3 one with values missing in between (the holes are in the kernel), so the natural choice is a convolution that skips every 3 pixels (dilation=3). The author's repository, however, uses dilation=6, presumably because the modified maxpool5 no longer halves the output size, so the dilation needs to be doubled

self.conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # atrous convolution
self.conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
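A quick shape check (my addition): with the modified pool5, the input to conv6 is (N, 512, 19, 19), and dilation=6 with padding=6 keeps the 19x19 size.

x = torch.randn(1, 512, 19, 19)  # output of the modified pool5
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
print(conv7(torch.relu(conv6(x))).shape)  # torch.Size([1, 1024, 19, 19])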

Next, initialize base_net with the weights and biases of the original fully connected layers:

# this part can be defined in class BaseNet as a function for init.
# get state_dict, which only contains params
state_dict = base_net.state_dict()  # base_net is an instance of BaseNet
pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()

# fc6
conv_fc6_weight = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7)  # (4096, 512, 7, 7)
conv_fc6_bias = pretrained_state_dict['classifier.0.bias']  # (4096)
state_dict['conv6.weight'] = decimate(conv_fc6_weight, m=[4, None, 3, 3])  # (1024, 512, 3, 3)
state_dict['conv6.bias'] = decimate(conv_fc6_bias, m=[4])  # (1024)
# fc7: in the pretrained model, fc7 is named classifier.3
conv_fc7_weight = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1)  # (4096, 4096, 1, 1)
conv_fc7_bias = pretrained_state_dict['classifier.3.bias']  # (4096)
state_dict['conv7.weight'] = decimate(conv_fc7_weight, m=[4, 4, None, None])  # (1024, 1024, 1, 1)
state_dict['conv7.bias'] = decimate(conv_fc7_bias, m=[4])  # (1024)

base_net.load_state_dict(state_dict)

......this headache-inducing part is finally over ??

2.3 The remaining auxiliary convolutional layers:

These are all layers the author adds to extract larger-scale features; they are easy to follow, and the 1x1 convolutions are a nice trick (further distilling the feature maps?) ??

class AuxiliaryConvolutions(nn.Module):
    """
    Additional convolutions to produce higher-level feature maps.
    """

    def __init__(self):
        super(AuxiliaryConvolutions, self).__init__()

        # Auxiliary convolutions on top of the VGG base
        self.conv8_1 = nn.Conv2d(1024, 256, kernel_size=1, padding=0)
        self.conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)  # spatial size: 19 -> 10

        self.conv9_1 = nn.Conv2d(512, 128, kernel_size=1, padding=0)
        self.conv9_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)  # spatial size: 10 -> 5

        self.conv10_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
        self.conv10_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)  # spatial size: 5 -> 3

        self.conv11_1 = nn.Conv2d(256, 128, kernel_size=1, padding=0)
        self.conv11_2 = nn.Conv2d(128, 256, kernel_size=3, padding=0)  # spatial size: 3 -> 1
        
        # Initialize convolutions' parameters
        for c in self.children():
            if isinstance(c, nn.Conv2d):
                nn.init.xavier_normal_(c.weight)
                nn.init.constant_(c.bias, 0.)

2.4 Multi-level feature maps:

As the figure shows, the feature maps chosen for multi-scale prediction are conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 (both low-level and high-level maps); we simply return them from forward

BaseNet: forward returns conv4_3_features, conv7_features

AuxiliaryConvolutions: forward returns conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features (a sketch of such a forward is given below)
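A possible forward for AuxiliaryConvolutions (the original post leaves it out, so this is a sketch; it assumes `import torch.nn.functional as F`):

import torch.nn.functional as F

# inside AuxiliaryConvolutions:
def forward(self, conv7_features):
    """
    :param conv7_features: output of conv7, a tensor of (N, 1024, 19, 19)
    :return: the four higher-level feature maps
    """
    out = F.relu(self.conv8_1(conv7_features))      # (N, 256, 19, 19)
    conv8_2_features = F.relu(self.conv8_2(out))    # (N, 512, 10, 10)

    out = F.relu(self.conv9_1(conv8_2_features))    # (N, 128, 10, 10)
    conv9_2_features = F.relu(self.conv9_2(out))    # (N, 256, 5, 5)

    out = F.relu(self.conv10_1(conv9_2_features))   # (N, 128, 5, 5)
    conv10_2_features = F.relu(self.conv10_2(out))  # (N, 256, 3, 3)

    out = F.relu(self.conv11_1(conv10_2_features))  # (N, 128, 3, 3)
    conv11_2_features = F.relu(self.conv11_2(out))  # (N, 256, 1, 1)

    return conv8_2_features, conv9_2_features, conv10_2_features, conv11_2_features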

2.5 Predictors

Each prediction feature map feeds its own predictors, one for offsets and one for classes; the predictors at every level share a similar structure: kernel_size=3, padding=1

Note that the predicted offsets are the encoded offsets relative to that feature map's priors (see 1.3), and the class predictor must produce a score for every class

def loc_predictor(in_channels, num_priors):
    """
    Box-prediction layer: predicts 4 offsets for each prior at every pixel of the input
    :param in_channels: number of input channels
    :param num_priors: number of priors generated around each position
    :return: the convolutional layer that predicts offsets
    """
    return nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)


def cls_predictor(in_channels, num_priors, num_classes):
    """
    Class-prediction layer: predicts a score per class for each prior at every pixel of the input
    It uses a convolution that preserves the input's height and width, so output and input
    positions on the feature map correspond one to one
    :param in_channels: number of input channels
    :param num_priors: number of priors generated around each position
    :param num_classes: number of object classes
    :return: the convolutional layer that predicts class scores
    """
    return nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)

Priors are generated at every pixel of a feature map, and a predictor's output keeps the w and h of its input, so every output position corresponds to an input position; naturally the predicted offsets are the encoded offsets of the corresponding priors, with out_channels now acting as the feature dimension. Because w, h and num_priors differ from one feature map to another, we must flatten the spatial dims of each output before concatenating them all. Class prediction follows the same idea as offset prediction, differing only in the size of the final feature dimension (the output channels)

  • For training, the number of prediction elements taken from the feature maps must exactly match the number of priors (a one-to-one correspondence)

Finally, the predictions of all feature maps are concatenated together

class PredictionConvolution(nn.Module):
    """
    Convolutions to predict class scores and bounding boxes
    """

    def __init__(self, n_classes):
        """
        :param n_classes: number of different types of objects
        """
        super(PredictionConvolution, self).__init__()
        self.n_classes = n_classes
        # Number of priors, as shown before, at each position of every feature map
        n_boxes = {'conv4_3': 4,
                   'conv7': 6,
                   'conv8_2': 6,
                   'conv9_2': 6,
                   'conv10_2': 4,
                   'conv11_2': 4}
        self.convs = ['conv4_3', 'conv7', 'conv8_2', 'conv9_2', 'conv10_2', 'conv11_2']
        for name, ic in zip(self.convs, [512, 1024, 512, 256, 256, 256]):
            setattr(self, 'cls_%s' % name, cls_predictor(ic, n_boxes[name], n_classes))
            setattr(self, 'loc_%s' % name, loc_predictor(ic, n_boxes[name]))      

        # Initialize convolutions' parameters
        for c in self.children():
            if isinstance(c, nn.Conv2d):
                nn.init.xavier_normal_(c.weight)
                nn.init.constant_(c.bias, 0.)

    def _apply(self, x: torch.Tensor, conv: nn.Conv2d, num_features: int):
        """
        Apply forward calculation for each conv2d with respect to specific feature map
        :param x: input tensor
        :param conv: conv
        :param num_features: number of output features per prior: 4 for loc_pred, n_classes for label_pred
        :return: locations and class scores
        """
        x = conv(x).permute(0, 2, 3, 1).contiguous()
        return x.view(x.size(0), -1, num_features)

    def forward(self, *args):
        # args are feature maps needed for prediction
        assert len(args) == len(self.convs)
        locs = []
        classes_scores = []

        for name, x in zip(self.convs, args):
            classes_scores.append(self._apply(x, getattr(self, 'cls_%s' % name), self.n_classes))
            locs.append(self._apply(x, getattr(self, 'loc_%s' % name), 4))

        locs = torch.cat(locs, dim=1)  # (N, 8732, 4)
        classes_scores = torch.cat(classes_scores, dim=1)  # (N, 8732, n_classes)

        return locs, classes_scores

2.6 SSD300

Combining BaseNet, AuxiliaryConvolutions, and PredictionConvolution gives the SSD300 model; a sketch of the glue code follows
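A minimal sketch of that glue code (my wording; BaseNet is assumed to return the conv4_3 and conv7 feature maps as described in 2.4):

class SSD300(nn.Module):
    def __init__(self, n_classes):
        super(SSD300, self).__init__()
        self.base = BaseNet()
        self.aux_convs = AuxiliaryConvolutions()
        self.pred_convs = PredictionConvolution(n_classes)

    def forward(self, image):
        # image: (N, 3, 300, 300)
        conv4_3, conv7 = self.base(image)
        conv8_2, conv9_2, conv10_2, conv11_2 = self.aux_convs(conv7)
        locs, classes_scores = self.pred_convs(conv4_3, conv7, conv8_2,
                                               conv9_2, conv10_2, conv11_2)
        return locs, classes_scores  # (N, 8732, 4), (N, 8732, n_classes)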

3. Processing the training data

Data augmentation here involves transforming not only the image but also the ground-truth boxes, so we cannot directly use the classes packaged in torchvision.transform; we have to write it by hand ??

The data augmentation used by the author

For the probability of 0.5 mentioned in the paper, just test whether random.random() is below 0.5 before applying an augmentation, for example:
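import random

# e.g. apply the horizontal flip from 3.2 below half of the time
if random.random() < 0.5:
    image, boxes = flip(image, boxes)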

3.1 Random crop

Random cropping is the heart of the paper's data augmentation

def random_crop(image: torch.Tensor, boxes: torch.Tensor, labels: torch.Tensor):
    """
    Random crop: helps the network learn objects at larger scales, but some objects may be cut away entirely
    :param image: image, a tensor of dimensions (3, original_h, original_w)
    :param boxes: ground-truth boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :param labels: ground-truth class labels, a tensor of dimensions (n_objects)
    :return: randomly cropped image, boxes, labels
    """
    original_width = image.size(2)
    original_height = image.size(1)

    while True:
        # 'None' means no cropping; 0 means no IoU constraint; [.1, .3, .5, .7, .9] are the paper's minimum overlaps
        min_overlap = random.choice([0., .1, .3, .5, .7, .9, None])
        if min_overlap is None:
            return image, boxes, labels

        # try up to 50 times for the chosen minimum overlap (not in the paper, but used in the author's repo);
        # if nothing qualifies, loop again and pick a new minimum overlap
        for _ in range(50):
            min_scale = 0.3
            # the paper samples scales in [.1, 1], but the author's repo uses [.3, 1]
            # random.uniform(a, b) -> the closed interval [a, b]
            new_width = int(original_width * random.uniform(min_scale, 1))
            new_height = int(original_height * random.uniform(min_scale, 1))

            # the paper requires the sampled aspect ratio to stay in [0.5, 2]
            if not .5 <= new_height / new_width <= 2:
                continue

            # pick the crop position
            # random.randint(a, b) -> the closed interval [a, b]
            left = random.randint(0, original_width - new_width)
            top = random.randint(0, original_height - new_height)
            right = left + new_width
            bottom = top + new_height

            crop_bounding = torch.FloatTensor([left, top, right, bottom])

            # IoU between the crop and the ground-truth boxes
            over_lap = find_jaccard_overlap(crop_bounding.unsqueeze(0), boxes).squeeze(0)  # (n_objects)

            # the paper requires the overlap with the objects to be > min_overlap
            if over_lap.max().item() < min_overlap:
                continue

            cropped_image = image[:, top:bottom, left:right]

            # criterion for whether an object stays in the image: is the true box's center inside the crop?
            box_centers = (boxes[:, :2] + boxes[:, 2:]) / 2.  # (n_objects, 2)
            center_in_cropped_image = (box_centers[:, 0] > left) * (box_centers[:, 0] < right) * (
                    box_centers[:, 1] > top) * (box_centers[:, 1] < bottom)  # (n_objects)

            # if no object's center lies inside the crop, try again
            if not center_in_cropped_image.any():
                continue

            # drop the objects that fail the criterion
            new_boxes = boxes[center_in_cropped_image]
            new_labels = labels[center_in_cropped_image]

            # compute the box coordinates inside the cropped image
            # clip left/top to the larger of the true box's and the crop's left/top
            new_boxes[:, :2] = torch.max(new_boxes[:, :2], crop_bounding[:2])
            new_boxes[:, :2] -= crop_bounding[:2]
            # clip right/bottom to the smaller of the true box's and the crop's right/bottom
            new_boxes[:, 2:] = torch.min(new_boxes[:, 2:], crop_bounding[2:])
            new_boxes[:, 2:] -= crop_bounding[:2]

            return cropped_image, new_boxes, new_labels

3.2 Horizontal flip

This one is easy: flipping the image itself is trivial, but the ground-truth boxes need extra handling

def flip(image, boxes):
    """
    Flip image horizontally.
    :param image: a PIL Image (torchvision's functional API is used, so a PIL Image is required)
    :param boxes: ground-truth boxes in boundary form, a tensor of dimensions (n_objects, 4)
    :return: flipped image, updated boxes
    """

    # Flip image
    new_image = torchvision.transforms.functional.hflip(image)

    # Flip boxes
    new_boxes = boxes.clone()  # avoid mutating the caller's tensor
    new_boxes[:, 0] = image.width - (boxes[:, 0] + 1)
    new_boxes[:, 2] = image.width - (boxes[:, 2] + 1)
    new_boxes = new_boxes[:, [2, 1, 0, 3]]  # swap xmin/xmax so that xmin <= xmax again

    return new_image, new_boxes

3.3 Resize

SSD300 needs the training images resized to 300 x 300; we also convert the ground-truth boxes to fractional form (\in [0, 1]) here

def resize(image, boxes, size=(300, 300), return_percent_coords=True):
    """
    Resize image. For the SSD300, resize to (300, 300).

    Since percent/fractional coordinates are calculated for the bounding boxes (w.r.t image dimensions) in this process,
    you may choose to retain them.
    :param image: image, a PIL Image
    :param boxes: bounding boxes in boundary coordinates, a tensor of dimensions (n_objects, 4)
    :param size: resize to specific size
    :param return_percent_coords: whether to return new bounding box coordinates in form of percent coordinates
    :return: resized image, updated bounding box coordinates (or fractional coordinates, in which case they remain the same)
    """
    # Resize image
    new_image = transforms.functional.resize(image, size)

    # Resize bounding boxes
    old_size = torch.FloatTensor([image.width, image.height, image.width, image.height]).unsqueeze(0)
    # a pure resize does not change fractional coordinates
    new_boxes = boxes / old_size  # fractional coordinates are invariant to resizing

    if not return_percent_coords:
        new_size = torch.FloatTensor([size[0], size[1], size[0], size[1]]).unsqueeze(0)
        new_boxes = new_boxes * new_size

    return new_image, new_boxes

3.4 Expand

Because the model detects small objects poorly, we zoom the training image out to strengthen small-object detection

The steps closely mirror resize, except that a larger canvas is created, the original image is placed somewhere inside it, and the remaining blank area is filled

The recommended fill value is the per-channel mean of the three channels (see 3.5)

Since the new canvas is larger than the original image, the ground-truth boxes just shift by [offset to the left edge, offset to the top edge, offset to the left edge, offset to the top edge]; a sketch following these steps is given below
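A sketch of expand following those steps (my addition, so the exact sampling range is an assumption):

def expand(image, boxes, filler):
    """
    Place the image at a random position inside a larger canvas filled with `filler`
    :param image: image, a tensor of (3, original_h, original_w)
    :param boxes: boundary-form boxes in pixels, a tensor of (n_objects, 4)
    :param filler: per-channel fill values, e.g. the RGB means from 3.5
    :return: expanded image, updated boxes
    """
    original_h = image.size(1)
    original_w = image.size(2)
    scale = random.uniform(1, 4)  # enlarge the canvas by up to 4x
    new_h = int(scale * original_h)
    new_w = int(scale * original_w)

    # a new canvas filled with the channel means
    filler = torch.FloatTensor(filler)  # (3)
    new_image = torch.ones((3, new_h, new_w), dtype=torch.float) * filler.unsqueeze(1).unsqueeze(1)

    # drop the original image at a random position
    left = random.randint(0, new_w - original_w)
    top = random.randint(0, new_h - original_h)
    new_image[:, top:top + original_h, left:left + original_w] = image

    # shift the boxes by [left, top, left, top]
    new_boxes = boxes + torch.FloatTensor([left, top, left, top]).unsqueeze(0)
    return new_image, new_boxes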

3.5 Normalization

Inputs are first scaled to [0, 1]; the pretrained model additionally expects these scaled inputs to be standardized. This page shows the exact preprocessing used by torchvision.model's pretrained models

mean = [0.485, 0.456, 0.406] # RGB channels
std = [0.229, 0.224, 0.225]  # RGB channels
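Applied with torchvision's functional API (a short sketch):

import torchvision.transforms.functional as FT

image = FT.to_tensor(image)                      # PIL Image -> float tensor in [0, 1]
image = FT.normalize(image, mean=mean, std=std)  # standardize with the statistics above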

4. Dataset and DataLoader

For the Dataset, subclass torch.utils.data.Dataset by hand and apply the Section 3 processing to the images, ground-truth boxes, and labels inside it

The Dataset returns the image, the ground-truth boxes, and the labels

However, reading batches with a DataLoader raises a problem:

the number of objects differs between images, so boxes and labels have different lengths per image and cannot be stacked into batches

We therefore pass a function to the DataLoader's collate_fn= parameter (just the function name) so batches are assembled accordingly

def collate_fn(batch):
    """
    Describes how to combine tensors of different sizes into a batch. We use lists.

    :param batch: an iterable of N sets from __getitem__()
    :return: a tensor of images and lists of varying-size tensors of bounding boxes and labels
    """

    images = list()
    boxes = list()
    labels = list()

    for b in batch:
        images.append(b[0])
        boxes.append(b[1])
        labels.append(b[2])

    images = torch.stack(images, dim=0)

    return images, boxes, labels  # tensor (N, 3, 300, 300), 2 lists of N tensors each
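Hooking it up (a sketch; `train_dataset` stands for the Dataset subclass described above):

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True,
                                           collate_fn=collate_fn,  # pass the function itself
                                           num_workers=4, pin_memory=True)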

5. Training

5.1 Loss Function

location_loss = nn.L1Loss()
confidence_loss = nn.CrossEntropyLoss(reduction='none')

5.2 Hard negative mining

Because negatives (background) vastly outnumber positives in the training data, the classes are severely unbalanced; hence hard negative mining: keep only the negatives with the largest loss, at a negative:positive ratio of 3:1

def calculate_loss(priors_cxcy, pred_locs, pred_scores, boxes, labels, loc_loss, conf_loss, alpha=1):
    """
    Compute the loss with hard negative mining
    :param priors_cxcy: priors in center form
    :param pred_locs: predicted offsets for a batch
    :param pred_scores: predicted class scores for a batch
    :param boxes: ground-truth boxes, from a batch of the dataloader
    :param labels: ground-truth class labels, from a batch of the dataloader
    :param loc_loss: nn.L1Loss()
    :param conf_loss: nn.CrossEntropyLoss(reduction='none')
    :param alpha: weight of the location loss from the paper, default 1
    :return: the total loss
    """
    n_priors = priors_cxcy.size(0)
    batch_size = pred_locs.size(0)
    n_classes = pred_scores.size(2)

    assert n_priors == pred_scores.size(1) == pred_locs.size(1)
    true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device)  # (N, 8732, 4)
    true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device)  # (N, 8732)

    # assign ground truth to every prior, image by image
    for i in range(batch_size):
        cls, loc = label_prior(priors_cxcy, boxes[i], labels[i])
        true_locs[i] = loc
        true_classes[i] = cls

    positive_priors = (true_classes != 0)  # (N, 8732)

    # location loss: computed over positive (non-background) priors only
    loss_of_loc = loc_loss(pred_locs[positive_priors], true_locs[positive_priors])

    # confidence loss

    # pick negatives at negative:positive = 3:1, as in the paper
    n_hard_negative = 3 * positive_priors.sum(dim=1)  # (N)

    # first compute the confidence loss over all priors, positive and negative,
    # which saves us from tracking positions across the different images
    # CrossEntropyLoss(reduction='none') keeps the per-element losses instead of summing or averaging

    loss_of_conf_all = conf_loss(pred_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 8732)
    loss_of_conf_all = loss_of_conf_all.view(batch_size, n_priors)  # (N, 8732)

    # we already know the loss of every positive prior
    loss_of_conf_pos = loss_of_conf_all[positive_priors]  # (sum(n_positives))

    loss_of_conf_neg = loss_of_conf_all.clone()  # (N, 8732)
    loss_of_conf_neg[positive_priors] = 0  # (N, 8732), so positives can never rank among the top n_hard_negative
    loss_of_conf_neg, _ = loss_of_conf_neg.sort(dim=1, descending=True)  # sort negatives by descending loss
    neg_ranks = torch.LongTensor(range(n_priors)).unsqueeze(0).expand_as(loss_of_conf_neg).to(device)  # (N, 8732), rank within each row
    hard_negatives = (neg_ranks < n_hard_negative.unsqueeze(1))  # (N, 8732)
    loss_of_conf_hard_neg = loss_of_conf_neg[hard_negatives]  # (sum(n_hard_negatives))

    # As in the paper, averaged over positive priors only, although computed over both positive and hard-negative priors
    loss_of_conf = (loss_of_conf_pos.sum() + loss_of_conf_hard_neg.sum()) / positive_priors.sum().float()  # (), scalar

    # TOTAL LOSS

    return loss_of_conf + alpha * loss_of_loc
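Putting sections 2-5 together, a minimal training-loop sketch (the hyper-parameters are my assumptions, not the paper's exact schedule; priors_cxcy is the tensor from 1.1):

model = SSD300(n_classes=21).to(device)  # 20 VOC classes + background
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
location_loss = nn.L1Loss()
confidence_loss = nn.CrossEntropyLoss(reduction='none')

n_epochs = 45
for epoch in range(n_epochs):
    for images, boxes, labels in train_loader:
        images = images.to(device)
        boxes = [b.to(device) for b in boxes]
        labels = [l.to(device) for l in labels]

        pred_locs, pred_scores = model(images)  # (N, 8732, 4), (N, 8732, 21)
        loss = calculate_loss(priors_cxcy, pred_locs, pred_scores,
                              boxes, labels, location_loss, confidence_loss)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()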

6. Detection

6.1 Non-maximum suppression

At detection time we don't want to output too many predicted boxes (at this point they overlap heavily), so we apply non-maximum suppression: boxes considered duplicates (IoU between predictions above a given threshold) are removed, and only the most confident box is kept

def none_max_suppress(priors_cxcy, pred_locs, pred_scores, min_score, max_overlap, top_k):
    """
    Perform non-maximum suppression
    :param priors_cxcy: priors in center form
    :param pred_locs: predicted offsets, output of the predictors
    :param pred_scores: predicted scores, output of the predictors
    :param min_score: minimum score to accept a detection
    :param max_overlap: maximum IoU above which a box is suppressed
    :param top_k: keep at most top_k detections
    :return: suppressed boundary-form boxes, class labels, and scores
    """
    batch_size = pred_locs.size(0)
    n_priors = priors_cxcy.size(0)
    n_classes = pred_scores.size(2)

    pred_scores = torch.softmax(pred_scores, dim=2)  # (batch_size, n_priors, n_classes)

    assert n_priors == pred_scores.size(1) == pred_locs.size(1)

    boxes_all_image = []
    scores_all_image = []
    labels_all_image = []

    for i in range(batch_size):
        # decode the predicted offsets into boundary-form boxes
        boxes = cxcy_to_xy(gcxgcy_to_cxcy(pred_locs[i], priors_cxcy))  # (n_priors, 4)

        boxes_per_image = []
        scores_per_image = []
        labels_per_image = []

        for c in range(1, n_classes):
            class_scores = pred_scores[i, :, c]  # (8732)
            score_above_min = class_scores > min_score
            n_score_above_min = score_above_min.sum().item()

            if n_score_above_min == 0:
                continue

            # keep only predictions with score > min_score
            class_scores = class_scores[score_above_min]
            class_boxes = boxes[score_above_min]

            # sort by detection confidence
            class_scores, sorted_ind = class_scores.sort(dim=0, descending=True)  # (n_score_above_min)
            class_boxes = class_boxes[sorted_ind]  # (n_score_above_min, 4)

            # non-maximum suppression by IoU
            overlap = find_jaccard_overlap(class_boxes, class_boxes)  # (n_score_above_min, n_score_above_min)

            # mask recording which boxes are suppressed; True means suppressed
            suppress = torch.zeros((n_score_above_min), dtype=torch.bool).to(device)

            for b_id in range(n_score_above_min):
                # skip boxes already marked as suppressed
                if suppress[b_id]:
                    continue
                # suppress boxes whose IoU with the current box is > max_overlap, keeping earlier suppressions
                suppress = suppress | (overlap[b_id] > max_overlap)
                # never suppress the current box itself (its IoU with itself is 1)
                suppress[b_id] = False

            # for each class, store only the unsuppressed predictions
            boxes_per_image.append(class_boxes[~suppress])
            scores_per_image.append(class_scores[~suppress])
            labels_per_image.append(torch.LongTensor([c] * (~suppress).sum().item()).to(device))

        # if no class was detected in the image, label the whole image as background
        if len(labels_per_image) == 0:
            boxes_per_image.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
            labels_per_image.append(torch.LongTensor([0]).to(device))
            scores_per_image.append(torch.FloatTensor([0.]).to(device))

        boxes_per_image = torch.cat(boxes_per_image, dim=0)  # (n_objects, 4)
        scores_per_image = torch.cat(scores_per_image, dim=0)  # (n_objects)
        labels_per_image = torch.cat(labels_per_image, dim=0)  # (n_objects)
        n_object = boxes_per_image.size(0)

        # keep only the top_k most confident detections
        if n_object > top_k:
            scores_per_image, sorted_ind = scores_per_image.sort(dim=0, descending=True)
            scores_per_image = scores_per_image[:top_k]
            boxes_per_image = boxes_per_image[sorted_ind][:top_k]
            labels_per_image = labels_per_image[sorted_ind][:top_k]

        boxes_all_image.append(boxes_per_image)
        scores_all_image.append(scores_per_image)
        labels_all_image.append(labels_per_image)

    return boxes_all_image, labels_all_image, scores_all_image  # lists of length batch_size
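Running detection on a single image (a sketch; the thresholds min_score=0.2, max_overlap=0.45, top_k=200 are common choices of mine, not values from this post):

model.eval()
with torch.no_grad():
    pred_locs, pred_scores = model(image.unsqueeze(0).to(device))  # image: (3, 300, 300), normalized
    boxes, labels, scores = none_max_suppress(priors_cxcy, pred_locs, pred_scores,
                                              min_score=0.2, max_overlap=0.45, top_k=200)
# boxes[0] holds the first image's fractional boxes; scale them back to pixel
# coordinates before drawing them with show_box() from Section 1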

Extras: a few points to note

  • We concatenate the outputs of all prediction feature maps into one tensor, but the conv4_3 feature map sits low in the network and its feature values are much larger than those of the higher layers (downsampling shrinks the feature responses), so we can choose to normalize its feature map (e.g., L2 normalization) and then rescale the responses by a factor the network learns itself. I think Batch Normalization would work here as well.

  • Indexing a multi-dimensional tensor with a dtype=torch.bool mask (or torch.uint8, whose indexing is deprecated since at least 1.3.0) returns a flattened result (note: this holds when the bool mask matches the original tensor element for element; if it only covers the leading dimensions, the remaining dimensions are kept (even when only one sub-array remains), whereas slicing squeezes a dimension that is down to a single sub-array), e.g.

    x = torch.rand((2, 3, 4))  # suppose half of the values are > 0.5
    y = x > 0.5  # y has shape (2, 3, 4); half True, half False
    print(x[y].shape)  # torch.Size([12])
    
    
  • Some tricks to speed up training:

    torch.backends.cudnn.benchmark = True

    set the dataloader's pin_memory=True to use page-locked (pinned) host memory, which is never swapped out and so speeds up transfers to the GPU; it requires enough memory. More details: https://blog.csdn.net/tfcy694/article/details/83270701

  • No eval function is used here to measure the model's actual quality; mAP is a good choice. When saving the best model, consider keeping the parameters whenever the eval metric improves, and the same metric can also drive early stopping of the epochs

I'm new at this, please follow along ??; writing it all by hand isn't easy, and discussion is welcome

Please credit the source when reposting.


