Object Detection Algorithms: SSD Code Walkthrough (A Detailed Long Read)

Preface

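Earlier posts already introduced the SSD algorithm, and I think the principles were explained reasonably clearly, but you cannot really understand an algorithm without reading its code. This article is therefore a code walkthrough built on the previous post about SSD's principles, which is available at https://mp.weixin.qq.com/s/lXqobT45S1wz-evc7KO5DA. The SSD source analyzed here is a very popular PyTorch implementation on GitHub with 3K+ stars: https://github.com/amdegroot/ssd.pytorch/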

Network Structure

To make it easier to map the code onto SSD's structure, we first show the SSD network architecture:

[Figure: SSD network architecture]

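The original SSD uses VGG-16 as its backbone. To show more clearly what SSD changes relative to VGG16, I borrow a very clear diagram from a Zhihu post; the original is at https://zhuanlan.zhihu.com/p/79854543 . It compares the SSD backbone with VGG16 and annotates the feature map dimensions:
[Figure: SSD backbone compared with VGG16, annotated with feature map dimensions]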

Source Code Walkthrough

OK, let's now dig into the SSD source. If you understand three things, namely how the network is built, how the anchors are generated, and how the loss function works, you understand this code base.

Building the Network

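The figure above shows that, with VGG16 as the backbone, the fully connected layers of VGG16 after conv5 are dropped and replaced by two convolutional layers of 1024\times 3\times 3 and 1024\times 1\times 1. The max pooling layer in front of the conv4-1 convolution uses ceil_mode=True, so its output feature map is 38\times 38 (75/2 is rounded up to 38 rather than down to 37). The max pooling layer after conv5-3 uses (kernel_size=3, stride=1, padding=1) and therefore does not downsample. Appending the 4 extra multi-scale convolutional blocks after fc7 gives the complete SSD network. The modified VGG16 code, from ssd.py, is: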

import torch.nn as nn


def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers

This matches the figure above exactly. The conv7 at the end of the code is the fc7 in the figure, with feature dimensions [None,1024,19,19].
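To see where 38 and 19 come from, here is a small sketch that traces the spatial size through the pooling layers and the two replacement convolutions. The helper names pool_out and conv_out are my own, not part of the repository, and the formulas are the usual simplified output-size rules; the 3x3 convolutions with padding=1 keep the size unchanged, so they are skipped.

```python
import math

def pool_out(n, kernel, stride, padding=0, ceil_mode=False):
    """Output length of a pooling layer along one spatial dimension."""
    span = (n + 2 * padding - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1

def conv_out(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    effective = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective) // stride + 1

size = 300
trace = [size]
for marker in ['M', 'M', 'C', 'M']:          # the four stride-2 pools in base['300']
    size = pool_out(size, 2, 2, ceil_mode=(marker == 'C'))
    trace.append(size)
size = pool_out(size, 3, 1, padding=1)        # pool5 keeps the resolution
size = conv_out(size, 3, padding=6, dilation=6)   # conv6 (dilated 3x3)
size = conv_out(size, 1)                      # conv7 (1x1)
trace.append(size)
print(trace)  # [300, 150, 75, 38, 19, 19]
```

Without ceil_mode the third pool would give pool_out(75, 2, 2) = 37, which is why the 'C' marker exists in the config.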
Now we can build the multi-scale layers that follow, the Extra Feature Layers in the architecture diagram. Here is that part cropped from the opening figure, for comparison with the code.

[Figure: the Extra Feature Layers part of the SSD architecture]

The implementation (also from ssd.py):

def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False  # toggles kernel_size between 1 and 3
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                           kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers
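The 'S' markers and the alternating flag are easy to misread, so here is a torch-free sketch that mirrors the loop above and lists what each layer would be. extras_spec is a hypothetical helper of mine that returns plain tuples instead of nn.Conv2d objects; the starting size of 19 is the fc7 output mentioned earlier.

```python
def extras_spec(cfg, in_channels):
    """Mirror add_extras, returning (in, out, kernel, stride, padding) tuples."""
    specs, use_3x3 = [], False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':   # 'S' marks a stride-2 layer whose width is cfg[k + 1]
                specs.append((in_channels, cfg[k + 1], (1, 3)[use_3x3], 2, 1))
            else:
                specs.append((in_channels, v, (1, 3)[use_3x3], 1, 0))
            use_3x3 = not use_3x3
        in_channels = v
    return specs

cfg = [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256]   # extras['300']
specs = extras_spec(cfg, 1024)

size, source_maps = 19, []          # fc7 output is 19x19
for idx, (c_in, c_out, kernel, stride, pad) in enumerate(specs):
    size = (size + 2 * pad - kernel) // stride + 1
    if idx % 2 == 1:                # every second layer feeds the detection heads
        source_maps.append((c_out, size))
print(len(specs))    # 8
print(source_maps)   # [(512, 10), (256, 5), (256, 3), (256, 1)]
```

The four (channels, size) pairs are the 10x10, 5x5, 3x3, and 1x1 feature maps that join conv4_3 and fc7 as detection sources.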

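Besides the modified VGG16 and the Extra Layers, the architecture diagram has 6 horizontal lines. They represent convolutions applied to the 6 feature maps of different scales to produce the box regression (loc) and class (cls) predictions. Note that SSD treats background as a class too, so for the VOC dataset the number of classes is 20+1=21. The code for this part is: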

def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []   # regression heads, one per scale
    conf_layers = []  # classification heads, one per scale
    # Part 1: VGG sources, conv4_3 (index 21) and conv7 (index -2)
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        # regression: boxes * 4 (coordinates)
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        # confidence: boxes * num_classes
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                        cfg[k] * num_classes, kernel_size=3, padding=1)]
    # Part 2: cfg entries from index 2 onward give the box counts; the sources are extra layers 1, 3, 5, 7
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)
# quick test
if __name__  == "__main__":
    vgg, extra_layers, (l, c) = multibox(vgg(base['300'], 3),
                                         add_extras(extras['300'], 1024),
                                         [4, 6, 6, 6, 4, 4], 21)
    print(nn.Sequential(*l))
    print('---------------------------')
    print(nn.Sequential(*c))

The output in a Jupyter notebook is:

'''
loc layers: 
'''
Sequential(
  (0): Conv2d(512, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Conv2d(1024, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (2): Conv2d(512, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(256, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
---------------------------
'''
conf layers: 
''' 
Sequential(
  (0): Conv2d(512, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Conv2d(1024, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (2): Conv2d(512, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(256, 126, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Conv2d(256, 84, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)

Anchor Generation (PriorBox Layer)

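This was covered in the earlier post on SSD's principles, but a recap is useful. Starting from conv4_3 of the modified VGG16, SSD uses 6 feature maps of different sizes: (38,38), (19,19), (10,10), (5,5), (3,3), (1,1), and the number of prior boxes (anchors) set on each feature map differs. A prior box is defined by its scale and its aspect ratio. The scale follows:
s_k=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1),\quad k\in [1,m]
where m is the number of feature maps, here 5, because the anchors of the first layer conv4_3 are set separately. s_k is the prior box size as a fraction of the input image size, and s_{min} and s_{max} are the minimum and maximum of that fraction, 0.2 and 0.9 in the paper.
For the first feature map, the scale is set to s_{min}/2=0.1, so its size is 300\times 0.1=30. Plugging the later feature maps into the formula and mapping back to the 300-pixel input gives sizes s_k of {60,111,162,213,264} for the remaining 5 feature maps. (The values are 111, 162, ... rather than 112.5, 165, ... because the scales are computed in whole percentages with a step of \lfloor(90-20)/4\rfloor=17.) Altogether the 6 feature maps have sizes s_k of {30,60,111,162,213,264}. With the scales fixed, the aspect ratios come next; the paper generally uses a_r=\{1,2,3,\frac{1}{2},\frac{1}{3}\}. From the scale and the aspect ratio, the width and height of a prior box are:
w_k^a=s_k\sqrt{a_r},\quad h_k^a=s_k/\sqrt{a_r}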
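The sizes above can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that the scales are generated in whole percentages with an integer step, as I understand the original Caffe training script to do; the variable names are mine.

```python
import math

min_dim = 300                      # input size of SSD300
min_ratio, max_ratio = 20, 90      # s_min and s_max as whole percentages
num_later_maps = 5                 # feature maps after conv4_3

# Integer step between consecutive scales: floor((90 - 20) / (5 - 1)) = 17
step = (max_ratio - min_ratio) // (num_later_maps - 1)

min_sizes, max_sizes = [], []
for ratio in range(min_ratio, max_ratio + 1, step):
    min_sizes.append(min_dim * ratio // 100)
    max_sizes.append(min_dim * (ratio + step) // 100)

# conv4_3 is set separately with scale s_min / 2 = 10%
min_sizes = [min_dim * 10 // 100] + min_sizes
max_sizes = [min_dim * 20 // 100] + max_sizes

def box_wh(s_k, a_r):
    """Width and height of a prior with scale s_k and aspect ratio a_r."""
    return s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)

print(min_sizes)  # [30, 60, 111, 162, 213, 264]
print(max_sizes)  # [60, 111, 162, 213, 264, 315]
w, h = box_wh(60, 2)   # roughly 84.85 x 42.43; the area stays 60 * 60
```

max_sizes[k] plays the role of s_{k+1}, which is where the extra value 315 for the last feature map comes from.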
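A few points are worth noting:

  • The s_k values listed above are in pixels of the input image.
  • By default, besides the 5 aspect-ratio anchors above, each feature map also gets one prior box with scale s_k^{'}=\sqrt{s_k s_{k+1}} and a_r=1, so every feature map has two square prior boxes of aspect ratio 1 but different sizes. For the last feature map, s_{m+1}=315 is used to compute s_m^{'}.
  • In the implementation, the conv4_3, conv10_2, and conv11_2 layers use only 4 prior boxes each, dropping the anchors with aspect ratios 3 and \frac{1}{3}.
  • The prior box centers lie at the center of each cell:
    (\frac{i+0.5}{|f_k|},\frac{j+0.5}{|f_k|}),\quad i,j\in[0,|f_k|), where |f_k| is the size of the feature map.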

Judging by the anchor sizes, earlier feature maps have smaller anchors, which makes them better suited to small objects. The total number of prior boxes is num_priors = 38x38x4+19x19x6+10x10x6+5x5x6+3x3x4+1x1x4=8732.

The code that generates the prior boxes (from layers/functions/prior_box.py):

from itertools import product
from math import sqrt

import torch


class PriorBox(object):
    """Compute priorbox coordinates in center-offset form for each source
    feature map.
    """
    def __init__(self, cfg):
        super(PriorBox, self).__init__()
        self.image_size = cfg['min_dim']
        # number of priors for feature map location (either 4 or 6)
        self.num_priors = len(cfg['aspect_ratios'])
        self.variance = cfg['variance'] or [0.1]
        self.feature_maps = cfg['feature_maps']
        self.min_sizes = cfg['min_sizes']
        self.max_sizes = cfg['max_sizes']
        self.steps = cfg['steps']
        self.aspect_ratios = cfg['aspect_ratios']
        self.clip = cfg['clip']
        self.version = cfg['name']
        for v in self.variance:
            if v <= 0:
                raise ValueError('Variances must be greater than 0')

    def forward(self):
        mean = []
        # iterate over the multi-scale feature maps: [38, 19, 10, 5, 3, 1]
        for k, f in enumerate(self.feature_maps):
            # iterate over every cell
            for i, j in product(range(f), repeat=2):
                # effective size of the k-th feature map
                f_k = self.image_size / self.steps[k]
                # center of each box
                cx = (j + 0.5) / f_k
                cy = (i + 0.5) / f_k

                # aspect ratio 1 yields two boxes
                # r == 1, size = s_k, square
                s_k = self.min_sizes[k]/self.image_size
                mean += [cx, cy, s_k, s_k]

                # r == 1, size = sqrt(s_k * s_(k+1)), square
                s_k_prime = sqrt(s_k * (self.max_sizes[k]/self.image_size))
                mean += [cx, cy, s_k_prime, s_k_prime]

                # ratio != 1 yields rectangular boxes
                for ar in self.aspect_ratios[k]:
                    mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)]
                    mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)]
        # convert to a torch Tensor
        output = torch.Tensor(mean).view(-1, 4)
        # clip the output to [0, 1]
        if self.clip:
            output.clamp_(max=1, min=0)
        return output
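As a sanity check on the 8732 figure, the same loop can be run without torch. The config values below are assumptions on my part: the feature map sizes and scales come from the text above, while steps and aspect_ratios are what I believe the VOC config of the repository contains.

```python
from itertools import product
from math import sqrt

# Assumed VOC SSD300 settings (steps and aspect_ratios are not listed in this article)
feature_maps = [38, 19, 10, 5, 3, 1]
steps = [8, 16, 32, 64, 100, 300]
min_sizes = [30, 60, 111, 162, 213, 264]
max_sizes = [60, 111, 162, 213, 264, 315]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
image_size = 300

priors, per_map = [], []
for k, f in enumerate(feature_maps):
    before = len(priors)
    f_k = image_size / steps[k]
    for i, j in product(range(f), repeat=2):
        cx, cy = (j + 0.5) / f_k, (i + 0.5) / f_k
        s_k = min_sizes[k] / image_size
        s_k_prime = sqrt(s_k * max_sizes[k] / image_size)
        priors.append((cx, cy, s_k, s_k))                  # small square
        priors.append((cx, cy, s_k_prime, s_k_prime))      # large square
        for ar in aspect_ratios[k]:                        # two rectangles per ratio
            priors.append((cx, cy, s_k * sqrt(ar), s_k / sqrt(ar)))
            priors.append((cx, cy, s_k / sqrt(ar), s_k * sqrt(ar)))
    per_map.append(len(priors) - before)

print(per_map)      # [5776, 2166, 600, 150, 36, 4]
print(len(priors))  # 8732
```

Each ratio in aspect_ratios contributes both r and 1/r, so a list of length 1 gives 4 boxes per cell and a list of length 2 gives 6.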

Overall Network Structure

Combining the modified VGG16, the Extra Layers, and the PriorBox anchor generation described above, the overall SSD structure can be written as follows (code in ssd.py):

class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers.  Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, size, base, extras, head, num_classes):
        super(SSD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        # select the config (voc when num_classes == 21, otherwise coco)
        self.cfg = (coco, voc)[num_classes == 21]
        # initialize the prior boxes
        self.priorbox = PriorBox(self.cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        self.size = size

        # SSD network
        # backbone
        self.vgg = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)
        # regression and classification heads
        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        if phase == 'test':
            self.softmax = nn.Softmax(dim=-1)
            self.detect = Detect(num_classes, 0, 200, 0.01, 0.45)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.
        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].
        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]
            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.vgg[k](x)
        # L2 normalization
        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        # apply extra layers and cache source layer outputs
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                # keep the outputs that feed the multi-scale heads
                sources.append(x)

        # apply multibox head to source layers
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)
        if self.phase == "test":
            output = self.detect(
                loc.view(loc.size(0), -1, 4),                   # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                             self.num_classes)),                # conf preds
                self.priors.type(type(x.data))                  # default boxes
            )
        else:
            output = (
                # loc output, size: (batch, 8732, 4)
                loc.view(loc.size(0), -1, 4),
                # conf output, size: (batch, 8732, 21)
                conf.view(conf.size(0), -1, self.num_classes),
                # all prior boxes, size: (8732, 4)
                self.priors
            )
        return output
    # load model weights
    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        # `ext == '.pkl' or '.pth'` is always true; test membership instead
        if ext in ('.pkl', '.pth'):
            print('Loading weights into state dict...')
            self.load_state_dict(torch.load(base_file,
                                 map_location=lambda storage, loc: storage))
            print('Finished!')
        else:
            print('Sorry only .pth and .pkl files supported.')

Then, for readability, the construction is wrapped up once more:

base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}


def build_ssd(phase, size=300, num_classes=21):
    if phase != "test" and phase != "train":
        print("ERROR: Phase: " + phase + " not recognized")
        return
    if size != 300:
        print("ERROR: You specified size " + repr(size) + ". However, " +
              "currently only SSD300 (size=300) is supported!")
        return
    # call multibox to build vgg, extras, and head
    base_, extras_, head_ = multibox(vgg(base[str(size)], 3),
                                     add_extras(extras[str(size)], 1024),
                                     mbox[str(size)], num_classes)
    return SSD(phase, size, base_, extras_, head_, num_classes)

Loss Walkthrough

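SSD's loss function has two parts, a localization loss L_{loc} and a classification loss L_{conf}. The full loss is:
L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))
where N is the number of positive prior boxes, c is the predicted class confidence, l is the predicted bounding box for each prior box, g is the position of the ground truth box, and x is the indicator of which prior box is matched to which ground truth box. The localization loss is Smooth L1, computed on encoded box values; the encoding is explained later. For the classification loss, hard negative mining first samples the negatives so that the positive-to-negative ratio is 1:3. The sampling works as follows: for each image in the batch, sort the confidence errors in descending order and keep the top_k negatives. The loss function is summarized in the figure below:

[Figure: SSD loss function]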

Implementation Steps

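  • Reshape the conf of the whole batch, i.e. batch_conf = conf_data.view(-1, self.num_classes) in the code, to make the later sorting easier.
  • A larger confidence error simply means a smaller predicted confidence for background.
  • Apply logsoftmax to all conf values (the results are all negative). The smaller the predicted confidence, the smaller the logsoftmax and the larger its absolute value |logsoftmax|; sort -logsoftmax in descending order and take the top_k negatives.
    The log_sum_exp function is: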
def log_sum_exp(x):
    x_max = x.detach().max()
    return torch.log(torch.sum(torch.exp(x-x_max), 1, keepdim=True))+x_max

The classification loss conf_logP is computed as:

conf_logP = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

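It is computed this way mainly to make the logsoftmax loss numerically stable. Here is my hand derivation:

[Figure: hand derivation of the log-sum-exp form of the softmax loss]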
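The identity behind conf_logP is -log softmax(x)[t] = log(sum_j exp(x_j)) - x_t, and subtracting a constant before exponentiating leaves the result unchanged while avoiding overflow. Below is a minimal pure-Python sketch of that idea; it shifts by the per-row maximum, whereas the log_sum_exp above shifts by the global maximum of the tensor, and both are valid shifts. softmax_loss is a name I made up for illustration.

```python
import math

def log_sum_exp(row):
    """Stable log(sum(exp(row))) for one row of logits."""
    m = max(row)
    return math.log(sum(math.exp(v - m) for v in row)) + m

def softmax_loss(row, target):
    """-log softmax(row)[target], written as log_sum_exp(row) - row[target]."""
    return log_sum_exp(row) - row[target]

# The shifted form agrees with the textbook definition on ordinary logits
logits = [2.0, 1.0, 0.1]
probs = [math.exp(v) / sum(math.exp(u) for u in logits) for v in logits]
assert abs(softmax_loss(logits, 0) - (-math.log(probs[0]))) < 1e-12

# Large logits overflow the naive form but not the shifted one
big = [1000.0, 999.0]
try:
    math.log(sum(math.exp(v) for v in big))
    overflowed = False
except OverflowError:
    overflowed = True
stable = softmax_loss(big, 0)   # equals log(1 + exp(-1))
```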

The full loss implementation, from layers/modules/multibox_loss.py:

class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)
            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)  # batch size
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))  # number of prior boxes
        num_classes = self.num_classes  # number of classes

        # match priors (default boxes) and ground truth boxes
        # loc_t and conf_t hold the target box and class for every prior
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data  # ground truth boxes
            labels = targets[idx][:, -1].data  # ground truth class labels
            defaults = priors.data  # prior boxes
            # match priors to ground truth
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)
        # mask of all matched (positive) priors, shape [b, M]
        pos = conf_t > 0
        num_pos = pos.sum(dim=1, keepdim=True)
        # Localization loss (Smooth L1)
        # shape [b, M] --> shape [b, M, 4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # predicted boxes of the positives
        loc_t = loc_t[pos_idx].view(-1, 4)  # encoded ground truth of the positives
        loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')  # Smooth L1 loss

        '''
        Goal:
            hard negative mining
        Steps:
            1. For each image, sort the conf errors in descending order (the lower
               the predicted background confidence, the larger the error);
            2. Every negative is labeled background, so -log softmax of the
               background class gives the error: the larger it is, the lower
               the predicted background probability;
            3. Keep the top_k negatives with the largest errors so that the
               positive:negative ratio is close to 1:3.
        '''
        # Compute max conf across batch for hard negative mining
        # shape [b*M, num_classes]
        batch_conf = conf_data.view(-1, self.num_classes)
        # per-prior -log softmax of the target class, shape [b*M, 1]
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        # reshape to [b, M] first so that the [b, M] mask `pos` can index it
        loss_c = loss_c.view(num, -1)
        loss_c[pos] = 0  # zero out the positives; the rest are negative candidates
        # sorting twice gives each element's rank in the descending order
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        # number of positives per image, shape [b, 1]
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        # keep the top num_neg negatives, shape [b, M]
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        # shape [b, M] --> shape [b, M, num_classes]
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        # gather the selected positives and negatives (predictions and targets)
        conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos+neg).gt(0)]
        # cross entropy over the selected priors
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        # N: total number of positives
        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c
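The two consecutive sort calls are the least obvious part of the loss. The first sort gives the order of the priors by descending loss; the second sort inverts that permutation, so each prior learns its own rank. A small list-based sketch of the same trick, with made-up loss values and a hypothetical helper name:

```python
def ranks_descending(values):
    """Rank of each element when values are sorted in descending order."""
    # first sort: indices of the elements from largest to smallest
    order = sorted(range(len(values)), key=lambda i: -values[i])
    # second sort: argsort of that permutation, i.e. its inverse
    return sorted(range(len(order)), key=lambda i: order[i])

# Per-prior conf loss for one image; index 1 is a positive, so it was zeroed
losses = [0.2, 0.0, 1.5, 0.7, 0.1, 0.9]
positives = [False, True, False, False, False, False]

negpos_ratio = 3
num_neg = min(negpos_ratio * sum(positives), len(losses) - 1)
rank = ranks_descending(losses)
neg = [r < num_neg for r in rank]

print(rank)  # [3, 5, 0, 2, 4, 1]
print(neg)   # [False, False, True, True, False, True]
```

With one positive and a 3:1 ratio, the three priors with the largest loss (indices 2, 5, 3) become the hard negatives.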

Prior Box Matching Strategy

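One piece of the code above has not been explained yet: the match function, which implements SSD's prior box matching. During training we first have to decide which prior boxes each ground truth in the image is matched to; the bounding boxes predicted from the matched prior boxes are then responsible for predicting it. SSD matches prior boxes to ground truths by two rules. First, for every ground truth in the image, find the prior box with the largest IOU and match the two, which guarantees that every ground truth is matched to some prior. Second, for each remaining unmatched prior box, if its IOU with some ground truth exceeds a threshold (usually 0.5), that prior is matched to that ground truth; the prior boxes still unmatched are negatives. (If several ground truths exceed the threshold for one prior box, the prior is matched only to the one with the largest IOU.) The implementation, from layers/box_utils.py: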

def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx):
    """Match each prior box with the ground truth box of highest IOU, encode
    the bounding boxes, and store the matched targets in loc_t and conf_t.
    Args:
        threshold: IOU threshold; priors below it are labeled background
        truths: ground truth boxes, shape [N,4]
        priors: prior boxes, shape [M,4]
        variances: variances of the priors, list(float)
        labels: class labels of all objects in the image, shape [num_obj]
        loc_t: tensor to be filled with the encoded loc targets
        conf_t: tensor to be filled with the conf targets
        idx: current batch index
    Return:
        The matched indices corresponding to 1)location and 2)confidence preds.
    """
    # jaccard index (IOU)
    overlaps = jaccard(
        truths,
        point_form(priors)
    )
    # (Bipartite Matching)
    # [num_objects, 1]: best prior for each ground truth box
    best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True)
    # [1, num_priors]: best ground truth box for each prior
    best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True)
    best_truth_idx.squeeze_(0)  # M
    best_truth_overlap.squeeze_(0)  # M
    best_prior_idx.squeeze_(1)  # N
    best_prior_overlap.squeeze_(1)  # N
    # make sure every ground truth keeps its best prior: overlap 2 > threshold
    best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior
    # TODO refactor: index  best_prior_idx with long tensor
    # ensure every gt matches with its prior of max overlap:
    # point best_truth_idx at gt j for the prior in best_prior_idx[j]
    for j in range(best_prior_idx.size(0)):
        best_truth_idx[best_prior_idx[j]] = j
    matches = truths[best_truth_idx]          # matched ground truth box per prior, Shape: [M,4]
    conf = labels[best_truth_idx] + 1         # class of the matched ground truth, Shape: [M]
    # priors with iou < threshold are labeled background (0)
    conf[best_truth_overlap < threshold] = 0  # label as background
    # encode the boxes
    loc = encode(matches, priors, variances)
    # store the matched loc and conf in loc_t and conf_t
    loc_t[idx] = loc    # [num_priors,4] encoded offsets to learn
    conf_t[idx] = conf  # [num_priors] top class label for each prior
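The two matching rules are easier to follow on a tiny IoU matrix. The sketch below is a list-based illustration of the same logic, not the repository's function: match_priors and the overlap values are made up, and -1 stands in for background.

```python
def match_priors(overlaps, threshold=0.5):
    """overlaps[t][p] is the IoU of ground truth t with prior p.
    Returns, for every prior, the matched ground truth index or -1 (background)."""
    num_truths, num_priors = len(overlaps), len(overlaps[0])
    # rule 2: the best ground truth for each prior
    best_truth_idx = [max(range(num_truths), key=lambda t: overlaps[t][p])
                      for p in range(num_priors)]
    best_truth_overlap = [overlaps[best_truth_idx[p]][p] for p in range(num_priors)]
    # rule 1: every ground truth keeps its best prior, whatever the IoU
    for t in range(num_truths):
        p = max(range(num_priors), key=lambda q: overlaps[t][q])
        best_truth_idx[p] = t
        best_truth_overlap[p] = 2.0   # any value above the threshold
    return [t if ov >= threshold else -1
            for t, ov in zip(best_truth_idx, best_truth_overlap)]

overlaps = [
    [0.10, 0.60, 0.70, 0.00],   # ground truth 0 against priors 0..3
    [0.30, 0.20, 0.10, 0.05],   # ground truth 1 against priors 0..3
]
print(match_priors(overlaps))  # [1, 0, 0, -1]
```

Prior 0 is matched to ground truth 1 even though their IoU is only 0.3, because it is that ground truth's best prior; prior 1 is matched through the threshold rule; prior 3 stays background.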

Box Coordinate Conversion

The code above uses a function called point_form. What is it for? A box can be represented in 2 ways:

  • (x_{min},y_{min},x_{max},y_{max})
  • (x,y,w,h)

The conversion code is in layers/box_utils.py:
def point_form(boxes):
    """Convert prior boxes from (cx, cy, w, h) to (xmin, ymin, xmax, ymax)."""
    return torch.cat((boxes[:, :2] - boxes[:, 2:]/2,     # xmin, ymin
                     boxes[:, :2] + boxes[:, 2:]/2), 1)  # xmax, ymax


def center_size(boxes):
    """Convert boxes from (xmin, ymin, xmax, ymax) to (cx, cy, w, h)."""
    # the tensors to concatenate must be passed to torch.cat as one tuple
    return torch.cat(((boxes[:, 2:] + boxes[:, :2]) / 2,   # cx, cy
                      boxes[:, 2:] - boxes[:, :2]), 1)     # w, h
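A quick round trip on a single box shows that the two conversions are inverses. These are scalar re-implementations for illustration only; they reuse the names point_form and center_size but operate on plain tuples rather than tensors.

```python
def point_form(box):
    """(cx, cy, w, h) -> (xmin, ymin, xmax, ymax)"""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def center_size(box):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h)"""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)

prior = (0.5, 0.5, 0.25, 0.5)
corners = point_form(prior)
print(corners)               # (0.375, 0.25, 0.625, 0.75)
print(center_size(corners))  # (0.5, 0.5, 0.25, 0.5)
```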

IoU Computation

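This part is simple. For two boxes, take the element-wise maximum of their top-left corners and the element-wise minimum of their bottom-right corners, compute the intersection area from them, and divide it by the union area. The code is again in layers/box_utils.py: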

def intersect(box_a, box_b):
    """ We resize both tensors to [A,B,2] without new malloc:
    [A,2] -> [A,1,2] -> [A,B,2]
    [B,2] -> [1,B,2] -> [A,B,2]
    Then we compute the area of intersect between box_a and box_b.
    Args:
      box_a: (tensor) bounding boxes, Shape: [A,4].
      box_b: (tensor) bounding boxes, Shape: [B,4].
    Return:
      (tensor) intersection area, Shape: [A,B].
    """
    A = box_a.size(0)
    B = box_b.size(0)
    # bottom-right corner of the intersection: element-wise minimum
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    # top-left corner of the intersection: element-wise maximum
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    # clamp negatives to 0: no overlap means zero intersection
    inter = torch.clamp((max_xy - min_xy), min=0)
    return inter[:, :, 0] * inter[:, :, 1]


def jaccard(box_a, box_b):
    """Compute the jaccard overlap of two sets of boxes.  The jaccard overlap
    is simply the intersection over union of two boxes.  Here we operate on
    ground truth boxes and default boxes.
    E.g.:
        A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)
    Args:
        box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]
        box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]
    Return:
        jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]
    """
    inter = intersect(box_a, box_b)  # A∩B
    # areas of box_a and box_b, each expanded to [A,B]
    area_a = ((box_a[:, 2]-box_a[:, 0]) *
              (box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2]-box_b[:, 0]) *
              (box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    union = area_a + area_b - inter
    return inter / union  # [A,B]
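For a single pair of boxes the same computation fits in a few lines, which makes it easy to check by hand. This scalar iou helper is only an illustration of the formula, not the batched version above.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

overlapping = iou((0, 0, 2, 2), (1, 1, 3, 3))   # 1 / (4 + 4 - 1) = 1/7
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))      # 0.0
```

The two 2x2 boxes share a 1x1 square, so the IoU is 1/7, about 0.143; the clamp to 0 is what keeps the disjoint case from producing a negative width.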

L2 Normalization

The conv4_3 feature map of VGG16 is 38\times 38. Because this layer sits early in the network, its feature norms are relatively large, so an L2 normalization is added to keep it from differing too much from the later detection layers. The L2 normalization is:
\hat{x}=\frac{x}{||x||_2}, where x=(x_1...x_d) and ||x||_2=(\sum_{i=1}^d|x_i|^2)^{1/2}. Note that simply L2-normalizing a layer's input changes the scale of that layer and slows down learning, so a scaling factor \gamma_i is introduced. For each channel, the result after L2 normalization is:
y_i=\gamma_i\hat{x_i}. In practice a scale of 10 or 20 works well. The code is from layers/modules/l2norm.py.

class L2Norm(nn.Module):
    '''
    The conv4_3 feature map is 38x38 and sits early in the network, so its
    norm is large. An L2 normalization keeps it from differing too much from
    the later detection layers. See ParseNet, covered in an earlier post.
    '''
    def __init__(self, n_channels, scale):
        super(L2Norm, self).__init__()
        self.n_channels = n_channels
        self.gamma = scale or None
        self.eps = 1e-10
        # wrap the tensor in nn.Parameter so that it becomes trainable
        self.weight = nn.Parameter(torch.Tensor(self.n_channels))
        self.reset_parameters()

    # initialize the scale parameters
    def reset_parameters(self):
        nn.init.constant_(self.weight, self.gamma)

    def forward(self, x):
        # L2 norm of x over the channel dimension; eps avoids division by zero
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + self.eps  # shape [b,1,38,38]
        x = x / norm   # shape [b,512,38,38]

        # broadcast self.weight as shape [1,512,1,1] and scale per channel
        out = self.weight[None,...,None,None] * x
        return out
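What the layer does at one spatial position can be shown with plain lists: the channel vector at that pixel is divided by its L2 norm and then each channel is multiplied by its own gamma. l2norm_scale is a hypothetical helper, and the 2-channel vector is just a toy input.

```python
import math

def l2norm_scale(channels, gamma, eps=1e-10):
    """L2-normalize one pixel's channel vector, then scale each channel by gamma."""
    norm = math.sqrt(sum(v * v for v in channels)) + eps
    return [g * v / norm for g, v in zip(gamma, channels)]

pixel = [3.0, 4.0]                     # toy 2-channel feature vector, norm 5
out = l2norm_scale(pixel, gamma=[20.0, 20.0])
out_norm = math.sqrt(sum(v * v for v in out))
# out is approximately [12.0, 16.0] and out_norm approximately 20.0
```

With every gamma initialized to 20, each pixel's feature vector starts with a norm of about 20 regardless of how large the raw conv4_3 activations were; training can then adjust the per-channel scales.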

Box Encoding and Decoding

As mentioned above, the coordinates used in the localization loss are encoded. What does that mean? According to the paper, predicted boxes and ground truth boxes are related by a transformation. First define some variables:

  • prior box position: d=(d^{cx},d^{cy},d^w,d^h)
  • ground truth box position: g=(g^{cx},g^{cy},g^w,g^h)
  • variance: the coordinate variances of the prior boxes.

Encoding can then be written as:
\hat{g}_j^{cx}=(g_j^{cx}-d_i^{cx})/d_i^w/variance[0]
\hat{g}_j^{cy}=(g_j^{cy}-d_i^{cy})/d_i^h/variance[1]
\hat{g}_j^w=\log(\frac{g_j^w}{d_i^w})/variance[2]
\hat{g}_j^h=\log(\frac{g_j^h}{d_i^h})/variance[3]

Decoding can be written as:
g_{predict}^{cx}=d^w*(variance[0]*l^{cx})+d^{cx}
g_{predict}^{cy}=d^h*(variance[1]*l^{cy})+d^{cy}
g_{predict}^w=d^w\exp(variance[2]*l^w)
g_{predict}^h=d^h\exp(variance[3]*l^h)

Note that the code stores only two variances: variances[0] is shared by cx and cy, and variances[1] by w and h. The corresponding code is in layers/box_utils.py:

def encode(matched, priors, variances):
    """Encode the variances from the priorbox layers into the ground truth boxes
    we have matched (based on jaccard overlap) with the prior boxes.
    Args:
        matched: (tensor) Coords of ground truth for each prior in point-form
            Shape: [num_priors, 4].
        priors: (tensor) Prior boxes in center-offset form
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        encoded boxes (tensor), Shape: [num_priors, 4]
    """

    # dist b/t match center and prior's center
    g_cxcy = (matched[:, :2] + matched[:, 2:])/2 - priors[:, :2]
    # encode variance
    g_cxcy /= (variances[0] * priors[:, 2:])
    # match wh / prior wh
    g_wh = (matched[:, 2:] - matched[:, :2]) / priors[:, 2:]
    g_wh = torch.log(g_wh) / variances[1]
    # return target for smooth_l1_loss
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]


# Adapted from https://github.com/Hakuyume/chainer-ssd
def decode(loc, priors, variances):
    """Decode locations from predictions using priors to undo
    the encoding we did for offset regression at train time.
    Args:
        loc (tensor): location predictions for loc layers,
            Shape: [num_priors,4]
        priors (tensor): Prior boxes in center-offset form.
            Shape: [num_priors,4].
        variances: (list[float]) Variances of priorboxes
    Return:
        decoded bounding box predictions
    """

    boxes = torch.cat((
        priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],
        priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])), 1)
    boxes[:, :2] -= boxes[:, 2:] / 2
    boxes[:, 2:] += boxes[:, :2]
    return boxes
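Encoding and decoding should be exact inverses, which is easy to verify on one box. The sketch below restates both functions for a single prior using plain floats; the names encode_box and decode_box and the variances (0.1, 0.2) are my assumptions for illustration (I believe those are the values in the repository's config, but they are not shown in this article).

```python
import math

def encode_box(gt, prior, variances=(0.1, 0.2)):
    """gt is (xmin, ymin, xmax, ymax); prior is (cx, cy, w, h)."""
    g_cx, g_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    g_w, g_h = gt[2] - gt[0], gt[3] - gt[1]
    return ((g_cx - prior[0]) / (variances[0] * prior[2]),
            (g_cy - prior[1]) / (variances[0] * prior[3]),
            math.log(g_w / prior[2]) / variances[1],
            math.log(g_h / prior[3]) / variances[1])

def decode_box(loc, prior, variances=(0.1, 0.2)):
    """Invert encode_box; returns (xmin, ymin, xmax, ymax)."""
    cx = prior[0] + loc[0] * variances[0] * prior[2]
    cy = prior[1] + loc[1] * variances[0] * prior[3]
    w = prior[2] * math.exp(loc[2] * variances[1])
    h = prior[3] * math.exp(loc[3] * variances[1])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

prior = (0.5, 0.5, 0.2, 0.2)
gt = (0.42, 0.38, 0.66, 0.58)
target = encode_box(gt, prior)        # what the network is trained to output
restored = decode_box(target, prior)  # recovers gt up to rounding
```

Dividing by the variances makes the regression targets larger in magnitude (here the center offset of 0.04 becomes roughly 2), which is the practical effect of the "encode variance" step.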

Post-processing: NMS

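I covered the principle of NMS in last week's post, so I won't repeat it here. The IOU threshold is taken as 0.5 here (note that the SSD class above constructs Detect with an NMS threshold of 0.45). If you are not familiar with NMS, see that post, which also walks through the source: https://mp.weixin.qq.com/s/orYMdwZ1VwwIScPmIiq5iA . The code for this part is also in layers/box_utils.py, so I won't reproduce it here.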
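For readers who want a reminder without leaving the page, here is a minimal greedy NMS on plain lists. It is only a sketch of the general technique: greedy_nms and iou are my own illustrative helpers, not the repository's nms, which works on tensors and returns the kept indices together with a count.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def greedy_nms(boxes, scores, overlap=0.5, top_k=200):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= overlap]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119, about 0.68, so it is suppressed; the third box does not overlap and is kept.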

Detection Function

At test time, loc and conf are fed into the Detect function, which applies NMS and returns the results. The code is in layers/functions/detection.py:

class Detect(Function):
    """At test time, Detect is the final layer of SSD.  Decode location preds,
    apply non-maximum suppression to location predictions based on conf
    scores and threshold to a top_k number of output predictions for both
    confidence score and locations.
    """
    def __init__(self, num_classes, bkg_label, top_k, conf_thresh, nms_thresh):
        self.num_classes = num_classes
        self.background_label = bkg_label
        self.top_k = top_k
        # Parameters used in nms.
        self.nms_thresh = nms_thresh
        if nms_thresh <= 0:
            raise ValueError('nms_threshold must be positive.')
        self.conf_thresh = conf_thresh
        self.variance = cfg['variance']

    def forward(self, loc_data, conf_data, prior_data):
        """
        Args:
            loc_data: predicted loc tensor, shape [b,M,4], e.g. [b, 8732, 4]
            conf_data: predicted confidences, shape [b,M,num_classes], e.g. [b, 8732, 21]
            prior_data: prior boxes, shape [M,4], e.g. [8732, 4]
        """
        num = loc_data.size(0)  # batch size
        num_priors = prior_data.size(0)
        output = torch.zeros(num, self.num_classes, self.top_k, 5)  # initialize the output
        conf_preds = conf_data.view(num, num_priors,
                                    self.num_classes).transpose(2, 1)

        # decode the loc predictions into regular bboxes
        for i in range(num):
            decoded_boxes = decode(loc_data[i], prior_data, self.variance)
            # copy this image's conf scores for nms
            conf_scores = conf_preds[i].clone()
            # loop over the classes, skipping background (0)
            for cl in range(1, self.num_classes):
                # drop scores with conf < conf_thresh
                c_mask = conf_scores[cl].gt(self.conf_thresh)
                scores = conf_scores[cl][c_mask]
                # if nothing is left, move on to the next class
                if scores.size(0) == 0:
                    continue
                # drop the boxes with conf < conf_thresh
                l_mask = c_mask.unsqueeze(1).expand_as(decoded_boxes)
                boxes = decoded_boxes[l_mask].view(-1, 4)
                # idx of highest scoring and non-overlapping boxes per class
                ids, count = nms(boxes, scores, self.nms_thresh, self.top_k)
                # store the nms results
                output[i, cl, :count] = \
                    torch.cat((scores[ids[:count]].unsqueeze(1),
                               boxes[ids[:count]]), 1)
        flt = output.contiguous().view(num, -1, 5)
        _, idx = flt[:, :, 0].sort(1, descending=True)
        _, rank = idx.sort(1)
        flt[(rank < self.top_k).unsqueeze(-1).expand_as(flt)].fill_(0)
        return output

Afterword

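That is about it for SSD's core code. I think the flow of the algorithm is fairly clear by now. SSD's good results also owe a lot to its effective data augmentation, which we may walk through another time. The outline of this article follows the Zhihu post https://zhuanlan.zhihu.com/p/79854543. Reading the code, writing, and working out some of the details took about a week, so if you have read this far, please consider giving it a like.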


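You are welcome to follow my WeChat official account GiantPadaCV, where I look forward to discussing machine learning, deep learning, image algorithms, optimization techniques, competitions, and everyday life with you.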

