Key features: self-attention layers, end-to-end set prediction, bipartite matching loss.
The DETR model has two essential components:
1) a set-prediction loss that enforces a unique matching between ground-truth and predicted objects;
2) an architecture that predicts (in a single pass) a set of objects and models the relations between them.
In addition, because self-attention is used and the learned queries (the "viewer attention") train well, each query ends up attending to a different region, so segmentation quality is good and occlusion is handled effectively.



DETR treats object detection as an image-to-set problem: given an image, the model's prediction is an unordered set containing all objects.
This turns detection into a set-prediction task: a transformer encoder-decoder plus bipartite matching produces the prediction set directly from the input image. It generates no anchors; predictions come out directly.
Drawback: relatively poor on small objects and crowded multi-object scenes.
1. Background
1.1 One-hot matrices: a matrix in which each row has exactly one element equal to 1 and all others 0. Assign every word in the dictionary an index; to encode a sentence, convert each word into the one-hot row whose 1 sits at that word's index. For example, "the cat sat on the mat" can be represented by such a matrix.
One-hot representation is intuitive but has two drawbacks. First, each row is as long as the dictionary: with a 10,000-word dictionary, every word becomes a 1x10000 vector with a single 1 and the rest 0, which wastes space and is awkward to compute with. Second, one-hot merely numbers the words and captures no relations between them: "cat" should be more related to "mouse" than to "cellphone", but one-hot encoding cannot express that.
1.2 Word embeddings (Word Embedding): these solve both problems. A word-embedding matrix assigns each word a fixed-length vector; the length is configurable (say 300) and in practice far smaller than the dictionary size (say 10,000). Moreover, the angle between two word vectors can serve as a measure of their relatedness, as illustrated below.
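The contrast between the two representations can be sketched as follows (a toy 6-word vocabulary; the embedding dimension is shrunk from a realistic 300 to 3 for readability, and the embedding matrix is random rather than trained):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}

def one_hot(sentence):
    # Each row has a single 1 at the word's index in the vocabulary.
    ids = [vocab[w] for w in sentence.split()]
    m = np.zeros((len(ids), len(vocab)))
    m[np.arange(len(ids)), ids] = 1
    return m

onehot = one_hot("the cat sat on the mat")
print(onehot.shape)   # (6, 6) -- each row is as wide as the vocabulary

# A word-embedding matrix instead maps each index to a short dense vector.
embed = np.random.randn(len(vocab), 3)
dense = embed[[vocab[w] for w in "the cat sat on the mat".split()]]
print(dense.shape)    # (6, 3) -- far smaller than the vocabulary size
```

Note how the embedded sentence keeps one row per word but shrinks each row from vocabulary-size to the chosen embedding dimension.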


The three W matrices are the learned weight matrices that produce q, k, and v. The first step of self-attention is to generate three vectors from each encoder input vector (each word's embedding): for every word we create a query vector, a key vector, and a value vector, obtained by multiplying the embedding by the three weight matrices.
Query vector: the embedding multiplied by WQ; it is multiplied against all key vectors to produce scores directly.
Key vector: likewise, the embedding multiplied by WK.
Value vector: likewise; it is what gets weighted by each word's score.
1.3 Self-attention, step by step:





1. Multiply the query vector with each key vector to get a score (e.g. 112 and 96); these scores measure how relevant "Thinking" and "Machines" are to themselves and to each other.
2. Divide the scores by the square root of the key dimension (sqrt(64) = 8); this dimension penalty helps keep gradients stable.
3. Apply softmax to normalize, so every word gets a weight.
4. Take the weighted sum of the value vectors using those weights, producing z_i.
Each output b_i aggregates global information; restricting the sum to a neighborhood (e.g. only position (1,1)) would collect only local information.
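The four steps above can be sketched with plain tensors (a toy example: two tokens, random stand-ins for the trained WQ/WK/WV matrices; the 64-dimensional keys match the sqrt(64) = 8 divisor mentioned above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64                       # key dimension; scores get divided by sqrt(64) = 8
x = torch.randn(2, d_k)        # embeddings for two tokens, e.g. "Thinking", "Machines"

# Step 0: project the embeddings with the weight matrices (random here, trained in practice).
WQ, WK, WV = (torch.randn(d_k, d_k) for _ in range(3))
q, k, v = x @ WQ, x @ WK, x @ WV

scores = q @ k.T                     # step 1: query-key dot products
scores = scores / d_k ** 0.5         # step 2: scale by sqrt(d_k) for gradient stability
weights = F.softmax(scores, dim=-1)  # step 3: normalize; each row sums to 1
z = weights @ v                      # step 4: weighted sum of the value vectors
print(z.shape)                       # torch.Size([2, 64])
```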
1.4 Multi-head self-attention: with "multi-head" attention we keep multiple sets of query/key/value weight matrices (the Transformer uses eight attention heads, so each encoder/decoder has eight sets). Each set is randomly initialized; after training, each set is used to project the input embeddings (or vectors from lower encoder/decoder layers) into a different representation subspace.


The eight resulting matrices are then concatenated, and a trained matrix W0 fuses the attention heads back together.
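The concatenate-and-fuse step can be sketched as follows (random stand-ins for the eight per-head Z matrices and for the trained W0; sequence length 4 is arbitrary):

```python
import torch

torch.manual_seed(0)
n_heads, seq_len, d_head, d_model = 8, 4, 8, 64

# Pretend each of the 8 heads already produced its own Z matrix.
zs = [torch.randn(seq_len, d_head) for _ in range(n_heads)]

# Concatenate along the feature axis, then fuse with a trained matrix W0
# (random here) that projects back to the model dimension.
concat = torch.cat(zs, dim=-1)            # (4, 64)
W0 = torch.randn(n_heads * d_head, d_model)
fused = concat @ W0                       # (4, 64)
print(fused.shape)
```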

1.5 Positional encoding:
In NLP, the words in a sentence also need a positional encoding to establish distances between words. The encoder adds to each input embedding a vector that follows a specific pattern, which lets the model determine each word's position, or the distance between different words in the sequence. For example, if the input embeddings have dimension 4, the actual positional encodings look as shown below.
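The pattern referred to above can be sketched as the sinusoidal encoding from "Attention Is All You Need" (a minimal NumPy sketch; the dimension-4 setting matches the example above):

```python
import math
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

pe = positional_encoding(seq_len=6, d_model=4)   # embedding dimension 4, as above
print(pe.shape)   # (6, 4); pe[0] is [0, 1, 0, 1] since sin(0)=0 and cos(0)=1
```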

1. The author's own ideas in the code
Args: the full set of hyperparameters
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path=None, dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
The author defines a new data type to hold the features: tensors holds the image values. When the images in a batch have different sizes, they must be made uniform; simply put, every image is padded (with zeros) up to the largest size. Each image in the batch then carries a mask matrix indicating which pixels are real data and which are padding; per the docstring below, 1 (True) marks padded pixels and 0 marks real data.
The NestedTensor data type:
# NestedTensor, which consists of:
#  - samples.tensor: batched images, of shape [batch_size x 3 x H x W]
#  - samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels
class NestedTensor(object):
    def __init__(self, tensors, mask: Optional[Tensor]):
        self.tensors = tensors
        self.mask = mask

    def to(self, device):
        # type: (Device) -> NestedTensor # noqa
        cast_tensor = self.tensors.to(device)
        mask = self.mask
        if mask is not None:
            cast_mask = mask.to(device)
        else:
            cast_mask = None
        return NestedTensor(cast_tensor, cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    def __repr__(self):
        return str(self.tensors)
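How a batch ends up as such a tensor/mask pair can be sketched with a toy padding helper (a hedged sketch, not DETR's actual nested_tensor_from_tensor_list; True marks padded pixels, matching the docstring above):

```python
import torch

def pad_batch(images):
    # images: list of (3, H, W) tensors with varying H, W
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), 3, max_h, max_w)
    mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # True = padding
    for i, img in enumerate(images):
        _, h, w = img.shape
        batch[i, :, :h, :w] = img    # copy real pixels; zeros remain elsewhere
        mask[i, :h, :w] = False      # mark real pixels as not-padded
    return batch, mask

imgs = [torch.rand(3, 4, 5), torch.rand(3, 6, 3)]
batch, mask = pad_batch(imgs)
print(batch.shape, mask.shape)   # torch.Size([2, 3, 6, 5]) torch.Size([2, 6, 5])
```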
2. DETR architecture
DETR does not propose a new layer; it proposes a new framework. A conventional CNN first extracts features while the positional encoding is computed in parallel. After the Transformer encoder and decoder, a simple feed-forward network (FFN) produces the results.

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads,
                 num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only convolutional layers from ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()
2.0 Backbone
Its requirements are simple; it only needs to satisfy:
input of shape 3 x H0 x W0, output with C = 2048 channels and H, W = H0/32, W0/32.
The feature map is then flattened into a C x HW sequence; the positional information at this point is two-dimensional (x and y).
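The flattening step can be sketched as follows (illustrative shapes: batch size 1 and C = 256, i.e. after the 1x1 channel-reduction convolution):

```python
import torch

c, h, w = 256, 7, 7
feat = torch.randn(1, c, h, w)           # (batch, C, H, W) backbone output
seq = feat.flatten(2).permute(2, 0, 1)   # (H*W, batch, C): one token per pixel
print(seq.shape)                         # torch.Size([49, 1, 256])
```

This is exactly the `src.flatten(2).permute(2, 0, 1)` reshaping that appears in the Transformer forward pass below.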

2.1 Position encoding

Main.py:
def main(args):
    model, criterion, postprocessors = build_model(args)

if __name__ == '__main__':
    parser = argparse.ArgumentParser('DETR training and evaluation script', parents=[get_args_parser()])
    args = parser.parse_args()
    if args.output_dir:
        Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    main(args)

Detr.py:
def build(args):
    backbone = build_backbone(args)

Position encoding:
def build_position_encoding(args):
    N_steps = args.hidden_dim // 2  # half the hidden dimension
    if args.position_embedding in ('v2', 'sine'):
        # TODO find a better way of exposing other arguments
        position_embedding = PositionEmbeddingSine(N_steps, normalize=True)
    elif args.position_embedding in ('v3', 'learned'):
        position_embedding = PositionEmbeddingLearned(N_steps)
    else:
        raise ValueError(f"not supported {args.position_embedding}")
    return position_embedding
class PositionEmbeddingSine(nn.Module):
    """
    This is a more standard version of the position embedding, very similar to the one
    used by the Attention is all you need paper, generalized to work on images.
    """
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        if scale is not None and normalize is False:
            raise ValueError("normalize should be True if scale is passed")
        if scale is None:
            scale = 2 * math.pi
        self.scale = scale

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        mask = tensor_list.mask
        assert mask is not None
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
        pos_x = x_embed[:, :, :, None] / dim_t  # x_embed holds the raw x positions
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
        return pos
2.2 Transformer

class Transformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False,
                 return_intermediate_dec=False):
        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
        # six encoder layers: each is one 8-head self-attention plus an FFN
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
                                          return_intermediate=return_intermediate_dec)
        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, mask, query_embed, pos_embed):
        # flatten NxCxHxW to HWxNxC
        bs, c, h, w = src.shape
        src = src.flatten(2).permute(2, 0, 1)
        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)
        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)  # expand across the batch by copying
        mask = mask.flatten(1)  # flatten
        tgt = torch.zeros_like(query_embed)  # all-zeros tensor shaped like query_embed: the fixed input to the first decoder layer
        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)  # encode
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)  # decode
        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        self.row_embed = nn.Embedding(50, num_pos_feats)
        self.col_embed = nn.Embedding(50, num_pos_feats)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, tensor_list: NestedTensor):
        x = tensor_list.tensors
        h, w = x.shape[-2:]
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)
        y_emb = self.row_embed(j)
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(h, 1, 1),
            y_emb.unsqueeze(1).repeat(1, w, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return pos
2.2.1 Transformer encoder
First, a 1x1 convolution reduces the channel dimension of the high-level feature map f from C down to a smaller d. Since the transformer expects a sequence as input, each channel of the feature map is flattened into a vector, giving a d x HW input. Because the transformer is permutation-invariant (the input order does not affect the result, so positional information is simply lost), fixed positional encodings must be added to the input. Each encoder layer consists of a multi-head self-attention module and an FFN; each input token comes out of the encoder as a d-dimensional feature vector.
Under the multi-head mechanism we keep separate query/key/value weight matrices for each head; eight separate weight computations yield eight different Z matrices.
The encoder exists purely to serve the decoder: it applies a self-attention-weighted mapping to the input features. With a good enough backbone, one could in principle skip the encoder entirely.

The encoder's inputs are: src, mask, pos_embed
class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers, norm=None):
        super().__init__()
        self.layers = _get_clones(encoder_layer, num_layers)  # clone encoder_layer num_layers times
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src,
                mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        output = src
        for layer in self.layers:
            output = layer(output, src_mask=mask,
                           src_key_padding_mask=src_key_padding_mask, pos=pos)
            # output passes through the six encoder layers
        if self.norm is not None:
            output = self.norm(output)
        return output


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):  # add the positional encoding
        return tensor if pos is None else tensor + pos

    # pre vs post differ in where the normalization is applied
    def forward_post(self,
                     src,
                     src_mask: Optional[Tensor] = None,
                     src_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(src, pos)  # fold in the positional encoding: pos + src
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)  # residual-style addition
        src = self.norm1(src)  # LayerNorm, not BatchNorm
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))  # two-layer FFN
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

    def forward_pre(self, src,
                    src_mask: Optional[Tensor] = None,
                    src_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None):
        src2 = self.norm1(src)
        q = k = self.with_pos_embed(src2, pos)
        src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
                              key_padding_mask=src_key_padding_mask)[0]
        src = src + self.dropout1(src2)
        src2 = self.norm2(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
        src = src + self.dropout2(src2)
        return src

    def forward(self, src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
        return self.forward_post(src, src_mask, src_key_padding_mask, pos)
The output of each of the first five layers becomes the src for the next layer; the final layer's output is called memory and is what gets passed on to the decoder. The two are the same thing, just named differently.
2.2.2 Transformer decoder
The principle is much like the encoder, with encoder-decoder attention added. The decoder is the key part of the Transformer: everything before it can be viewed as feature extraction, while the decoder outputs the predicted classes and regressed coordinates.
Its input is the N object queries. After a self-attention pass, the encoder output is mixed in for encoder-decoder attention; a feed-forward network then decodes them independently into box coordinates and class labels, producing N final predictions. Through self-attention and encoder-decoder attention over these embeddings, the model reasons globally about pairwise relations between all objects while using the whole image as context.
Inputs:
memory: the encoder output, size = [56, 2, 256]
mask: the same mask as above
pos_embed: the same pos_embed as above
query_embed: the randomly generated viewer attention (object queries), size = [100, 2, 256]
tgt: the input to each decoder layer; zero for the first layer
class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
        super().__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.return_intermediate = return_intermediate

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        output = tgt
        intermediate = []
        for layer in self.layers:
            output = layer(output, memory, tgt_mask=tgt_mask,
                           memory_mask=memory_mask,
                           tgt_key_padding_mask=tgt_key_padding_mask,
                           memory_key_padding_mask=memory_key_padding_mask,
                           pos=pos, query_pos=query_pos)
            if self.return_intermediate:
                intermediate.append(self.norm(output))
        if self.norm is not None:
            output = self.norm(output)
            if self.return_intermediate:
                intermediate.pop()
                intermediate.append(output)
        if self.return_intermediate:
            return torch.stack(intermediate)
        return output.unsqueeze(0)
class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)  # self-attention
        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)  # same module, but used as encoder-decoder attention
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)  # feed-forward network
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(self, tgt, memory,
                     tgt_mask: Optional[Tensor] = None,
                     memory_mask: Optional[Tensor] = None,
                     tgt_key_padding_mask: Optional[Tensor] = None,
                     memory_key_padding_mask: Optional[Tensor] = None,
                     pos: Optional[Tensor] = None,
                     query_pos: Optional[Tensor] = None):
        q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt

    def forward_pre(self, tgt, memory,
                    tgt_mask: Optional[Tensor] = None,
                    memory_mask: Optional[Tensor] = None,
                    tgt_key_padding_mask: Optional[Tensor] = None,
                    memory_key_padding_mask: Optional[Tensor] = None,
                    pos: Optional[Tensor] = None,
                    query_pos: Optional[Tensor] = None):
        tgt2 = self.norm1(tgt)
        q = k = self.with_pos_embed(tgt2, query_pos)
        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
                              key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt2 = self.norm2(tgt)
        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
                                   key=self.with_pos_embed(memory, pos),
                                   value=memory, attn_mask=memory_mask,
                                   key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt2 = self.norm3(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
        tgt = tgt + self.dropout3(tgt2)
        return tgt

    def forward(self, tgt, memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        if self.normalize_before:
            return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
                                    tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
        return self.forward_post(tgt, memory, tgt_mask, memory_mask,
                                 tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
2.3 Object queries
These are a set of randomly initialized vectors that model the attention of "viewers"; the paper simulates n = 100 viewers' focal points, and a focal point may be an empty set (no object). Moreover, each viewer develops attention for three kinds of boxes: green corresponds to small boxes, red to large horizontal boxes, and blue to large vertical boxes.
Inside class DETR there is this line:
self.query_embed = nn.Embedding(num_queries, hidden_dim)
which generates the viewer attention (the object queries). The nn.Embedding parameters:
num_embeddings (python:int) – size of the dictionary; e.g. if 5000 distinct words occur, pass 5000 (indices 0-4999).
embedding_dim (python:int) – dimension of the embedding vectors, i.e. how many dimensions represent one symbol.
padding_idx (python:int, optional) – padding id: if the input length is 100 but sentences vary in length, the remainder is filled with this designated number, and the network then skips computing its correlation with other symbols (initialized to 0).
max_norm (python:float, optional) – maximum norm; embedding vectors whose norm exceeds this bound are renormalized.
norm_type (python:float, optional) – which norm to compute and compare against max_norm; defaults to the 2-norm.
scale_grad_by_freq (boolean, optional) – scale gradients by word frequency in the mini-batch. Default False.
sparse (bool, optional) – if True, the gradient w.r.t. the weight matrix becomes a sparse tensor.
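A minimal sketch of how that nn.Embedding line produces the queries and how they get expanded across the batch (matching the [100, 2, 256] shape noted earlier; batch size 2 assumed):

```python
import torch
from torch import nn

num_queries, hidden_dim = 100, 256
query_embed = nn.Embedding(num_queries, hidden_dim)   # one learned vector per query
print(query_embed.weight.shape)    # torch.Size([100, 256])

# In the transformer forward pass the weight is copied across the batch:
bs = 2
q = query_embed.weight.unsqueeze(1).repeat(1, bs, 1)
print(q.shape)                     # torch.Size([100, 2, 256])
```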

The object queries first go through one round of self-attention themselves, then attend to the encoder's self-attention output in the encoder-decoder attention; the resulting features carry the viewer attention. An FFN then produces the combined class-and-bbox information, which is passed to two FFN heads that handle classification and regression respectively.

2.4 Regression
In class DETR, bbox_embed and class_embed are defined: the final FFN that produces the bboxes, and the classification head. num_classes + 1 accounts for the background (no-object) class.
self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
self.class_embed = nn.Linear(hidden_dim, num_classes + 1)
The FFN module is straightforward:
class MLP(nn.Module):
    """ Very simple multi-layer perceptron (also called FFN)"""
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x
In class DETR:
outputs_class = self.class_embed(hs)
outputs_coord = self.bbox_embed(hs).sigmoid()
Output shapes: pred_logits: [2, 100, classes+1]; outputs_coord: [2, 100, 4]

3. Loss function
Because of DETR's special structure, the loss must be constructed differently. DETR outputs a fixed-size set of N predictions (N = 100 in the paper), where N is usually much larger than the number of objects in an image. The main difficulty is scoring the predicted objects (class, position, size) against the ground truth.
Let ŷ denote the set of N predictions. Since N far exceeds the number of objects in the image, the ground-truth set y is padded with ∅ (no object). Finding the minimum-loss pairing between the N elements of ŷ and the N elements of y is a minimum-cost bipartite matching problem, solvable with the weighted Hungarian (Kuhn-Munkres) algorithm: it iteratively builds a minimum-cost one-to-one matching, the key step being the search for augmenting paths — given a current matching, a path that starts from an unmatched vertex and alternates between unmatched and matched edges is called an augmenting path.
Note: a greedy algorithm can also give an approximately optimal solution at lower computational cost.
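The matching itself can be sketched with SciPy's Hungarian solver on a toy cost matrix (DETR's matcher likewise relies on scipy.optimize.linear_sum_assignment; the cost values here are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = matching cost between prediction i and ground-truth j
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.9, 0.1]])

# Hungarian algorithm: the one-to-one assignment minimizing the total cost.
rows, cols = linear_sum_assignment(cost)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # (0,1), (1,0), (2,2): total 0.4
```

Each prediction ends up paired with exactly one ground-truth slot, and only the losses along the chosen pairs are back-propagated.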

This replaces computing IoU between ground truth and region proposals: non-maximum suppression is no longer needed to prune proposals, which also removes manual tuning cost. The formula below means: find the permutation that minimizes the pairwise matching loss between y and ŷ, i.e. the global minimum over all pairings.

The loss function is a linear combination of a class loss and a box loss: the left term is the class loss, the right term the box loss.
Class loss first: σ̂ is the optimal matching, and p̂σ̂(i)(ci) is the predicted probability of the true class ci under that matching; the class term is the negative log-likelihood (cross-entropy) of that probability. Since the number of real objects in an image is certainly smaller than N (= 100), c is often the empty class ∅; when c = ∅ the log-probability term is down-weighted by a factor of 10 to counter the imbalance between positive and negative samples.
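The down-weighting of the empty class can be sketched as a weighted cross-entropy (a toy sketch with random logits; num_classes = 91 for COCO and eos_coef = 0.1 are taken from the args shown earlier):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 91                       # COCO classes; index 91 is "no object"
empty_weight = torch.ones(num_classes + 1)
empty_weight[-1] = 0.1                 # eos_coef: 10x lower weight for the empty class

logits = torch.randn(2, num_classes + 1, 100)    # [batch, classes+1, queries]
targets = torch.full((2, 100), num_classes)      # pretend every query matched "no object"
loss = F.cross_entropy(logits, targets, empty_weight)
print(loss.item())   # scalar loss; empty-class terms contribute at 1/10 weight
```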

def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
    """Classification loss (NLL)
    targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
    """
    assert 'pred_logits' in outputs
    src_logits = outputs['pred_logits']
    idx = self._get_src_permutation_idx(indices)
    target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)])
    target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                dtype=torch.int64, device=src_logits.device)
    target_classes[idx] = target_classes_o
    loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)  # cross entropy
    losses = {'loss_ce': loss_ce}
    return losses
The concrete bbox loss is a combination of L1 loss and an IoU-based (GIoU) loss, because L1 loss alone is very sensitive to object scale. The weights λ are hyperparameters and must be set manually.

def loss_boxes(self, outputs, targets, indices, num_boxes):
    """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
    targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
    The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
    """
    assert 'pred_boxes' in outputs  # assert: an "if" that raises on failure
    idx = self._get_src_permutation_idx(indices)
    src_boxes = outputs['pred_boxes'][idx]
    target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)
    loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')
    losses = {}  # losses is a dict
    losses['loss_bbox'] = loss_bbox.sum() / num_boxes  # first component: L1 loss
    loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
        box_ops.box_cxcywh_to_xyxy(src_boxes),
        box_ops.box_cxcywh_to_xyxy(target_boxes)))
    losses['loss_giou'] = loss_giou.sum() / num_boxes  # second component: GIoU loss
    return losses
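The GIoU quantity computed by box_ops.generalized_box_iou can be illustrated for a single pair of axis-aligned boxes in (x1, y1, x2, y2) form (a self-contained sketch of the generalized IoU formula, not the box_ops implementation):

```python
import torch

def giou(a, b):
    # intersection of the two boxes
    lt = torch.max(a[:2], b[:2])
    rb = torch.min(a[2:], b[2:])
    inter = (rb - lt).clamp(min=0).prod()
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest box enclosing both, which penalizes distant non-overlapping boxes
    lt_c = torch.min(a[:2], b[:2])
    rb_c = torch.max(a[2:], b[2:])
    area_c = (rb_c - lt_c).prod()
    return iou - (area_c - union) / area_c

a = torch.tensor([0.0, 0.0, 2.0, 2.0])
b = torch.tensor([1.0, 1.0, 3.0, 3.0])
print(giou(a, b))   # IoU = 1/7; enclosing area 9, union 7 -> GIoU = 1/7 - 2/9
```

Unlike plain IoU, GIoU stays informative (negative) even when the boxes do not overlap, which is why the loss uses 1 - GIoU.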
4. Experimental results
