The code in this series comes from https://github.com/jwyang/faster-rcnn.pytorch
— feel free to star it; it currently supports the PyTorch 1.0 series.
References: the theory sections draw on two articles (personal notes):
"A detailed look at the basic principles and implementation details of ROI Align"
"Notes on RoIPooling and RoIAlign"
Why roi-align
RoI Align is a region-feature aggregation method proposed in the Mask R-CNN paper. It neatly fixes the region mis-alignment caused by the two quantization steps in RoI Pooling. Experiments show that replacing RoI Pooling with RoI Align improves the accuracy of detection models.
1. The limitations of RoI Pooling (the source of the mis-alignment problem)
In the common two-stage detection frameworks (Fast R-CNN, Faster R-CNN, R-FCN), RoI Pooling takes the coordinates of a proposal box and pools the corresponding region of the feature map into a fixed-size feature map for the subsequent classification and bounding-box regression. Because the proposal coordinates are regressed by the model, they are generally floating-point numbers, while the pooled feature map must have a fixed size, so RoI Pooling involves two quantization steps:
- Quantize the proposal-box boundary to integer coordinates.
- Split the quantized region evenly into k x k bins and quantize the boundary of every bin.
After these two quantizations, the box no longer sits exactly where the regression placed it, and this offset hurts detection (or segmentation) accuracy. The paper calls this the "misalignment" problem.

Referring to the figure above:
The conv layers are VGG16 with feat_stride=32 (i.e., after the conv stack the image shrinks to 1/32 of its original size); the input image is 800*800, so the final feature map is 25*25.
Suppose the image contains a region proposal of size 665*665. Mapped onto the feature map it measures 665/32 = 20.78, i.e. 20.78*20.78. If you have read the C++ source of Caffe's RoI Pooling, you know the computation rounds down at this point, so the first quantization maps the proposal to 20*20 on the feature map.
With pooled_w=7 and pooled_h=7 (the output is fixed at 7*7), the 20*20 region mapped onto the feature map is split into 49 equally sized cells, each 20/7 = 2.86, i.e. 2.86*2.86; the second quantization then shrinks each cell to 2*2.
From each 2*2 cell the maximum pixel value is taken as that cell's 'representative', so the 49 cells yield 49 values, forming a 7*7 feature map.
In summary: the two quantizations (rounding floats down to integers) shift the region proposal away from the 20.78*20.78 area it originally mapped to on the feature map, and such pixel-level deviations inevitably degrade the regression and localization in later layers.
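The two floor operations above can be sketched in a few lines, using the same numbers as this walkthrough (800*800 image, 665*665 proposal, stride 32, 7*7 output):

```python
import math

feat_stride = 32   # VGG16: an 800x800 image maps to a 25x25 feature map
roi_side = 665     # side length of the region proposal in the original image
pooled = 7         # fixed output size of RoI Pooling

mapped = roi_side / feat_stride      # 20.78125 on the feature map
mapped_q = math.floor(mapped)        # 1st quantization -> 20
bin_size = mapped_q / pooled         # ~2.86 feature-map pixels per bin
bin_q = math.floor(bin_size)         # 2nd quantization -> 2

# The 0.78 feature-map pixels lost in step 1 alone already correspond to
# roughly 0.78 * 32 = 25 pixels in the original image.
print(mapped, mapped_q, bin_size, bin_q)
```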
Hence the alternative: RoI Align.
2. How RoI Align works

Again, the figure above shows an analogous mapping:
- The conv layers are VGG16 with feat_stride=32 (the image shrinks to 1/32 of its original size); the input image is 800*800, so the final feature map is 25*25.
- Suppose the image contains a region proposal of size 665*665. Mapped onto the feature map it measures 665/32 = 20.78, i.e. 20.78*20.78. Unlike RoI Pooling, the value is not rounded down here; the floating-point size is kept.
- With pooled_w=7 and pooled_h=7 (the output is fixed at 7*7), the 20.78*20.78 region mapped onto the feature map is split into 49 equally sized cells, each 20.78/7 = 2.97, i.e. 2.97*2.97.
- Suppose the number of sampling points is 4: every 2.97*2.97 cell is split into four quarters, the centre of each quarter is taken as a sampling point, and the pixel value at that point is computed by bilinear interpolation, giving four values per cell, as in the figure below.

In the figure above, the values at the four red crosses '×' are computed by bilinear interpolation. The maximum of the four values is then taken as the value of the (2.97*2.97) cell; repeating this for all 49 cells again yields a 7*7 feature map.
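The bilinear interpolation used at each sampling point can be sketched in pure Python (a minimal version; feat is a 2-D list standing in for one feature-map channel, and (h, w) is assumed to have all four integer neighbours in range):

```python
def bilinear(feat, h, w):
    """Bilinearly interpolate feat (a 2-D list) at float coords (h, w)."""
    h0, w0 = int(h), int(w)        # top-left integer neighbour
    dh, dw = h - h0, w - w0        # fractional offsets
    return (feat[h0][w0]         * (1 - dh) * (1 - dw)
          + feat[h0][w0 + 1]     * (1 - dh) * dw
          + feat[h0 + 1][w0]     * dh * (1 - dw)
          + feat[h0 + 1][w0 + 1] * dh * dw)

# A 2x2 patch: interpolating at the exact centre averages the four corners.
feat = [[1.0, 2.0],
        [3.0, 4.0]]
print(bilinear(feat, 0.5, 0.5))   # -> 2.5
```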
Reading the RoI Align code
The directory layout here matches roi_pooling; see: Faster RCNN source-code walkthrough (2): roi_pooling
- Start with the CPU version in C, roi_align.c
#include <TH/TH.h>   // PyTorch's C extension API
#include <math.h>
#include <omp.h>     // OpenMP multi-threading

// Forward-declare the two functions implementing forward and backward
// (C requires declarations before use)
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                        const int height, const int width, const int channels,
                        const int aligned_height, const int aligned_width, const float* bottom_rois,
                        float* top_data);

void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
                         const int height, const int width, const int channels,
                         const int aligned_height, const int aligned_width, const float* bottom_rois,
                         float* bottom_diff);
int roi_align_forward(int aligned_height, int aligned_width, float spatial_scale,
                      THFloatTensor* features, THFloatTensor* rois, THFloatTensor* output)
{
    // Grab the input tensors as flat 1-D arrays.
    // features is contiguous NCHW data: for each image, one full
    // height*width plane per channel, channel after channel.
    float* data_flat = THFloatTensor_data(features);
    // rois_flat = [..., [batch_index x1 y1 x2 y2], ...]
    float* rois_flat = THFloatTensor_data(rois);
    float* output_flat = THFloatTensor_data(output);

    // Number of ROIs; each ROI = [batch_index x1 y1 x2 y2]
    int num_rois = THFloatTensor_size(rois, 0);
    int size_rois = THFloatTensor_size(rois, 1);
    if (size_rois != 5)
    {
        return 0;
    }

    int data_height = THFloatTensor_size(features, 2);   // feature-map height
    int data_width = THFloatTensor_size(features, 3);    // feature-map width
    int num_channels = THFloatTensor_size(features, 1);  // number of channels

    // Delegate to the standalone forward function
    ROIAlignForwardCpu(data_flat, spatial_scale, num_rois, data_height, data_width, num_channels,
                       aligned_height, aligned_width, rois_flat, output_flat);
    return 1;
}
int roi_align_backward(int aligned_height, int aligned_width, float spatial_scale,
                       THFloatTensor* top_grad, THFloatTensor* rois, THFloatTensor* bottom_grad)
{
    // Grab the input tensors as flat 1-D arrays
    float* top_grad_flat = THFloatTensor_data(top_grad);
    float* rois_flat = THFloatTensor_data(rois);
    float* bottom_grad_flat = THFloatTensor_data(bottom_grad);

    // Number of ROIs
    int num_rois = THFloatTensor_size(rois, 0);
    int size_rois = THFloatTensor_size(rois, 1);
    if (size_rois != 5)
    {
        return 0;
    }

    // batch size
    // int batch_size = THFloatTensor_size(bottom_grad, 0);
    int data_height = THFloatTensor_size(bottom_grad, 2);   // feature-map height
    int data_width = THFloatTensor_size(bottom_grad, 3);    // feature-map width
    int num_channels = THFloatTensor_size(bottom_grad, 1);  // number of channels

    // Delegate to the standalone backward function
    ROIAlignBackwardCpu(top_grad_flat, spatial_scale, num_rois, data_height,
                        data_width, num_channels, aligned_height, aligned_width, rois_flat, bottom_grad_flat);
    return 1;
}
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                        const int height, const int width, const int channels,
                        const int aligned_height, const int aligned_width, const float* bottom_rois,
                        float* top_data)
{
    // Total number of output elements
    const int output_size = num_rois * aligned_height * aligned_width * channels;
    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        // (n, c, ph, pw) is an element in the aligned output
        int pw = idx % aligned_width;                               // column in the output grid
        int ph = (idx / aligned_width) % aligned_height;            // row in the output grid
        int c = (idx / aligned_width / aligned_height) % channels;  // channel index
        int n = idx / aligned_width / aligned_height / channels;    // ROI index

        // bottom_rois is rois_flat; the five values per ROI are
        // [batch_index x1 y1 x2 y2], scaled onto the feature map
        float roi_batch_ind = bottom_rois[n * 5 + 0];
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        // Force malformed ROIs to be 1x1
        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);

        // Spacing of the sampling grid in each direction
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        // Floating-point coordinates of this cell's sampling point
        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;

        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);
        int img_start = roi_batch_ind * channels * height * width;

        // Bilinear interpolation over the four neighbouring pixels
        if (h < 0 || h >= height || w < 0 || w >= width)
        {
            top_data[idx] = 0.;
        }
        else
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;
            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;
            top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                          + bottom_data[upright] * (1. - h_ratio) * w_ratio
                          + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                          + bottom_data[downright] * h_ratio * w_ratio;
        }
    }
}
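Note that, unlike the four-samples-per-bin scheme described earlier, this CPU kernel takes a single bilinear sample per output cell, on a grid of aligned_height x aligned_width points spaced roi/(aligned - 1) apart; averaging neighbouring samples is left to the RoIAlignAvg module shown later. A pure-Python sketch of one output cell (roi_align_cell is a hypothetical helper name; feat is a 2-D list standing in for one channel):

```python
def roi_align_cell(feat, roi_start_h, roi_start_w, roi_h, roi_w,
                   aligned_h, aligned_w, ph, pw):
    """One output cell of the CPU kernel: a single bilinear sample."""
    bin_size_h = roi_h / (aligned_h - 1.0)
    bin_size_w = roi_w / (aligned_w - 1.0)
    h = ph * bin_size_h + roi_start_h     # float coords of the sampling point
    w = pw * bin_size_w + roi_start_w
    height, width = len(feat), len(feat[0])
    if h < 0 or h >= height or w < 0 or w >= width:
        return 0.0                        # out-of-range samples produce zero
    hstart = min(int(h), height - 2)      # top-left neighbour, clamped
    wstart = min(int(w), width - 2)
    h_ratio, w_ratio = h - hstart, w - wstart
    return (feat[hstart][wstart]         * (1 - h_ratio) * (1 - w_ratio)
          + feat[hstart][wstart + 1]     * (1 - h_ratio) * w_ratio
          + feat[hstart + 1][wstart]     * h_ratio * (1 - w_ratio)
          + feat[hstart + 1][wstart + 1] * h_ratio * w_ratio)

feat = [[float(10 * r + c) for c in range(4)] for r in range(4)]
# Sample the centre cell of a 3x3 grid over the top-left 3x3 of the map:
print(roi_align_cell(feat, 0.0, 0.0, 3.0, 3.0, 3, 3, 1, 1))   # -> 16.5
```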
void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
                         const int height, const int width, const int channels,
                         const int aligned_height, const int aligned_width, const float* bottom_rois,
                         float* bottom_diff)
{
    const int output_size = num_rois * aligned_height * aligned_width * channels;
    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        // (n, c, ph, pw) is an element in the aligned output
        int pw = idx % aligned_width;
        int ph = (idx / aligned_width) % aligned_height;
        int c = (idx / aligned_width / aligned_height) % channels;
        int n = idx / aligned_width / aligned_height / channels;

        float roi_batch_ind = bottom_rois[n * 5 + 0];
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        // Force malformed ROIs to be 1x1
        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;
        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);
        int img_start = roi_batch_ind * channels * height * width;

        // Scatter the upstream gradient with the same bilinear weights as
        // the forward pass; out-of-range samples contribute no gradient
        if (h >= 0 && h < height && w >= 0 && w < width)
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;
            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;
            bottom_diff[upleft] += top_diff[idx] * (1. - h_ratio) * (1. - w_ratio);
            bottom_diff[upright] += top_diff[idx] * (1. - h_ratio) * w_ratio;
            bottom_diff[downleft] += top_diff[idx] * h_ratio * (1. - w_ratio);
            bottom_diff[downright] += top_diff[idx] * h_ratio * w_ratio;
        }
    }
}
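The backward kernel mirrors the forward pass: the upstream gradient of each output cell is scattered onto the same four neighbouring pixels with the same bilinear weights. A minimal sketch of that scatter step (scatter_bilinear_grad is a hypothetical helper operating on a 2-D list of gradients):

```python
def scatter_bilinear_grad(grad_feat, h, w, g):
    """Distribute upstream gradient g to the four neighbours of (h, w)
    with the same bilinear weights used in the forward pass."""
    h0, w0 = int(h), int(w)
    dh, dw = h - h0, w - w0
    grad_feat[h0][w0]         += g * (1 - dh) * (1 - dw)
    grad_feat[h0][w0 + 1]     += g * (1 - dh) * dw
    grad_feat[h0 + 1][w0]     += g * dh * (1 - dw)
    grad_feat[h0 + 1][w0 + 1] += g * dh * dw

grad = [[0.0, 0.0], [0.0, 0.0]]
scatter_bilinear_grad(grad, 0.5, 0.5, 4.0)
print(grad)   # each neighbour receives 4.0 * 0.25 = 1.0
```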
- Next, functions/roi_align.py, which calls the concrete roi_align ops implemented in src
# --------------------
# The autograd Function for the custom RoI Align layer,
# including forward and backward
# --------------------
import torch
from torch.autograd import Function
from .._ext import roi_align


# TODO use save_for_backward instead
class RoIAlignFunction(Function):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)
        self.rois = None
        self.feature_size = None

    def forward(self, features, rois):
        self.rois = rois
        self.feature_size = features.size()
        batch_size, num_channels, data_height, data_width = features.size()
        num_rois = rois.size(0)
        output = features.new(num_rois, num_channels, self.aligned_height, self.aligned_width).zero_()
        if features.is_cuda:
            roi_align.roi_align_forward_cuda(self.aligned_height,
                                             self.aligned_width,
                                             self.spatial_scale, features,
                                             rois, output)
        else:
            roi_align.roi_align_forward(self.aligned_height,
                                        self.aligned_width,
                                        self.spatial_scale, features,
                                        rois, output)
        return output

    def backward(self, grad_output):
        assert(self.feature_size is not None and grad_output.is_cuda)
        batch_size, num_channels, data_height, data_width = self.feature_size
        grad_input = self.rois.new(batch_size, num_channels, data_height,
                                   data_width).zero_()
        roi_align.roi_align_backward_cuda(self.aligned_height,
                                          self.aligned_width,
                                          self.spatial_scale, grad_output,
                                          self.rois, grad_input)
        return grad_input, None
- Finally, modules/roi_align.py, where the roi_align layer itself is defined by calling the RoIAlignFunction from functions/roi_align.py
# --------------------
# The Modules that wrap RoIAlignFunction into layer definitions:
# a plain variant plus average-pooling and max-pooling variants
# --------------------
from torch.nn.modules.module import Module
from torch.nn.functional import avg_pool2d, max_pool2d
from ..functions.roi_align import RoIAlignFunction


class RoIAlign(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlign, self).__init__()
        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        return RoIAlignFunction(self.aligned_height, self.aligned_width,
                                self.spatial_scale)(features, rois)


class RoIAlignAvg(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlignAvg, self).__init__()
        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        # Sample one extra row/column, then average every 2x2 neighbourhood
        x = RoIAlignFunction(self.aligned_height + 1, self.aligned_width + 1,
                             self.spatial_scale)(features, rois)
        return avg_pool2d(x, kernel_size=2, stride=1)


class RoIAlignMax(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlignMax, self).__init__()
        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        # Sample one extra row/column, then max over every 2x2 neighbourhood
        x = RoIAlignFunction(self.aligned_height + 1, self.aligned_width + 1,
                             self.spatial_scale)(features, rois)
        return max_pool2d(x, kernel_size=2, stride=1)
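RoIAlignAvg and RoIAlignMax rely on a small trick: they request an (aligned_height+1) x (aligned_width+1) sample grid and then apply 2x2 pooling with stride 1, so every final cell combines four neighbouring samples, approximating the multi-sample scheme from the paper. The pooling step alone, sketched in pure Python over a 2-D list (not the torch call):

```python
def avg_pool2x2_stride1(x):
    """2x2 average pooling with stride 1 over a 2-D list (no padding):
    an (a+1) x (a+1) grid shrinks to a x a, each output mixing 4 samples."""
    h, w = len(x), len(x[0])
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(w - 1)] for i in range(h - 1)]

grid = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]    # a 3x3 grid, i.e. aligned_height + 1 = 3
print(avg_pool2x2_stride1(grid))   # -> [[2.0, 3.0], [5.0, 6.0]]
```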