Faster RCNN Source Code Walkthrough (3): roi_align

The code in this series comes from: https://github.com/jwyang/faster-rcnn.pytorch
Feel free to star it; it currently supports the PyTorch 1.0 series.

References (the theory part is adapted from two articles; these are personal notes):
詳解 ROI Align 的基本原理和實現(xiàn)細(xì)節(jié) (in Chinese)
RoIPooling、RoIAlign筆記 (in Chinese)

Why roi-align

ROI Align is a region feature aggregation method introduced in the Mask R-CNN paper. It cleanly resolves the region mis-alignment caused by the two quantization steps in ROI Pooling. Experiments show that replacing ROI Pooling with ROI Align in detection tasks improves model accuracy.

1. The limitation of RoI Pooling (the cause of mis-alignment)

In common two-stage detection frameworks (e.g. Fast-RCNN, Faster-RCNN, R-FCN), ROI Pooling takes the coordinates of a proposal and pools the corresponding region of the feature map into a fixed-size feature map for the subsequent classification and bounding-box regression. Since proposal coordinates are regressed by the model and are therefore generally floating-point, while the pooled feature map must have a fixed size, ROI Pooling performs two quantization steps:

  • Quantize the candidate-box boundaries to integer coordinates.
  • Divide the quantized region evenly into k x k bins, and quantize the boundaries of each bin.

After these two quantizations, the candidate box has drifted from the originally regressed position, and this offset degrades detection or segmentation accuracy. In the paper, the authors call this the "misalignment" problem.

(Figure: ROIPool.png — the two quantizations in RoI Pooling)

For the figure above:

  • The conv layers use VGG16 with feat_stride=32 (i.e., the image is downsampled to 1/32 of its original size by the network). For an 800*800 input, the final feature map is 25*25.

  • Suppose the original image contains a 665*665 region proposal. Mapped onto the feature map, its size is 665/32 = 20.78, i.e. 20.78*20.78. If you have read the C++ source of Caffe's RoI Pooling, the computation rounds this value down, so the first quantization maps the proposal to a 20*20 region on the feature map.

  • Suppose pooled_w=7 and pooled_h=7, i.e. pooling produces a fixed 7*7 feature map. The 20*20 mapped region is divided into 49 equal bins, each of size 20/7 = 2.86, i.e. 2.86*2.86. Here the second quantization shrinks each bin to 2*2.

  • Within each 2*2 bin, the maximum pixel value is taken as that bin's 'representative'; the 49 bins thus yield 49 values, which form the 7*7 feature map.

  • In summary: after the two quantizations (rounding floats down to integers), the region actually pooled no longer matches the originally mapped 20.78*20.78 proposal. This pixel-level offset on the feature map, multiplied back by the stride of 32 in image space, inevitably hurts the later regression and localization.
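The two rounding steps above can be made concrete in a few lines of Python (a minimal sketch; the variable names are my own, and the numbers mirror the example above):

```python
import math

feat_stride = 32   # backbone downsampling factor (example above)
roi_size = 665     # proposal side length in the original image
pooled = 7         # target pooled output side length

# First quantization: map the RoI to the feature map, then round down
mapped = roi_size / feat_stride      # 20.78125
mapped_q = math.floor(mapped)        # 20

# Second quantization: split into pooled x pooled bins, then round down again
bin_size = mapped_q / pooled         # 2.857...
bin_size_q = math.floor(bin_size)    # 2

# Offsets introduced by each rounding step, scaled back to image pixels
step1_offset = (mapped - mapped_q) * feat_stride       # 25.0 px
step2_offset = (bin_size - bin_size_q) * feat_stride   # ~27.4 px per bin
print(mapped_q, bin_size_q, step1_offset)  # 20 2 25.0
```

Even the first rounding alone already costs about 25 pixels in the original image, which is substantial for localization.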

Hence the replacement: RoI Align.

2. The principle of RoI Align
(Figure: ROIAlign.png)

Again, for the figure above, the mapping is analogous:

  • The conv layers use VGG16 with feat_stride=32 (i.e., the image is downsampled to 1/32 of its original size by the network). For an 800*800 input, the final feature map is 25*25.
  • Suppose the original image contains a 665*665 region proposal. Mapped onto the feature map, its size is 665/32 = 20.78, i.e. 20.78*20.78. This time, unlike RoI Pooling, no rounding is performed: the floating-point value is kept.
  • Suppose pooled_w=7 and pooled_h=7, i.e. pooling produces a fixed 7*7 feature map. The 20.78*20.78 mapped region is divided into 49 equal bins, each of size 20.78/7 = 2.97, i.e. 2.97*2.97.
  • Suppose 4 sampling points are used: each 2.97*2.97 bin is split into four equal parts, and the center of each part is taken as a sampling point. The pixel value at each center is computed by bilinear interpolation, giving four values per bin, as in the figure below.
(Figure: sampling points within a bin, from the referenced blog)

In the figure above, the values at the four red crosses '×' are obtained by bilinear interpolation. Finally, the maximum of the four values is taken as the value of the bin (the 2.97*2.97 region). As before, the 49 bins yield 49 values, which form the 7*7 feature map.
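Bilinear interpolation at one sampling point can be sketched as follows (a self-contained toy, not code from the repo; `feat` and `bilinear` are hypothetical names):

```python
def bilinear(feat, y, x):
    """Interpolate the 2-D grid `feat` at float coordinates (y, x)."""
    y0, x0 = int(y), int(x)      # top-left integer neighbor
    dy, dx = y - y0, x - x0      # fractional offsets in [0, 1)
    return (feat[y0][x0]         * (1 - dy) * (1 - dx)
          + feat[y0][x0 + 1]     * (1 - dy) * dx
          + feat[y0 + 1][x0]     * dy       * (1 - dx)
          + feat[y0 + 1][x0 + 1] * dy       * dx)

# Toy 2x2 feature map: sampling the exact center averages all four values
feat = [[1.0, 2.0],
        [3.0, 4.0]]
print(bilinear(feat, 0.5, 0.5))  # 2.5
```

Each sampled value is a weighted average of the four surrounding integer-grid pixels, so no coordinate ever needs to be rounded.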

roi-align code walkthrough

The directory layout of this part matches roi_pooling; see: Faster RCNN Source Code Walkthrough (2): roi_pooling

  1. First, the CPU C implementation, roi_align.c
#include <TH/TH.h> // PyTorch's C extension API
#include <math.h>
#include <omp.h> // OpenMP multithreading

// Forward declarations of the forward and backward implementations
void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                     const int height, const int width, const int channels,
                     const int aligned_height, const int aligned_width, const float * bottom_rois,
                     float* top_data);

void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
                     const int height, const int width, const int channels,
                     const int aligned_height, const int aligned_width, const float * bottom_rois,
                     float* top_data);

int roi_align_forward(int aligned_height, int aligned_width, float spatial_scale,
                     THFloatTensor * features, THFloatTensor * rois, THFloatTensor * output)
{
    //Grab the input tensor
    // features is a contiguous NCHW tensor viewed through a flat pointer;
    // the kernel below indexes it as ((n * channels + c) * height + h) * width + w
    float * data_flat = THFloatTensor_data(features);
    // rois_flat = [..., [batch_index, x1, y1, x2, y2], ...]
    float * rois_flat = THFloatTensor_data(rois);

    float * output_flat = THFloatTensor_data(output);

    // Number of ROIs
    int num_rois = THFloatTensor_size(rois, 0);
    int size_rois = THFloatTensor_size(rois, 1);
    
    // ROI = [batch_index x1 y1 x2 y2]
    if (size_rois != 5) 
    {
        return 0;
    }

    
    // data height
    int data_height = THFloatTensor_size(features, 2);
    // data width
    int data_width = THFloatTensor_size(features, 3);
    // Number of channels
    int num_channels = THFloatTensor_size(features, 1);

    // do ROIAlignForward: call the standalone forward function
    ROIAlignForwardCpu(data_flat, spatial_scale, num_rois, data_height, data_width, num_channels,
            aligned_height, aligned_width, rois_flat, output_flat);

    return 1;
}

int roi_align_backward(int aligned_height, int aligned_width, float spatial_scale,
                       THFloatTensor * top_grad, THFloatTensor * rois, THFloatTensor * bottom_grad)
{
    //Grab the input tensor
    float * top_grad_flat = THFloatTensor_data(top_grad);
    float * rois_flat = THFloatTensor_data(rois);

    float * bottom_grad_flat = THFloatTensor_data(bottom_grad);

    // Number of ROIs
    int num_rois = THFloatTensor_size(rois, 0);
    int size_rois = THFloatTensor_size(rois, 1);
    if (size_rois != 5)
    {
        return 0;
    }

    // batch size
    // int batch_size = THFloatTensor_size(bottom_grad, 0);
    // data height
    int data_height = THFloatTensor_size(bottom_grad, 2);
    // data width
    int data_width = THFloatTensor_size(bottom_grad, 3);
    // Number of channels
    int num_channels = THFloatTensor_size(bottom_grad, 1);

    // do ROIAlignBackward: call the standalone backward function
    ROIAlignBackwardCpu(top_grad_flat, spatial_scale, num_rois, data_height,
            data_width, num_channels, aligned_height, aligned_width, rois_flat, bottom_grad_flat);

    return 1;
}

void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                     const int height, const int width, const int channels,
                     const int aligned_height, const int aligned_width, const float * bottom_rois,
                     float* top_data)
{
    // total number of output elements
    const int output_size = num_rois * aligned_height * aligned_width * channels;

    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        // (n, c, ph, pw) is an element in the aligned output
        int pw = idx % aligned_width; // column within the aligned output
        int ph = (idx / aligned_width) % aligned_height; // row within the aligned output
        int c = (idx / aligned_width / aligned_height) % channels; // channel index
        int n = idx / aligned_width / aligned_height / channels; // RoI index

        // bottom_rois is rois_flat; each RoI holds five values:
        // [batch_index, x1, y1, x2, y2]
        float roi_batch_ind = bottom_rois[n * 5 + 0];
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        // Force malformed ROI to be 1x1
        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
        // spacing between sample points; note the division by (aligned_* - 1):
        // this implementation takes one sample per output cell on a grid that
        // spans the RoI, rather than 4 samples per bin as in the paper figure
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        // coordinates of the sample point for this output cell
        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;

        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);

        int img_start = roi_batch_ind * channels * height * width;

        // bilinear interpolation over the four integer neighbors
        if (h < 0 || h >= height || w < 0 || w >= width)
        {
            top_data[idx] = 0.;
        }
        else
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;
            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;

            top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                + bottom_data[upright] * (1. - h_ratio) * w_ratio
                + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                + bottom_data[downright] * h_ratio * w_ratio;
        }
    }
}
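The `idx` → `(n, c, ph, pw)` decoding at the top of the loop is a mixed-radix decomposition of a flat index over a `(num_rois, channels, aligned_height, aligned_width)` output. A Python sketch of the same arithmetic (the function name and sample numbers are illustrative):

```python
def decode(idx, aligned_width, aligned_height, channels):
    """Mirror the pw/ph/c/n computation in ROIAlignForwardCpu."""
    pw = idx % aligned_width
    ph = (idx // aligned_width) % aligned_height
    c = (idx // (aligned_width * aligned_height)) % channels
    n = idx // (aligned_width * aligned_height * channels)
    return n, c, ph, pw

# Round trip against the flat-index formula ((n*C + c)*H + ph)*W + pw
aw, ah, ch = 7, 7, 512
idx = ((3 * ch + 100) * ah + 4) * aw + 6
print(decode(idx, aw, ah, ch))  # (3, 100, 4, 6)
```

Iterating one flat index and decoding it, instead of four nested loops, keeps the loop body trivially parallelizable (hence the omp.h include).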

void ROIAlignBackwardCpu(const float* top_diff, const float spatial_scale, const int num_rois,
                     const int height, const int width, const int channels,
                     const int aligned_height, const int aligned_width, const float * bottom_rois,
                     float* bottom_diff)
{
    const int output_size = num_rois * aligned_height * aligned_width * channels;

    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        // (n, c, ph, pw) is an element in the aligned output
        int pw = idx % aligned_width;
        int ph = (idx / aligned_width) % aligned_height;
        int c = (idx / aligned_width / aligned_height) % channels;
        int n = idx / aligned_width / aligned_height / channels;

        float roi_batch_ind = bottom_rois[n * 5 + 0];
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        // Force malformed ROI to be 1x1
        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;

        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);

        int img_start = roi_batch_ind * channels * height * width;

        // bilinear interpolation: accumulate gradients only for in-bounds
        // sample points (forward wrote 0 for out-of-bounds points), using the
        // same four weights as forward. Note the negation here, which the
        // forward pass expresses with an if/else instead.
        if (!(h < 0 || h >= height || w < 0 || w >= width))
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;
            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;

            bottom_diff[upleft] += top_diff[idx] * (1. - h_ratio) * (1. - w_ratio);
            bottom_diff[upright] += top_diff[idx] * (1. - h_ratio) *  w_ratio;
            bottom_diff[downleft] += top_diff[idx] * h_ratio * (1. - w_ratio);
            bottom_diff[downright] += top_diff[idx] * h_ratio * w_ratio;
        }
    }
}
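Backward scatters `top_diff[idx]` to the same four neighbors with the same four weights that forward used to gather, and those weights always sum to 1, so the gradient mass of each output element is fully distributed. A quick check of that invariant (pure Python; the function name is mine):

```python
def bilinear_weights(h_ratio, w_ratio):
    """The four weights shared by ROIAlignForwardCpu and ROIAlignBackwardCpu."""
    return [(1 - h_ratio) * (1 - w_ratio),  # upleft
            (1 - h_ratio) * w_ratio,        # upright
            h_ratio * (1 - w_ratio),        # downleft
            h_ratio * w_ratio]              # downright

# The four weights form a partition of unity for any offsets in [0, 1)
for hr, wr in [(0.0, 0.0), (0.3, 0.7), (0.99, 0.5)]:
    assert abs(sum(bilinear_weights(hr, wr)) - 1.0) < 1e-12
print("weights sum to 1")
```

This symmetry between gather (forward) and scatter (backward) is exactly what makes the op differentiable, in contrast to RoI Pooling's hard rounding.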
  2. Next, functions/roi_align.py, which calls the roi_align op implemented in src
# --------------------
# The autograd Function for the custom RoI Align layer,
# implementing forward and backward
# --------------------
import torch
from torch.autograd import Function
from .._ext import roi_align


# TODO use save_for_backward instead
class RoIAlignFunction(Function):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)
        self.rois = None
        self.feature_size = None

    def forward(self, features, rois):
        self.rois = rois
        self.feature_size = features.size()

        batch_size, num_channels, data_height, data_width = features.size()
        num_rois = rois.size(0)

        output = features.new(num_rois, num_channels, self.aligned_height, self.aligned_width).zero_()
        if features.is_cuda:
            roi_align.roi_align_forward_cuda(self.aligned_height,
                                             self.aligned_width,
                                             self.spatial_scale, features,
                                             rois, output)
        else:
            roi_align.roi_align_forward(self.aligned_height,
                                        self.aligned_width,
                                        self.spatial_scale, features,
                                        rois, output)
#            raise NotImplementedError

        return output

    def backward(self, grad_output):
        # backward here is only wired up for the CUDA extension
        assert(self.feature_size is not None and grad_output.is_cuda)

        batch_size, num_channels, data_height, data_width = self.feature_size

        grad_input = self.rois.new(batch_size, num_channels, data_height,
                                  data_width).zero_()
        roi_align.roi_align_backward_cuda(self.aligned_height,
                                          self.aligned_width,
                                          self.spatial_scale, grad_output,
                                          self.rois, grad_input)

        # print grad_input

        return grad_input, None

  3. Finally, modules/roi_align.py, which defines the roi_align layer as a Module by calling RoIAlignFunction from functions/roi_align.py
# --------------------
# Modules wrapping the Function above into custom RoI Align layers:
# plain RoIAlign plus average-pooling and max-pooling variants
# --------------------
from torch.nn.modules.module import Module
from torch.nn.functional import avg_pool2d, max_pool2d
from ..functions.roi_align import RoIAlignFunction


class RoIAlign(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlign, self).__init__()

        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        return RoIAlignFunction(self.aligned_height, self.aligned_width,
                                self.spatial_scale)(features, rois)

class RoIAlignAvg(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlignAvg, self).__init__()

        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        x =  RoIAlignFunction(self.aligned_height+1, self.aligned_width+1,
                                self.spatial_scale)(features, rois)
        return avg_pool2d(x, kernel_size=2, stride=1)

class RoIAlignMax(Module):
    def __init__(self, aligned_height, aligned_width, spatial_scale):
        super(RoIAlignMax, self).__init__()

        self.aligned_width = int(aligned_width)
        self.aligned_height = int(aligned_height)
        self.spatial_scale = float(spatial_scale)

    def forward(self, features, rois):
        x =  RoIAlignFunction(self.aligned_height+1, self.aligned_width+1,
                                self.spatial_scale)(features, rois)
        return max_pool2d(x, kernel_size=2, stride=1)
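RoIAlignAvg and RoIAlignMax ask RoIAlignFunction for an (aligned_height+1) x (aligned_width+1) grid and then pool with a 2x2 kernel at stride 1. This works because of the standard pooling size formula: (n+1) - 2 + 1 = n, so the output returns to the requested aligned size while each final cell aggregates four neighboring grid samples. A sketch of the arithmetic (the function name is mine):

```python
def pooled_len(n, kernel=2, stride=1):
    """1-D pooling output length: floor((n - kernel) / stride) + 1."""
    return (n - kernel) // stride + 1

# Sampling aligned+1 points, then 2x2 / stride-1 pooling, restores `aligned`
for aligned in (7, 14):
    assert pooled_len(aligned + 1) == aligned
print(pooled_len(7 + 1))  # 7
```

So even though the C kernel takes only one sample per cell, these wrappers recover a 4-samples-per-bin behavior close to the paper's description.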
