前言
cuda對(duì)圖像處理過程進(jìn)行加速是很常見的操作,而且圖像處理算法中sobel算子又是最為簡單的算法,本篇博客就是使用cuda opencv c++實(shí)現(xiàn)對(duì)圖像進(jìn)行sobel邊緣提取加速操作,cpu和gpu處理速度進(jìn)行一個(gè)對(duì)比試驗(yàn)
具體流程
首先貼上代碼
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <device_functions.h>
#include <opencv2\opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
//Sobel算子邊緣檢測(cè)核函數(shù)
__global__ void sobelInCuda(unsigned char* dataIn, unsigned char* dataOut, int imgHeight, int imgWidth)
{
int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
int index = yIndex * imgWidth + xIndex;
//printf("xIndex : %d, yIndex: %d\n", xIndex, yIndex);
//printf("index : %d\n", index);
int Gx = 0;
int Gy = 0;
if (xIndex > 0 && xIndex < imgWidth - 1 && yIndex > 0 && yIndex < imgHeight - 1)
{
Gx = dataIn[(yIndex - 1) * imgWidth + xIndex + 1] + 2 * dataIn[yIndex * imgWidth + xIndex + 1] + dataIn[(yIndex + 1) * imgWidth + xIndex + 1]
- (dataIn[(yIndex - 1) * imgWidth + xIndex - 1] + 2 * dataIn[yIndex * imgWidth + xIndex - 1] + dataIn[(yIndex + 1) * imgWidth + xIndex - 1]);
Gy = dataIn[(yIndex - 1) * imgWidth + xIndex - 1] + 2 * dataIn[(yIndex - 1) * imgWidth + xIndex] + dataIn[(yIndex - 1) * imgWidth + xIndex + 1]
- (dataIn[(yIndex + 1) * imgWidth + xIndex - 1] + 2 * dataIn[(yIndex + 1) * imgWidth + xIndex] + dataIn[(yIndex + 1) * imgWidth + xIndex + 1]);
dataOut[index] = (abs(Gx) + abs(Gy)) / 2;
}
}
//Sobel算子邊緣檢測(cè)CPU函數(shù)
void sobel(Mat srcImg, Mat dstImg, int imgHeight, int imgWidth)
{
int Gx = 0;
int Gy = 0;
for (int i = 1; i < imgHeight - 1; i++)
{
uchar* dataUp = srcImg.ptr<uchar>(i - 1);
uchar* data = srcImg.ptr<uchar>(i);
uchar* dataDown = srcImg.ptr<uchar>(i + 1);
uchar* out = dstImg.ptr<uchar>(i);
for (int j = 1; j < imgWidth - 1; j++)
{
Gx = (dataUp[j + 1] + 2 * data[j + 1] + dataDown[j + 1]) - (dataUp[j - 1] + 2 * data[j - 1] + dataDown[j - 1]);
Gy = (dataUp[j - 1] + 2 * dataUp[j] + dataUp[j + 1]) - (dataDown[j - 1] + 2 * dataDown[j] + dataDown[j + 1]);
out[j] = (abs(Gx) + abs(Gy)) / 2;
}
}
}
int main()
{
Mat grayImg = imread("lena.jpg", 0);
int imgHeight = grayImg.rows;
int imgWidth = grayImg.cols;
Mat gaussImg;
//高斯濾波
GaussianBlur(grayImg, gaussImg, Size(3, 3), 0, 0, BORDER_DEFAULT);
//Sobel算子CPU實(shí)現(xiàn)
Mat dst(imgHeight, imgWidth, CV_8UC1, Scalar(0));
double time1 = static_cast<double>(cv::getTickCount());
sobel(gaussImg, dst, imgHeight, imgWidth);
double time2 = static_cast<double>(cv::getTickCount());
std::cout << "cpu Time use: " << 1000 * (time2 - time1) / cv::getTickFrequency() << "ms" << std::endl;//輸出運(yùn)行時(shí)間
cv::imwrite("dst_cpu.jpg", dst);
//CUDA實(shí)現(xiàn)后的傳回的圖像
Mat dstImg(imgHeight, imgWidth, CV_8UC1, Scalar(0));
//創(chuàng)建GPU內(nèi)存
unsigned char* d_in;
unsigned char* d_out;
cudaMalloc((void**)&d_in, imgHeight * imgWidth * sizeof(unsigned char));
cudaMalloc((void**)&d_out, imgHeight * imgWidth * sizeof(unsigned char));
//將高斯濾波后的圖像從CPU傳入GPU
cudaMemcpy(d_in, gaussImg.data, imgHeight * imgWidth * sizeof(unsigned char), cudaMemcpyHostToDevice);
dim3 threadsPerBlock(32, 32);
dim3 blocksPerGrid((imgWidth + threadsPerBlock.x - 1) / threadsPerBlock.x, (imgHeight + threadsPerBlock.y - 1) / threadsPerBlock.y);
//記錄起始時(shí)間
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
//調(diào)用核函數(shù)
sobelInCuda << <blocksPerGrid, threadsPerBlock >> > (d_in, d_out, imgHeight, imgWidth);
//將圖像傳回GPU
cudaMemcpy(dstImg.data, d_out, imgHeight * imgWidth * sizeof(unsigned char), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("gpu time cost: %3.5f ms", elapsedTime);
cv::imwrite("dst_gpu.jpg", dstImg);
//釋放GPU內(nèi)存
cudaFree(d_in);
cudaFree(d_out);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
當(dāng)然,這里的代碼是參考別人。下面著重講講對(duì)這部分代碼的理解吧
首先,Sobel算子:是離散微分算子(discrete differentiation operator),用來計(jì)算圖像灰度的近似梯度,梯度越大越有可能是邊緣。sobel邊緣提取就是使用sobel算子對(duì)圖像進(jìn)行卷積操作,這里sobel算子選用的是一個(gè)3*3的矩陣

sobel算子cpu的實(shí)現(xiàn),這樣沒有太多可說的,基本上就是對(duì)一個(gè)二維矩陣進(jìn)行矩陣乘積操作,然后再將矩陣乘積結(jié)果回填到目標(biāo)像素位置
而sobel算子的gpu實(shí)現(xiàn),則相對(duì)來說比較復(fù)雜一些。首先創(chuàng)建了block和grid的尺寸,即代碼中的threadsPerBlock和blocksPerGrid,使用cudaMalloc分配圖像尺寸大小的內(nèi)存空間,用來作為核函數(shù)的輸入變量和輸出變量,最后調(diào)用核函數(shù)完成并行操作,即sobel邊緣提取
再看看核函數(shù)內(nèi)部的實(shí)現(xiàn)
int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
這兩行代碼主要是獲取線程塊內(nèi)的坐標(biāo)位置,這個(gè)位置和圖像矩陣位置是一一對(duì)應(yīng)的
int index = yIndex * imgWidth + xIndex;
而這行代碼,則是實(shí)現(xiàn)圖像矩陣坐標(biāo)和data_in, data_out這兩個(gè)指針里面存儲(chǔ)的一維數(shù)組的映射關(guān)系,(xIndex, yIndex)該坐標(biāo)像素值即為每次線程運(yùn)行時(shí),做sobel運(yùn)算時(shí)的中心坐標(biāo),而實(shí)際求Gx和Gy時(shí),還需要獲取該中心坐標(biāo)周圍的8個(gè)坐標(biāo)值來進(jìn)行運(yùn)算,所以是這樣進(jìn)行書寫的代碼
Gx = dataIn[(yIndex - 1) * imgWidth + xIndex + 1] + 2 * dataIn[yIndex * imgWidth + xIndex + 1] + dataIn[(yIndex + 1) * imgWidth + xIndex + 1]
- (dataIn[(yIndex - 1) * imgWidth + xIndex - 1] + 2 * dataIn[yIndex * imgWidth + xIndex - 1] + dataIn[(yIndex + 1) * imgWidth + xIndex - 1]);
Gy = dataIn[(yIndex - 1) * imgWidth + xIndex - 1] + 2 * dataIn[(yIndex - 1) * imgWidth + xIndex] + dataIn[(yIndex - 1) * imgWidth + xIndex + 1]
- (dataIn[(yIndex + 1) * imgWidth + xIndex - 1] + 2 * dataIn[(yIndex + 1) * imgWidth + xIndex] + dataIn[(yIndex + 1) * imgWidth + xIndex + 1]);
運(yùn)算的結(jié)果存回到dataOut
整個(gè)測(cè)試中輸入圖片是女神lena

cpu和gpu輸出的結(jié)果為


可以看出,gpu和cpu的結(jié)果圖是一模一樣的
cpu和gpu的時(shí)間測(cè)試
上一部分?jǐn)⑹隽薱pu和gpu版本的sobel算子邊緣提取,現(xiàn)在來測(cè)試一下兩個(gè)版本算法的運(yùn)行時(shí)間
cpu的版本使用的時(shí)間測(cè)試工具為opencv中自帶的計(jì)時(shí)工具,我一直在使用,比較靠譜。測(cè)試方式也比較簡單,就是在運(yùn)行sobel函數(shù)前后加上 static_cast<double>(cv::getTickCount())函數(shù),然后前后兩次count數(shù)相減再除于count的頻次就可以了,這里我將count乘以1000,這樣測(cè)試得到的時(shí)間基本單位就是ms
gpu版本使用的時(shí)間測(cè)試工具是cuda自帶的工具cudaEventElapsedTime,在計(jì)算時(shí)延之前還需要進(jìn)行cudaEventRecord操作,避免沒有g(shù)pu線程沒有完全執(zhí)行結(jié)束
最終測(cè)試的結(jié)果如下:

可以看出,對(duì)于同一個(gè)輸入圖片,gpu的運(yùn)行時(shí)間是cpu大概1/5的樣子
這里我對(duì)gpu進(jìn)行計(jì)時(shí)的時(shí)候?qū)D像從gpu傳到cpu的操作也進(jìn)行了計(jì)時(shí),如果將這部分除去,gpu的時(shí)間會(huì)更短

可以看出,這次sobel算子邊緣提取,真正在gpu上進(jìn)行運(yùn)算的時(shí)間僅為0.02ms,絕大部分時(shí)間是將數(shù)據(jù)從gpu拷貝到cpu的時(shí)間,這個(gè)時(shí)間會(huì)占據(jù)整個(gè)流程的85%左右
總結(jié)
整個(gè)探索過程還是比較有意義,既熟悉了sobel算子的原理,又熟悉了gpu并行運(yùn)算的整體流程,還做cpu與gpu運(yùn)算版本時(shí)間對(duì)比。