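Configuring OpenCV 2.3 with CUDA 4.0 + VS2008 on 32-bit Windows 7
- Download the required files
- Download the OpenCV 2.3 files from SourceForge:
- OpenCV-2.3.0-win-superpack.exe contains both the source code and prebuilt binaries; this tutorial uses the superpack as its example;
- OpenCV-2.3.0-win-src.zip contains the source code only;
- OpenCV-2.3.0-gpu-support-win32-vs2008.zip is a supplementary package that adds GPU support to OpenCV-2.3.0-win-superpack.exe. Download it if you want to use the prebuilt binaries from the superpack and need CUDA support.
- Run OpenCV-2.3.0-win-superpack.exe to obtain the OpenCV2.3 folder; its build folder holds the prebuilt binaries and its opencv folder holds the source code;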
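- Build from source (skip this step if you use the prebuilt binaries from the superpack)
- Download and install CMake: http://www.cmake.org/
- Run CMake (cmake-gui) and fill in the source location and the build output location (I used CMake 2.8.7; the layout may differ slightly between versions): PIC1
- Click the configure button at the bottom and select the compiler: PIC2 Click Finish and wait for configure to complete;
- When configure has finished, set the build options: PIC3 If you need CUDA 4.0, make sure WITH_CUDA is checked: PIC4 Leave the other options at their defaults, or adjust them to your needs;
- When the options are set, click Generate. An OpenCV project is generated in the build output folder entered in step 2. Open OpenCV.sln, build the solution in both the Debug and Release configurations, and wait for the builds to finish;
- After the builds finish, the bin and lib folders contain the Debug and Release DLL, EXE, and LIB files. This completes the source build.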
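- Configure OpenCV 2.3
- Create an OpenCV23 folder (D:\OpenCV23 in this example) and copy the build folder extracted from the superpack into it;
- If you built from source, overwrite D:\OpenCV23\build\x86\vc9\bin with the Debug and Release DLLs and EXEs produced in the last step of the source build, and overwrite D:\OpenCV23\build\x86\vc9\lib with the LIBs. If you use the superpack's prebuilt binaries, copy the DLLs and LIBs from OpenCV-2.3.0-gpu-support-win32-vs2008.zip into those two directories, replacing any files of the same name;
- Add D:\OpenCV23\build\x86\vc9\bin to the PATH environment variable;
- In VS, under Tools -> Options -> VC++ Directories, add to Include files:
D:\OpenCV23\build\include\opencv
D:\OpenCV23\build\include
and add to Library files:
D:\OpenCV23\build\x86\vc9\lib
- In each project you create, add the following Additional Dependencies.
DEBUG:
D:\OpenCV23\build\x86\vc9\lib\opencv_core230d.lib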
D:\OpenCV23\build\x86\vc9\lib\opencv_highgui230d.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_video230d.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_ml230d.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_legacy230d.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_imgproc230d.lib
RELEASE:
D:\OpenCV23\build\x86\vc9\lib\opencv_core230.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_highgui230.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_video230.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_ml230.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_legacy230.lib
D:\OpenCV23\build\x86\vc9\lib\opencv_imgproc230.lib
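- This completes the configuration of OpenCV 2.3 with CUDA 4.0 + VS2008 on 32-bit Windows 7.
- Problems encountered
While compiling against OpenCV 2.3 with CUDA 4.0, 7rack ran into errors such as: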
1>D:\OpenCV2.3\build\include\opencv2/core/operations.hpp(1265): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/index_testing.h(144): warning: variable "p1" was set but never used
1>D:\OpenCV2.3\build\include\opencv2/flann/index_testing.h(221): warning: variable "p1" was set but never used
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(519): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(526): error: expected a ">"
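These are errors in the OpenCV 2.3 headers, triggered by the way the nvcc compiler parses them. The workaround is to wrap each operand of the comparison in parentheses. Taking operations.hpp(1265) above as the example:
operations.hpp, line 1265:
CV_DbgAssert( this->idx < cn ); becomes: CV_DbgAssert( (this->idx) < (cn) );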
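Installing Ubuntu 11.10 from the Hard Disk under Windows 7
- Download the Ubuntu 11.10 image (this tutorial also works for 10.xx);
- The hard disk needs 20 GB or more of unallocated space for Ubuntu (how much to give it is up to you). If there is none, open Disk Management and use Shrink Volume to carve unallocated space out of an existing partition.
- Put the image on the C: drive (strictly speaking the Windows 7 system drive, assumed here to be C:), then extract the two files initrd.lz and vmlinuz from the casper folder inside the image and place them in the root of C:;
- Download EasyBCD (from its official website), install it, and run it. Choose Add New Entry -> NeoGrub -> Install -> Configure, and add the following text to the menu.lst file that opens: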
title Install Ubuntu 11.10
root (hd0,0)
kernel (hd0,0)/vmlinuz boot=casper iso-scan/filename=/ubuntu-11.10-desktop-i386.iso ro quiet splash locale=zh_CN.UTF-8
initrd (hd0,0)/initrd.lz
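where the name in
filename=/ubuntu-11.10-desktop-i386.iso
must exactly match the name of the image placed on the C: drive earlier (including case and the .iso extension). Save menu.lst and exit;
- Reboot, choose NeoGrub Bootloader, then choose Install Ubuntu 11.10 to boot into the Ubuntu installation environment;
- Before installing Ubuntu, open a terminal and run:
sudo umount -l /isodevice
This command unmounts the partition that holds the ISO image (the C: drive), so that the installer can work on the disk;
- Install Ubuntu. You can choose the option to install alongside Windows 7 and let Ubuntu partition the disk itself, or partition manually;
- After installation, reboot. The boot menu should show 4 Ubuntu entries and 1 Windows 7 entry. If the Windows 7 entry is missing, boot into Ubuntu and run in a terminal:
sudo update-grub
to restore the Windows 7 entry;
- Reboot into Windows 7, open EasyBCD, click Edit Boot Menu, and delete the NeoGrub Bootloader entry;
- This completes the hard-disk installation of Ubuntu under Windows 7.
- To remove Ubuntu later, delete the Ubuntu partitions with a partitioning tool and then restore the MBR of the Windows 7 system disk; this removes Ubuntu completely.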
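CUDA Study Notes 11: Constant Memory and Events
#include <stdio.h>
#include <stdlib.h>
#include "common/cpu_bitmap.h"
// Switch: define CONSTANT to keep the spheres in __constant__ memory
#define CONSTANT
#define INF 2e10f
#define rnd(x) (x*rand()/RAND_MAX)
#define SPHERES 200
#define DIM 800
// Sphere structure
struct Sphere {
float r, g, b; // sphere color
float radius; // sphere radius
float x, y, z; // sphere center
// Tests whether the ray shot from pixel (ox, oy) hits this sphere; returns the z coordinate of the hit point, or -INF on a miss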
__device__ float hit(float ox, float oy, float *n)
{
float dx = ox - x;
float dy = oy - y;
if (dx*dx + dy*dy < radius*radius)
{
float dz = sqrtf(radius*radius - dx*dx - dy*dy);
*n = dz/sqrtf(radius*radius);
return dz + z;
}
return -INF;
}
};
#ifdef CONSTANT
// Spheres stored in __constant__ memory
__constant__ Sphere s[SPHERES];
#else
Sphere *s;
#endif
#ifdef CONSTANT
// A __constant__ array is visible to the kernel as a global symbol, so it is not passed as a kernel argument
__global__ void kernel(unsigned char * ptr)
#else
// An ordinary device pointer must be passed to the kernel as an argument
__global__ void kernel(unsigned char * ptr, Sphere *s)
#endif
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
float ox = (x - DIM/2);
float oy = (y - DIM/2);
float r=0,g=0,b=0;
// Find the nearest hit point (largest z)
float maxz = -INF;
for (int i=0; i<SPHERES; i++)
{
float n;
float t = s[i].hit(ox, oy, &n);
if (t>maxz)
{
float fscale = n;
r = s[i].r * fscale;
g = s[i].g * fscale;
b = s[i].b * fscale;
maxz = t;
}
}
ptr[offset*4 + 0] = (int)(r*255);
ptr[offset*4 + 1] = (int)(g*255);
ptr[offset*4 + 2] = (int)(b*255);
ptr[offset*4 + 3] = 255;
}
int main(void)
{
// Use CUDA events to measure performance
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
CPUBitmap bitmap(DIM, DIM);
unsigned char * dev_bitmap;
cudaMalloc((void**) &dev_bitmap, bitmap.image_size());
#ifdef CONSTANT
// __constant__ memory is statically allocated; nothing to allocate here
#else
// Allocate device memory for the sphere array
cudaMalloc((void**) &s, sizeof(Sphere) * SPHERES);
#endif
// Generate the sphere data on the CPU
Sphere *temp_s = (Sphere *)malloc( sizeof(Sphere) * SPHERES);
for (int i=0; i<SPHERES; i++)
{
temp_s[i].r = rnd(1.0f);
temp_s[i].g = rnd(1.0f);
temp_s[i].b = rnd(1.0f);
temp_s[i].x = rnd(1000.f) - 500;
temp_s[i].y = rnd(1000.f) - 500;
temp_s[i].z = rnd(1000.f) - 500;
temp_s[i].radius = rnd(100.f) + 20;
}
#ifdef CONSTANT
// Copy from the CPU into __constant__ memory
cudaMemcpyToSymbol(s, temp_s, sizeof(Sphere) * SPHERES);
#else
// Copy from the CPU to GPU global memory
cudaMemcpy(s, temp_s, sizeof(Sphere) * SPHERES, cudaMemcpyHostToDevice);
#endif
free(temp_s);
dim3 grids(DIM/16, DIM/16);
dim3 threads(16, 16);
#ifdef CONSTANT
kernel<<<grids, threads>>>(dev_bitmap);
#else
kernel<<<grids, threads>>>(dev_bitmap, s);
#endif
cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
// Wait until the stop event has actually been recorded
cudaEventSynchronize(stop);
float elapseTime;
cudaEventElapsedTime(&elapseTime, start, stop);
printf("Time to generate: %3.1f ms\n", elapseTime);
cudaEventDestroy(start);
cudaEventDestroy(stop);
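// Free device memory before display_and_exit(), which is not expected to return
cudaFree(dev_bitmap);
#ifdef CONSTANT
// __constant__ memory does not need to be freed
#else
cudaFree(s);
#endif
bitmap.display_and_exit();
return 0;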
}
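- Key points:
- Use the __constant__ qualifier to place a variable in constant memory;
- Constant memory is statically allocated, so there is no need to call cudaMalloc() or cudaFree(); data is copied into it with cudaMemcpyToSymbol();
- A CUDA event is essentially a GPU timestamp recorded at a point in time chosen by the user;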
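The hit test inside Sphere::hit() is easy to check on the CPU. The following is a minimal host-side C++ sketch of it, not the device code itself: HostSphere, kInf, kSpheres and kConstantBytes are names invented here, and dz / radius stands in for the listing's dz/sqrtf(radius*radius) under the assumption that radius is positive. The static_assert checks that 200 spheres fit in the 64 KB of constant memory that CUDA devices provide.

```cpp
#include <cmath>
#include <cstddef>

constexpr float kInf = 2e10f;  // same value as INF in the listing

// Host-side copy of the Sphere layout (color, radius, center).
struct HostSphere {
    float r, g, b;
    float radius;
    float x, y, z;

    // A ray shot along z from pixel (ox, oy) hits the sphere when the pixel
    // lies inside the sphere's projected disc. Returns the z coordinate of
    // the hit point, or -kInf on a miss. *n is written only on a hit.
    float hit(float ox, float oy, float *n) const {
        float dx = ox - x;
        float dy = oy - y;
        if (dx * dx + dy * dy < radius * radius) {
            float dz = std::sqrt(radius * radius - dx * dx - dy * dy);
            *n = dz / radius;  // shading factor: 1 at the center, 0 at the rim
            return dz + z;
        }
        return -kInf;
    }
};

// Hypothetical compile-time check: the sphere array must fit in constant memory.
constexpr std::size_t kSpheres = 200;
constexpr std::size_t kConstantBytes = 64 * 1024;
static_assert(sizeof(HostSphere) * kSpheres <= kConstantBytes,
              "sphere array exceeds constant memory");
```

A ray through the center of a sphere of radius 10 centered at z = 5 hits at z = 15 with shading factor 1, while a ray outside the projected disc returns -kInf.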
CUDA Study Notes 10: A Bitmap Based on Shared Memory
#include <stdio.h>
#include "common/cpu_bitmap.h"
#define DIM 512
#define PI 3.1415926535897932f
__global__ void kernel( unsigned char * ptr )
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
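// Shared memory buffer, one element per thread of the block
__shared__ float shared[16][16];
const float period = 128.0f;
// Each thread writes its own element of shared memory
shared[threadIdx.x][threadIdx.y] = 255*(sinf(x*2.0f*PI/period)+1.0f)*
(sinf(y*2.0f*PI/period)+1.0f)/4.0f;
// Synchronize the threads of the block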
__syncthreads();
ptr[ offset*4 + 0 ] = 0;
ptr[ offset*4 + 1 ] = shared[15-threadIdx.x][15-threadIdx.y];
ptr[ offset*4 + 2 ] = 0;
ptr[ offset*4 + 3 ] = 255;
}
int main()
{
CPUBitmap bitmap( DIM, DIM );
unsigned char *dev_bitmap;
cudaMalloc( (void**) &dev_bitmap, bitmap.image_size() );
dim3 grids( DIM/16, DIM/16 );
dim3 threads( 16,16 );
kernel<<<grids, threads>>>( dev_bitmap );
cudaMemcpy( bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost );
// Free device memory before display_and_exit(), which is not expected to return
cudaFree(dev_bitmap);
bitmap.display_and_exit();
return 0;
}
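- Key points:
- Thread synchronization costs some performance, but leaving it out where it is needed produces wrong results, so synchronize whenever it is required. As a rule, call __syncthreads() after writing to shared memory and before any thread reads a value written by another thread;
- If the following line in kernel():
ptr[ offset*4 + 1 ] = shared[15-threadIdx.x][15-threadIdx.y];
is changed to:
ptr[ offset*4 + 1 ] = shared[threadIdx.x][threadIdx.y];
then no synchronization is needed, because each thread reads back only the element it wrote itself.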
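Why the mirrored read needs a barrier can be shown with a small host-side C++ emulation. This is only a sketch under a strong assumption: the 256 "threads" of one 16x16 block run one after another in index order, which is just one hypothetical schedule (a real GPU runs warps concurrently and in no guaranteed order). The function name count_stale_reads is made up for this example.

```cpp
#include <vector>

// Emulates one 16x16 block on the host. Returns how many threads read a
// shared-memory cell that had not been written yet.
int count_stale_reads(bool with_barrier) {
    const int dim = 16;
    const int threads = dim * dim;
    std::vector<bool> written(threads, false);
    int stale = 0;

    // Thread t = (tx, ty) reads the cell written by thread (15 - tx, 15 - ty).
    auto read_mirrored = [&](int t) {
        int tx = t % dim, ty = t / dim;
        int src = (dim - 1 - ty) * dim + (dim - 1 - tx);
        if (!written[src]) ++stale;
    };

    if (with_barrier) {
        for (int t = 0; t < threads; ++t) written[t] = true;  // phase 1: all writes
        // the boundary between the two loops plays the role of __syncthreads()
        for (int t = 0; t < threads; ++t) read_mirrored(t);   // phase 2: all reads
    } else {
        for (int t = 0; t < threads; ++t) {
            written[t] = true;   // write own cell ...
            read_mirrored(t);    // ... and read immediately, with no barrier
        }
    }
    return stale;
}
```

Under this schedule the first half of the threads read cells that the second half has not written yet, so half of the reads are stale without the barrier and none are stale with it.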
CUDA Study Notes 9: Dot Product (Shared Memory and Synchronization)
#include <stdio.h>
#define min(a, b) ( (a) < (b) ? (a) : (b) )
#define sum_squares(x) ((x)*((x)+1)*(2*(x)+1)/6)
const int N = 33 * 1024 ;
// Number of threads per block
const int threadPerBlock = 256;
// Number of blocks per grid
const int blocksPerGrid = min( 32, ( (N + threadPerBlock - 1) / threadPerBlock ) );
// Dot product
__global__ void dot( float *a, float *b, float *c )
{
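// Global index of the current thread across the whole grid
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Shared memory of the current block, used as a cache for the per-thread partial sums
__shared__ float cache[threadPerBlock];
// Index into the cache
int cacheIndex = threadIdx.x;
// Accumulate this thread's partial sum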
float temp = 0;
while ( tid < N )
{
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
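// Store the partial sum in the cache
cache[cacheIndex] = temp;
// Synchronize the threads of the current block
__syncthreads();
// Reduction: sum the partial sums of the current block (requires blockDim.x to be a power of two)
int i = blockDim.x / 2;
while( i != 0 )
{
if ( cacheIndex < i )
cache[cacheIndex] += cache[cacheIndex + i];
// Synchronize after every pass; every thread of the block reaches this call
__syncthreads();
i/=2;
}
// Thread 0 stores the block's total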
if (cacheIndex == 0)
c[blockIdx.x] = cache[0];
}
int main()
{
float *a, *b, *c, ans;
float *dev_a, *dev_b, *dev_c;
a = new float[N];
b = new float[N];
c = new float[blocksPerGrid];
// Initialize arrays a and b on the CPU
for (int i=0; i<N; i++)
{
a[i] = i;
b[i] = i;
}
// Allocate memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(float) );
cudaMalloc( (void**)&dev_b, N * sizeof(float) );
cudaMalloc( (void**)&dev_c, blocksPerGrid * sizeof(float) );
// Copy arrays a and b to the GPU
cudaMemcpy( dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice );
// Launch the kernel
dot<<<blocksPerGrid, threadPerBlock>>>(dev_a, dev_b, dev_c);
// Copy array c from the GPU back to the CPU
cudaMemcpy( c, dev_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost );
// Finish the summation on the CPU
ans = 0;
for (int i = 0; i<blocksPerGrid; i++)
ans += c[i];
// Verify the result
printf("the CUDA ans =%.6g; the CPU ans = %.6g\n", ans, sum_squares((float)(N - 1)));
// Free GPU memory
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
// Free CPU memory
delete [] a;
delete [] b;
delete [] c;
return 0;
}
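- Key points
- Threads in the same block can share memory and synchronize with each other: the __shared__ keyword places a variable in shared memory, and __syncthreads() synchronizes the threads of a block;
- Synchronization in the CUDA architecture means that no thread may execute the instructions after __syncthreads() until every thread in the block has executed __syncthreads(). For example, suppose that in dot() the __syncthreads() call is moved inside the if statement:
if (cacheIndex < i)
{
cache[cacheIndex] += cache[cacheIndex + i];
__syncthreads();
}
This looks like it makes the program faster, because threads that fail the condition skip the synchronization. But it causes thread divergence (some threads execute an instruction while others do not), so some threads never reach __syncthreads(). The GPU keeps waiting for those threads and the program hangs.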
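The reduction loop in dot() can be traced on the CPU. Below is a minimal host-side C++ sketch of the same halving scheme, assuming a non-empty cache whose size is a power of two (as threadPerBlock = 256 is); block_reduce is a name chosen for this example, and the inner loop stands in for the threads with cacheIndex < i.

```cpp
#include <cstddef>
#include <vector>

// Sums a copy of `cache` the way the kernel's reduction loop does: on each
// pass, slot j (for j < i) absorbs slot j + i, then i is halved until it is 0.
// Assumes cache is non-empty and cache.size() is a power of two.
float block_reduce(std::vector<float> cache) {
    for (std::size_t i = cache.size() / 2; i != 0; i /= 2) {
        for (std::size_t j = 0; j < i; ++j) {  // the "threads" with cacheIndex < i
            cache[j] += cache[j + i];
        }
        // finishing the inner loop plays the role of __syncthreads()
    }
    return cache[0];
}
```

With 256 entries holding 0, 1, ..., 255 the function returns 32640, the full sum. With a size that is not a power of two some slots are never folded in: for the six values 0..5 it returns 8 instead of 15, which is why the block size has to be a power of two for this scheme.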
CUDA Study Notes 8: Ripple Effect
#include <stdio.h>
#include "common/cpu_anim.h"
#define DIM 1024
struct DataBlock {
unsigned char *dev_bitmap;
CPUAnimBitmap * bitmap;
};
void cleanup( DataBlock *d )
{
cudaFree( d->dev_bitmap );
}
__global__ void kernel( unsigned char *ptr, int ticks )
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
float fx = x - DIM/2;
float fy = y - DIM/2;
float d = sqrtf(fx * fx + fy * fy);
unsigned char grey = (unsigned char)(128.0f + 127.0f *
cos(d/10.0f - ticks/7.0f) /
(d/10.0f + 1.0f));
ptr[offset*4 + 0] = grey;
ptr[offset*4 + 1] = grey;
ptr[offset*4 + 2] = grey;
ptr[offset*4 + 3] = 255;
}
void generate_frame( DataBlock *d, int ticks )
{
dim3 blocks(DIM/16, DIM/16);
dim3 threads(16, 16);
kernel<<<blocks, threads>>>( d->dev_bitmap, ticks );
cudaMemcpy( d->bitmap->get_ptr(), d->dev_bitmap, d->bitmap->image_size(), cudaMemcpyDeviceToHost );
}
int main()
{
DataBlock data;
CPUAnimBitmap bitmap( DIM, DIM, &data );
data.bitmap = &bitmap;
cudaMalloc( (void**)&data.dev_bitmap, bitmap.image_size() );
bitmap.anim_and_exit( (void (*)(void*, int))generate_frame, (void (*)(void*))cleanup );
}
- Key points
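The grey level that kernel() writes depends only on the pixel's distance from the image center and on the frame counter ticks. As a rough host-side C++ sketch of that formula (ripple_grey and kDim are names introduced here, not part of the listing):

```cpp
#include <cmath>

constexpr int kDim = 1024;  // same image size as DIM in the listing

// Grey level of pixel (x, y) at animation frame `ticks`: a cosine wave in the
// distance d from the image center, damped by 1 / (d/10 + 1).
unsigned char ripple_grey(int x, int y, int ticks) {
    float fx = static_cast<float>(x - kDim / 2);
    float fy = static_cast<float>(y - kDim / 2);
    float d = std::sqrt(fx * fx + fy * fy);
    float v = 128.0f + 127.0f * std::cos(d / 10.0f - ticks / 7.0f) / (d / 10.0f + 1.0f);
    return static_cast<unsigned char>(v);
}
```

Because the cosine term is divided by a value of at least 1, the result stays between 1 and 255: the center pixel at ticks = 0 is 255, and pixels far from the center settle close to the mid-grey 128.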
CUDA Study Notes 7: Splitting Work across Parallel Thread Blocks
#include <stdio.h>
#define N ( 33 + 1024 )
__global__ void add( int *a, int *b, int *c)
{
// Process the element at this index, then stride forward by the total number of threads
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N)
{
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
int main()
{
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// Allocate memory on the GPU
cudaMalloc((void **) &dev_a, N * sizeof(int));
cudaMalloc((void **) &dev_b, N * sizeof(int));
cudaMalloc((void **) &dev_c, N * sizeof(int));
// Fill arrays a and b on the CPU
for (int i=0; i<N; i++)
{
a[i] = -i;
b[i] = i * i;
}
// Copy the arrays from the host to the device
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
// Launch the kernel
add<<<(N+127)/128,128>>>(dev_a, dev_b, dev_c);
// Copy the result from the device to the host
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
// Print the results
for (int i=0; i<N; i++)
printf( "%d + %d = %d\n", a[i],b[i],c[i]);
// Free the device memory
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
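- Key points:
- In <<<B, T>>>, B is the number of thread blocks and T is the number of threads per block: the CUDA runtime launches B blocks of T threads each;
- The corresponding index is computed as: tid = threadIdx.x + blockIdx.x * blockDim.x;
- (N+127)/128 rounds the block count up. Plain integer division N/128 truncates when N is not a multiple of 128, which would launch fewer threads than there are elements; (N+127)/128 launches at least N threads, which is why the tid < N check is essential;
- On the devices targeted here CUDA limits each dimension of the grid to 65535 blocks, and (N+127)/128 may exceed that. One fix is to launch a fixed number of blocks, for example <<<128, 128>>>, and let each thread process several elements one after another in a loop inside the kernel:
while (tid < N) { c[tid] = a[tid] + b[tid]; tid += blockDim.x * gridDim.x; }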
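Both the rounded-up block count and the striding loop can be checked on the CPU. The following host-side C++ sketch emulates the launch with nested loops; blocks_for and grid_stride_visits are hypothetical helper names, and the emulation only counts how often each element would be processed.

```cpp
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (the same rounding-up as (N+127)/128 in the listing).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host emulation of the striding loop: every (block, thread) pair starts at
// its global index and advances by the total number of launched threads.
// Returns how many times each of the n elements was visited.
std::vector<int> grid_stride_visits(int n, int blocks, int threads_per_block) {
    std::vector<int> visits(n, 0);
    const int stride = blocks * threads_per_block;  // blockDim.x * gridDim.x
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            for (int tid = t + b * threads_per_block; tid < n; tid += stride) {
                ++visits[tid];
            }
        }
    }
    return visits;
}
```

For N = 33 + 1024 = 1057 and 128 threads per block, blocks_for gives 9 blocks. Even with only 4 blocks of 128 threads, the striding loop visits every one of the 1057 elements exactly once.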
CUDA Study Notes 6: Julia Set
#include <stdio.h>
#include "common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
__device__ float magnitude2( void ) { return r * r + i * i; }
__device__ cuComplex operator*( const cuComplex& a ) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+( const cuComplex& a ) {
return cuComplex( r+a.r, i+a.i );
}
};
__device__ int julia( int x, int y )
{
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
for (int i=0; i<200; i++)
{
a = a * a + c;
if (a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel( unsigned char * ptr )
{
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
int juliaValue = julia( x, y );
ptr[offset * 4 + 0] = 255 * juliaValue;
ptr[offset * 4 + 1] = 0;
ptr[offset * 4 + 2] = 0;
ptr[offset * 4 + 3] = 255;
}
int main()
{
CPUBitmap bitmap( DIM, DIM );
unsigned char * dev_bitmap;
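// Allocate memory on the GPU
cudaMalloc((void**)&dev_bitmap, bitmap.image_size());
// Declare a two-dimensional grid
dim3 grid(DIM, DIM);
// Pass the dim3 variable to the CUDA runtime
kernel<<<grid, 1>>>(dev_bitmap);
cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);
// Free device memory before display_and_exit(), which is not expected to return
cudaFree(dev_bitmap);
bitmap.display_and_exit();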
return 0;
}
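- Key points:
- dim3 is a type defined in the CUDA headers that holds three dimensions and can be used to specify the number of blocks to launch; in dim3 grid(DIM, DIM) the third dimension is left unspecified and defaults to 1;
- In a two-dimensional grid of blocks the index is computed as: x = blockIdx.x; y = blockIdx.y; offset = x + y * gridDim.x;
- A function marked with the __device__ qualifier can be called from other __device__ or __global__ functions;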
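The per-pixel test in julia() does not depend on anything GPU-specific, so it can be reproduced on the CPU. This is a minimal host-side C++ sketch of the same iteration; Complex, mul, add and kDim are names picked for this example, with the constant c = -0.8 + 0.156i, the scale 1.5 and the escape threshold 1000 taken from the listing.

```cpp
constexpr int kDim = 1000;  // same image size as DIM in the listing

struct Complex {
    float r, i;
    float magnitude2() const { return r * r + i * i; }
};

Complex mul(const Complex& a, const Complex& b) {
    return Complex{a.r * b.r - a.i * b.i, a.i * b.r + a.r * b.i};
}

Complex add(const Complex& a, const Complex& b) {
    return Complex{a.r + b.r, a.i + b.i};
}

// Returns 1 if pixel (x, y) is treated as inside the Julia set (the orbit
// stays bounded for 200 iterations), and 0 if the orbit escapes.
int julia(int x, int y) {
    const float scale = 1.5f;
    float jx = scale * static_cast<float>(kDim / 2 - x) / (kDim / 2);
    float jy = scale * static_cast<float>(kDim / 2 - y) / (kDim / 2);
    const Complex c{-0.8f, 0.156f};
    Complex a{jx, jy};
    for (int i = 0; i < 200; ++i) {
        a = add(mul(a, a), c);
        if (a.magnitude2() > 1000.0f) return 0;
    }
    return 1;
}
```

The corner pixel (0, 0) maps to 1.5 + 1.5i and escapes after three iterations, so julia(0, 0) is 0. Since the iteration squares a before anything else, a point and its negation behave identically, which makes the image symmetric about its center.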
CUDA Study Notes 5: Vector Addition
#include <stdio.h>
#define N 10
__global__ void add( int *a, int *b, int *c)
{
// Process the element at this block's index
int tid = blockIdx.x;
if(tid < N)
c[tid] = a[tid] + b[tid];
}
int main()
{
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// Allocate memory on the GPU
cudaMalloc((void **) &dev_a, N * sizeof(int));
cudaMalloc((void **) &dev_b, N * sizeof(int));
cudaMalloc((void **) &dev_c, N * sizeof(int));
// Fill arrays a and b on the CPU
for (int i=0; i<N; i++)
{
a[i] = -i;
b[i] = i * i;
}
// Copy the arrays from the host to the device
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
// Launch the kernel
add<<<N,1>>>(dev_a, dev_b, dev_c);
// Copy the result from the device to the host
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
// Print the results
for (int i=0; i<N; i++)
printf( "%d + %d = %d\n", a[i],b[i],c[i]);
// Free the device memory
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
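- Key points:
- In <<<N, 1>>>, N is the number of parallel thread blocks the device uses to run the kernel; this set of parallel blocks is called a grid;
- blockIdx is a built-in variable predefined by the CUDA runtime; it holds the index of the block that is currently executing the device code;
- blockIdx has more than one component (blockIdx.x, blockIdx.y), because in many situations a multi-dimensional index is more convenient than a one-dimensional one;
- Checking tid < N is a common way to guard against out-of-bounds memory accesses.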
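The role of the tid < N check is easiest to see when more blocks are launched than there are elements. The following host-side C++ sketch emulates add<<<launched_blocks, 1>>> with a plain loop, one iteration per block; add_one_block_per_element is a hypothetical name, and the array length plays the role of N.

```cpp
#include <vector>

// Host emulation of add<<<launched_blocks, 1>>>: each loop iteration plays
// the role of one block, and the guard mirrors `if (tid < N)` in the kernel.
// Returns the number of blocks that actually wrote a result.
int add_one_block_per_element(const std::vector<int>& a, const std::vector<int>& b,
                              std::vector<int>& c, int launched_blocks) {
    const int n = static_cast<int>(c.size());
    int writes = 0;
    for (int block_idx = 0; block_idx < launched_blocks; ++block_idx) {
        int tid = block_idx;  // blockIdx.x in the kernel
        if (tid < n) {        // blocks past the end of the arrays do nothing
            c[tid] = a[tid] + b[tid];
            ++writes;
        }
    }
    return writes;
}
```

With 10 elements and 16 emulated blocks, only the first 10 blocks write; without the guard, the remaining 6 would index past the end of the arrays.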
