Configuring OpenCV 2.3 with CUDA 4.0 + VS2008 on 32-bit Windows 7

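  1. Download the required files
    1. Download the OpenCV 2.3 files from SourceForge:
      • OpenCV-2.3.0-win-superpack.exe contains both the source code and prebuilt binaries; this tutorial uses the superpack;
      • OpenCV-2.3.0-win-src.zip contains the source code only;
      • OpenCV-2.3.0-gpu-support-win32-vs2008.zip is the GPU-support add-on for OpenCV-2.3.0-win-superpack.exe; download it if you want to use the superpack's prebuilt binaries with CUDA support.
    2. Run OpenCV-2.3.0-win-superpack.exe to extract the OpenCV2.3 folder. Its build folder holds the prebuilt binaries and its opencv folder holds the source code;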
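  2. Build from source (skip this step if you use the superpack's prebuilt binaries)
    1. Download and install CMake: http://www.cmake.org/
    2. Run CMake (cmake-gui). I used CMake 2.8.7; the layout may differ slightly between versions. Fill in the source location and the build output location: PIC1
    3. Click the Configure button at the bottom and choose the compiler: PIC2 Click Finish and wait for configuration to complete;
    4. When configuration completes, set the build options: PIC3 If you need CUDA 4.0 support, make sure WITH_CUDA is checked: PIC4 Leave the other options at their defaults, or adjust them to your needs;
    5. Click Generate. An OpenCV project is created in the build output folder entered in sub-step 2. Open OpenCV.sln, build the solution in both the Debug and Release configurations, and wait for the builds to finish;
    6. When the builds finish, the bin and lib folders contain the Debug and Release DLL, EXE and LIB files. This completes the source build.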
  3. Configure OpenCV 2.3
    1. Create an OpenCV23 folder, for example D:\OpenCV23, and copy the build folder extracted from the superpack into it;
    2. If you built from source, copy the Debug and Release DLL and EXE files from step 2.6 over D:\OpenCV23\build\x86\vc9\bin, and the LIB files over D:\OpenCV23\build\x86\vc9\lib. If you use the superpack's prebuilt binaries, copy the DLL and LIB files from OpenCV-2.3.0-gpu-support-win32-vs2008.zip into those two directories respectively, replacing files of the same name;
    3. Add D:\OpenCV23\build\x86\vc9\bin to the PATH environment variable;
    4. In VS, under Tools -> Options -> VC++ Directories, add to Include files:

      D:\OpenCV23\build\include\opencv
      D:\OpenCV23\build\include

      and add to Library files:

      D:\OpenCV23\build\x86\vc9\lib

    5. In each project you create, add the following Additional Dependencies.

      DEBUG:

      D:\OpenCV23\build\x86\vc9\lib\opencv_core230d.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_highgui230d.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_video230d.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_ml230d.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_legacy230d.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_imgproc230d.lib

      RELEASE:

      D:\OpenCV23\build\x86\vc9\lib\opencv_core230.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_highgui230.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_video230.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_ml230.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_legacy230.lib
      D:\OpenCV23\build\x86\vc9\lib\opencv_imgproc230.lib

    6. This completes the configuration of OpenCV 2.3 with CUDA 4.0 + VS2008 on 32-bit Windows 7.
  4. Problems encountered

While compiling OpenCV 2.3 code with CUDA 4.0, 7rack ran into errors such as the following:

1>D:\OpenCV2.3\build\include\opencv2/core/operations.hpp(1265): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/index_testing.h(144): warning: variable "p1" was set but never used
1>D:\OpenCV2.3\build\include\opencv2/flann/index_testing.h(221): warning: variable "p1" was set but never used
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(359): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(394): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(432): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: this operator is not allowed in a template argument expression
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: expression must have a constant value
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(464): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(519): error: expected a ">"
1>D:\OpenCV2.3\build\include\opencv2/flann/autotuned_index.h(526): error: expected a ">"

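These errors are reported in the OpenCV 2.3 headers and are caused by the way the nvcc compiler parses these expressions. The fix is to wrap each operand of the comparison in parentheses. Taking operations.hpp(1265) above as an example:

Line 1265 of operations.hpp:

CV_DbgAssert( this->idx < cn ); becomes: CV_DbgAssert( (this->idx) < (cn) );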

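Installing Ubuntu 11.10 from the hard disk under Windows 7

  1. Download the Ubuntu 11.10 image (this tutorial also applies to 10.xx);
  2. The hard disk needs 20 GB or more of unallocated space for Ubuntu (how much to give Ubuntu is up to you). If there is none, open Disk Management and use Shrink Volume to carve unallocated space out of an existing partition;
  3. Put the image on the C: drive (strictly speaking, the Windows 7 system drive; this article assumes it is C:), then extract the two files initrd.lz and vmlinuz from the casper folder inside the image and place them in the root of C:;
  4. Download EasyBCD (official website), install and run it, choose Add New Entry -> NeoGrub -> Install -> Configure, and add the following text to the menu.lst that opens: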
    title Install Ubuntu 11.10
    root (hd0,0)
    kernel (hd0,0)/vmlinuz boot=casper iso-scan/filename=/ubuntu-11.10-desktop-i386.iso ro quiet splash locale=zh_CN.UTF-8
    initrd (hd0,0)/initrd.lz
    

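    Here the file name in

    filename=/ubuntu-11.10-desktop-i386.iso
    

    must match the name of the image placed on the C: drive in step 3 exactly (including case and the .iso extension). Save menu.lst and exit;

  5. Restart the computer, choose NeoGrub Bootloader, then choose Install Ubuntu 11.10 to boot into the Ubuntu installation environment;
  6. Before starting the Ubuntu installer, open a terminal and run: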
    sudo umount -l /isodevice
    

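    This command unmounts the C: partition that holds the ISO image, so that the installer can work on the disk;

  7. Install Ubuntu. You can choose the "install alongside Windows 7" option and let Ubuntu partition the disk itself, or partition manually;
  8. Restart after the installation. The boot menu should show four Ubuntu entries and one Windows 7 entry. If the Windows 7 entry is missing, boot into Ubuntu and run in a terminal: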
    
    sudo update-grub
    

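    to restore the Windows 7 entry;

  9. Restart into Windows 7, open EasyBCD, click Edit Boot Menu and delete the NeoGrub Bootloader entry;
  10. This completes the hard-disk installation of Ubuntu under Windows 7.

  • To remove Ubuntu later, delete its partitions with a partitioning tool and then restore the MBR of the Windows 7 system disk; this removes Ubuntu completely.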

CUDA Study Notes 11: Constant Memory and Events


#include <stdio.h>
#include <stdlib.h>
#include "common/cpu_bitmap.h"

// Switch: define CONSTANT to use __constant__ memory
#define CONSTANT

#define INF 2e10f
#define rnd(x) (x*rand()/RAND_MAX)
#define SPHERES 200
#define DIM 800

// Sphere structure
struct Sphere {
    float r, g, b; // colour of the sphere
    float radius; // radius of the sphere
    float x, y, z; // centre of the sphere

    // Tests whether the ray fired from pixel (ox, oy) hits this sphere;
    // returns the z coordinate of the hit point, or -INF on a miss
    __device__ float hit(float ox, float oy, float *n)
    {
        float dx = ox - x;
        float dy = oy - y;
        if (dx*dx + dy*dy < radius*radius)
        {
            float dz = sqrtf(radius*radius - dx*dx - dy*dy);
            *n = dz/sqrtf(radius*radius);
            return dz + z;
        }
        return -INF;
    }
};
#ifdef CONSTANT
// __constant__ memory
__constant__ Sphere s[SPHERES];
#else
Sphere *s;
#endif

#ifdef CONSTANT
// With __constant__ memory the array is visible to the kernel directly,
// so it is not passed to the __global__ function as a parameter
__global__ void kernel(unsigned char * ptr)
#else
// A pointer to ordinary global memory must be passed to the __global__ function as a parameter
__global__ void kernel(unsigned char * ptr, Sphere *s)
#endif
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    int offset = x + y * blockDim.x * gridDim.x;
    float ox = (x - DIM/2);
    float oy = (y - DIM/2);

    float r=0,g=0,b=0;
    // Find the hit point closest to the viewer
    float maxz = -INF;
    for (int i=0; i<SPHERES; i++)
    {
        float n;
        float t = s[i].hit(ox, oy, &n);
        if (t>maxz)
        {
            float fscale = n;
            r = s[i].r * fscale;
            g = s[i].g * fscale;
            b = s[i].b * fscale;
            maxz = t;
        }
    }

    ptr[offset*4 + 0] = (int)(r*255);
    ptr[offset*4 + 1] = (int)(g*255);
    ptr[offset*4 + 2] = (int)(b*255);
    ptr[offset*4 + 3] = 255;
}

int main(void)
{
    // Use CUDA events to measure performance
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    CPUBitmap bitmap(DIM, DIM);
    unsigned char * dev_bitmap;

    cudaMalloc((void**) &dev_bitmap, bitmap.image_size());

#ifdef CONSTANT
    // __constant__ memory needs no dynamic allocation
#else
    // Allocate GPU memory for the sphere array
    cudaMalloc((void**) &s, sizeof(Sphere) * SPHERES);
#endif

    // Generate the sphere data on the CPU
    Sphere *temp_s = (Sphere *)malloc( sizeof(Sphere) * SPHERES);
    for (int i=0; i<SPHERES; i++)
    {
        temp_s[i].r = rnd(1.0f);
        temp_s[i].g = rnd(1.0f);
        temp_s[i].b = rnd(1.0f);

        temp_s[i].x = rnd(1000.f) - 500;
        temp_s[i].y = rnd(1000.f) - 500;
        temp_s[i].z = rnd(1000.f) - 500;

        temp_s[i].radius = rnd(100.f) + 20;
    }

#ifdef CONSTANT
    // Copy from the CPU to __constant__ memory
    cudaMemcpyToSymbol(s, temp_s, sizeof(Sphere) * SPHERES);
#else
    // Copy from the CPU to the GPU
    cudaMemcpy(s, temp_s, sizeof(Sphere) * SPHERES, cudaMemcpyHostToDevice);
#endif

    free(temp_s);

    dim3 grids(DIM/16, DIM/16);
    dim3 threads(16, 16);

#ifdef CONSTANT
    kernel<<<grids, threads>>>(dev_bitmap);
#else
    kernel<<<grids, threads>>>(dev_bitmap, s);
#endif

    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);
    // Wait until the stop event has actually been recorded
    cudaEventSynchronize(stop);

    float elapseTime;
    cudaEventElapsedTime(&elapseTime, start, stop);
    printf("Time to generate: %3.1f ms\n", elapseTime);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    bitmap.display_and_exit();

    cudaFree(dev_bitmap);
#ifdef CONSTANT
    // __constant__ memory needs no free
#else
    cudaFree(s);
#endif

    return 0;
}
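  • Key points:
  1. The __constant__ qualifier declares a variable in constant memory;
  2. Constant memory is allocated statically, so there is no need to call cudaMalloc() or cudaFree();
  3. A CUDA event is essentially a GPU timestamp recorded at a point in time chosen by the user;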
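
As a rough check on key point 2, the host-side sketch below works out how much constant memory the statically allocated sphere array needs. `SphereHost`, `kConstantMemoryBytes` and `fits_in_constant_memory` are hypothetical names introduced here, and the 64 KB figure is the total constant-memory size usually quoted for CUDA devices; treat it as an assumption and check your own hardware.

```cpp
#include <cassert>
#include <cstddef>

// Host-side mirror of the Sphere struct in the listing above
// (hypothetical name): seven floats, no member functions needed here.
struct SphereHost {
    float r, g, b;
    float radius;
    float x, y, z;
};

// Assumed total size of __constant__ memory (64 KB); verify for your device.
const std::size_t kConstantMemoryBytes = 64 * 1024;

// Returns true if `count` spheres fit into the assumed constant-memory budget.
bool fits_in_constant_memory(std::size_t count)
{
    assert(count > 0);
    return count * sizeof(SphereHost) <= kConstantMemoryBytes;
}
```

With SPHERES = 200 the array takes 200 × 28 = 5600 bytes, well under the assumed limit, which is why a fixed-size `__constant__ Sphere s[SPHERES]` declaration is enough here.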

CUDA Study Notes 10: A Bitmap Using Shared Memory


#include <stdio.h>
#include "common/cpu_bitmap.h"

#define DIM 512
#define PI 3.1415926535897932f

__global__ void kernel( unsigned char * ptr )
{
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;

 int offset = x + y * blockDim.x * gridDim.x;

 // Shared memory
 __shared__ float shared[16][16];

 const float period = 128.0f;

 // Write this thread's value into shared memory
 shared[threadIdx.x][threadIdx.y] = 255*(sinf(x*2.0f*PI/period)+1.0f)*
          (sinf(y*2.0f*PI/period)+1.0f)/4.0f;

 // Synchronize the threads of this block
 __syncthreads();

 ptr[ offset*4 + 0 ] = 0;
 ptr[ offset*4 + 1 ] = shared[15-threadIdx.x][15-threadIdx.y];
 ptr[ offset*4 + 2 ] = 0;
 ptr[ offset*4 + 3 ] = 255;
}

int main()
{
 CPUBitmap bitmap( DIM, DIM );
 unsigned char *dev_bitmap;

 cudaMalloc( (void**) &dev_bitmap, bitmap.image_size() );

 dim3 grids( DIM/16, DIM/16 );
 dim3 threads( 16,16 );

 kernel<<<grids, threads>>>( dev_bitmap );

 cudaMemcpy( bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost );

 bitmap.display_and_exit();

 cudaFree(dev_bitmap);

 return 0;
}
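  • Key points:
  1. Thread synchronization costs some performance, but omitting it where it is needed gives wrong results, so synchronize whenever it is required. As a rule, add a __syncthreads() after writing to shared memory and before other threads read it;
  2. If this line in kernel():
     ptr[ offset*4 + 1 ] = shared[15-threadIdx.x][15-threadIdx.y];

    is changed to:

     ptr[ offset*4 + 1 ] = shared[threadIdx.x][threadIdx.y];

    then no synchronization is needed, because each thread reads back only the element it wrote itself.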
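
To make the role of the barrier concrete, here is a small host-side emulation of one 16x16 block. It is only a sketch: the "threads" run one after another in a fixed order, which is just one possible interleaving and not how a GPU actually schedules them, and the function name `count_stale_reads` is made up for this illustration.

```cpp
const int TILE = 16;  // matches the 16x16 thread block in the listing above

// Emulates one block on the host. With use_barrier == true every thread
// finishes its write before any thread reads (what __syncthreads()
// guarantees). With false, each thread writes and then reads immediately.
// `mirrored` selects the shared[15-tx][15-ty] read; otherwise the thread
// reads its own slot. Returns how many threads read a slot that had not
// been written yet.
int count_stale_reads(bool use_barrier, bool mirrored)
{
    bool written[TILE][TILE] = {};
    int stale = 0;

    if (use_barrier) {
        // Phase 1: all writes complete before the barrier.
        for (int tx = 0; tx < TILE; ++tx)
            for (int ty = 0; ty < TILE; ++ty)
                written[tx][ty] = true;
    }

    for (int tx = 0; tx < TILE; ++tx) {
        for (int ty = 0; ty < TILE; ++ty) {
            written[tx][ty] = true;  // this thread's own write
            int rx = mirrored ? TILE - 1 - tx : tx;
            int ry = mirrored ? TILE - 1 - ty : ty;
            if (!written[rx][ry])
                ++stale;  // read happened before the other thread's write
        }
    }
    return stale;
}
```

In this emulation the mirrored read without a barrier sees unwritten data for half of the threads, while reading one's own slot never does, which matches the two cases discussed in key point 2.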

CUDA Study Notes 9: Dot Product (Shared Memory and Synchronization)


#include <stdio.h>
#define min(a, b) ( (a) < (b) ? (a) : (b) )
#define sum_squares(x) ((x)*((x)+1)*(2*(x)+1)/6)

const int N = 33 * 1024 ;
// Number of threads per block
const int threadPerBlock = 256;
// Number of blocks per grid
const int blocksPerGrid = min( 32, ( (N + threadPerBlock - 1) / threadPerBlock ) );

// Dot product
__global__ void dot( float *a, float *b, float *c )
{
 // Global index of the current thread across the whole grid
 int tid = threadIdx.x + blockIdx.x * blockDim.x;
 // Shared memory of the current block, used as a cache for partial sums
 __shared__ float cache[threadPerBlock];
 // Cache index: the thread's index within its block
 int cacheIndex = threadIdx.x;
 
 // Accumulate this thread's partial sum
 float temp = 0;
 while ( tid < N )
 {
  temp += a[tid] * b[tid];
  tid += blockDim.x * gridDim.x;
 }
 // Store the partial sum in the cache
 cache[cacheIndex] = temp;
 // Synchronize the threads of this block
 __syncthreads();

 // Reduction: sum the cached values of this block
 // (requires blockDim.x to be a power of two)
 int i = blockDim.x / 2;
 while( i != 0 )
 {
  if ( cacheIndex < i )
   cache[cacheIndex] += cache[cacheIndex + i];
  // Synchronize threads
  __syncthreads();
  i/=2;
 }

 // Store the block's total
 if (cacheIndex == 0)
  c[blockIdx.x] = cache[0];
}

int main()
{
 float *a, *b, *c, ans;
 float *dev_a, *dev_b, *dev_c;

 a = new float[N];
 b = new float[N];
 c = new float[blocksPerGrid];

 // Initialize arrays a and b on the CPU
 for (int i=0; i<N; i++)
 {
  a[i] = i;
  b[i] = i;
 }

 // Allocate memory on the GPU
 cudaMalloc( (void**)&dev_a, N * sizeof(float) );
 cudaMalloc( (void**)&dev_b, N * sizeof(float) );
 cudaMalloc( (void**)&dev_c, blocksPerGrid * sizeof(float) );

 // Copy arrays a and b to the GPU
 cudaMemcpy( dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice );
 cudaMemcpy( dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice );

 // Compute
 dot<<<blocksPerGrid, threadPerBlock>>>(dev_a, dev_b, dev_c);

 // Copy array c from the GPU back to the CPU
 cudaMemcpy( c, dev_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost );

 // Finish the summation on the CPU
 ans = 0;
 for (int i = 0; i<blocksPerGrid; i++)
  ans += c[i];

 // Verify the result against the closed form for the sum of squares
 printf("the CUDA ans =%.6g; the CPU ans = %.6g\n", ans, sum_squares((float)(N - 1)));

 // Free GPU memory
 cudaFree(dev_a);
 cudaFree(dev_b);
 cudaFree(dev_c);

 // Free CPU memory
 delete [] a;
 delete [] b;
 delete [] c;

 return 0;
}
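  • Key points:
  1. Threads in the same block can share memory and synchronize with each other. The __shared__ keyword places a variable in shared memory, and __syncthreads() synchronizes the threads of a block;
  2. Thread synchronization in the CUDA architecture means that no thread can execute an instruction after __syncthreads() until every thread in the block has executed __syncthreads(). For example, suppose that in dot() the __syncthreads() call is moved inside the if statement:
    if(cacheIndex < i){
    cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    }

    This looks faster, because threads that fail the condition skip the body and do not have to synchronize. But it causes thread divergence (some threads execute an instruction while others do not): some threads never reach __syncthreads(), so the GPU keeps waiting for them and the program hangs.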
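
The in-block reduction can be checked on the host with a short emulation. This is a sketch rather than the kernel itself: `block_reduce` is a hypothetical helper, the vector size stands in for blockDim.x, and the inner loop replaces the parallel additions that the GPU separates with __syncthreads().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host emulation of the tree reduction used inside dot().
// cache.size() plays the role of blockDim.x and must be a power of two,
// otherwise halving the stride would skip elements.
float block_reduce(std::vector<float> cache)
{
    const std::size_t n = cache.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    for (std::size_t i = n / 2; i != 0; i /= 2) {
        // On the GPU these additions run in parallel and are followed by
        // __syncthreads(); here they run in a plain loop.
        for (std::size_t idx = 0; idx < i; ++idx)
            cache[idx] += cache[idx + i];
    }
    return cache[0];
}
```

Each pass halves the number of active slots, so 256 cached values are summed in 8 passes instead of 255 sequential additions.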

 

CUDA Study Notes 8: Ripple Effect


#include <stdio.h>
#include "common/cpu_anim.h"

#define DIM 1024

struct DataBlock {
 unsigned char *dev_bitmap;
 CPUAnimBitmap * bitmap;
};

void cleanup( DataBlock *d )
{
 cudaFree( d->dev_bitmap );
}

__global__ void kernel( unsigned char *ptr, int ticks )
{
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;
 int offset = x + y * blockDim.x * gridDim.x;

 float fx = x - DIM/2;
 float fy = y - DIM/2;
 float d = sqrtf(fx * fx + fy * fy);
 unsigned char grey = (unsigned char)(128.0f + 127.0f *
          cos(d/10.0f - ticks/7.0f) /
          (d/10.0f + 1.0f));
 ptr[offset*4 + 0] = grey;
 ptr[offset*4 + 1] = grey;
 ptr[offset*4 + 2] = grey;
 ptr[offset*4 + 3] = 255;

}

void generate_frame( DataBlock *d, int ticks )
{
 dim3 blocks(DIM/16, DIM/16);
 dim3 threads(16, 16);

 kernel<<<blocks, threads>>>( d->dev_bitmap, ticks );

 cudaMemcpy( d->bitmap->get_ptr(), d->dev_bitmap, d->bitmap->image_size(), cudaMemcpyDeviceToHost );
}

int main()
{
 DataBlock data;
 CPUAnimBitmap bitmap( DIM, DIM, &data );
 data.bitmap = &bitmap;
 cudaMalloc( (void**)&data.dev_bitmap, bitmap.image_size() );
 bitmap.anim_and_exit( (void (*)(void*, int))generate_frame, (void (*)(void*))cleanup );
}
  • Key points:
  1. The variable blocks gives the number of parallel thread blocks in the grid, and the variable threads gives the number of threads in each block;
  2. threadIdx.x ∈ [0, blockDim.x), blockIdx.x ∈ [0, gridDim.x);
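
The index arithmetic and the grey-level formula from the kernel can be tried out on the host. The sketch below assumes the same DIM as the listing; `pixel_offset` and `ripple_grey` are hypothetical helper names, and the thread and block indices are passed in as plain integers instead of coming from threadIdx and blockIdx.

```cpp
#include <cmath>

const int DIM = 1024;  // same image size as the listing above

// Linear pixel offset for a thread, given its 2D thread index (tx, ty),
// block index (bx, by), block size (bdx, bdy) and grid width gdx.
int pixel_offset(int tx, int ty, int bx, int by, int bdx, int bdy, int gdx)
{
    int x = tx + bx * bdx;
    int y = ty + by * bdy;
    return x + y * bdx * gdx;
}

// Grey level of pixel (x, y) at animation time `ticks`, using the same
// formula as the kernel: a cosine wave damped with distance from the centre.
unsigned char ripple_grey(int x, int y, int ticks)
{
    float fx = x - DIM / 2;
    float fy = y - DIM / 2;
    float d = std::sqrt(fx * fx + fy * fy);
    return (unsigned char)(128.0f + 127.0f *
                           std::cos(d / 10.0f - ticks / 7.0f) /
                           (d / 10.0f + 1.0f));
}
```

With 16x16 blocks and a 64x64 grid, the last thread of the last block maps to offset 1024 × 1024 − 1, so the launch covers the image exactly.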

CUDA Study Notes 7: Splitting Parallel Thread Blocks

#include <stdio.h>
#define N ( 33 + 1024 )

__global__ void add( int *a, int *b, int *c)
{
 // Compute the data at this index
 int tid = threadIdx.x + blockIdx.x * blockDim.x;
 while(tid < N)
 {
  c[tid] = a[tid] + b[tid];
  tid += blockDim.x * gridDim.x;
 }
}

int main()
{
 int a[N], b[N], c[N];
 int *dev_a, *dev_b, *dev_c;

 // Allocate memory on the GPU
 cudaMalloc((void **) &dev_a, N * sizeof(int));
 cudaMalloc((void **) &dev_b, N * sizeof(int));
 cudaMalloc((void **) &dev_c, N * sizeof(int));

 // Fill arrays a and b on the CPU
 for (int i=0; i<N; i++)
 {
  a[i] = -i;
  b[i] = i * i;
 }

 // Copy the arrays from the host to the device
 cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
 cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

 // Compute
 add<<<(N+127)/128,128>>>(dev_a, dev_b, dev_c);

 // Copy the result from the device to the host
 cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

 // Print the results
 for (int i=0; i<N; i++)
  printf( "%d + %d = %d\n", a[i],b[i],c[i]);

 // Free the device memory
 cudaFree(dev_a);
 cudaFree(dev_b);
 cudaFree(dev_c);

 return 0;
}
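  • Key points:
  1. In <<<N, M>>>, N is the number of thread blocks and M is the number of threads in each block; that is, the CUDA runtime launches N blocks of M threads each;
  2. The corresponding index is computed as: tid = threadIdx.x + blockIdx.x * blockDim.x;
  3. (N+127)/128 rounds the block count up. Plain integer division N/128 truncates when N is not a multiple of 128 and would launch fewer threads than needed. Because (N+127)/128 is greater than or equal to N/128, the launch may contain more threads than elements, so the tid < N check is essential;
  4. CUDA limits each dimension of the grid to 65535 blocks, and (N+127)/128 may exceed that. One solution is to fix the grid size, for example <<<128, 128>>>, and then use
    while(tid<N){tid += blockDim.x * gridDim.x}

    in the kernel, so that each thread processes several elements serially.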
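
Key points 3 and 4 can be sanity-checked on the host. The sketch below emulates the launch with nested loops; `blocks_for` and `visit_counts` are hypothetical helpers, and the loops only imitate the index arithmetic, not real parallel execution.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (integer division rounded up).
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Emulates add<<<blocks, threads>>> with the grid-stride loop and returns
// how many times each of the n elements was visited.
std::vector<int> visit_counts(int n, int blocks, int threads)
{
    assert(n >= 0 && blocks > 0 && threads > 0);
    std::vector<int> visits(n, 0);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads; ++t) {
            // Same arithmetic as the kernel: start at the global thread
            // index and advance by the total number of launched threads.
            for (int tid = t + b * threads; tid < n; tid += threads * blocks)
                ++visits[tid];
        }
    }
    return visits;
}
```

For N = 33 + 1024 = 1057 the rounded-up launch needs 9 blocks of 128 threads, and even a much smaller fixed grid such as 4 blocks still visits every element exactly once thanks to the stride loop.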

CUDA Study Notes 6: Julia Set


#include <stdio.h>
#include "common/cpu_bitmap.h"
#define DIM 1000

struct cuComplex {
 float r;
 float i;
 __device__ cuComplex( float a, float b ) : r(a), i(b) {}
 __device__ float magnitude2( void ) { return r * r + i * i; }
 __device__ cuComplex operator*( const cuComplex& a ) {
  return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
 }
 __device__ cuComplex operator+( const cuComplex& a ) {
  return cuComplex( r+a.r, i+a.i );
 }
};

__device__ int julia( int x, int y )
{
 const float scale = 1.5;
 float jx = scale * (float)(DIM/2 - x)/(DIM/2);
 float jy = scale * (float)(DIM/2 - y)/(DIM/2);

 cuComplex c(-0.8, 0.156);
 cuComplex a(jx, jy);

 for (int i=0; i<200; i++)
 {
  a = a * a + c;
  if (a.magnitude2() > 1000)
   return 0;
 }

 return 1;
}
__global__ void kernel( unsigned char * ptr )
{
 int x = blockIdx.x;
 int y = blockIdx.y;
 int offset = x + y * gridDim.x;

 int juliaValue = julia( x, y );
 ptr[offset * 4 + 0] = 255 * juliaValue;
 ptr[offset * 4 + 1] = 0;
 ptr[offset * 4 + 2] = 0;
 ptr[offset * 4 + 3] = 255;
}

int main()
{
 CPUBitmap bitmap( DIM, DIM );
 unsigned char * dev_bitmap;

 // Allocate memory on the GPU
 cudaMalloc((void**)&dev_bitmap, bitmap.image_size());
 
 // Declare a two-dimensional grid
 dim3 grid(DIM, DIM);
 // Pass the dim3 variable to the CUDA runtime
 kernel<<<grid, 1>>>(dev_bitmap);

 cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost);

 bitmap.display_and_exit();

 return 0;
}
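  • Key points:
  1. dim3 is a type defined in the CUDA headers that holds three dimensions and can be used to specify the number of thread blocks to launch. In dim3 grid(DIM, DIM) the third dimension defaults to 1;
  2. The index in a two-dimensional grid of blocks is computed as: x = blockIdx.x; y = blockIdx.y; offset = x + y * gridDim.x;
  3. A function marked with the __device__ qualifier can be called from other __device__ or __global__ functions;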
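
A CPU reference for the julia() device function is handy for checking individual pixels without a GPU. The sketch below follows the same arithmetic as the listing; `Complex` and `julia_cpu` are hypothetical host-side stand-ins for cuComplex and julia().

```cpp
const int DIM = 1000;  // same image size as the listing above

// Hypothetical host-side stand-in for cuComplex.
struct Complex {
    float r;
    float i;
};

// Returns 1 if pixel (x, y) stays bounded after 200 iterations of
// a = a * a + c, and 0 as soon as |a|^2 exceeds 1000.
int julia_cpu(int x, int y)
{
    const float scale = 1.5f;
    float jx = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float jy = scale * (float)(DIM / 2 - y) / (DIM / 2);

    Complex c = { -0.8f, 0.156f };
    Complex a = { jx, jy };

    for (int i = 0; i < 200; i++) {
        // Complex square: (r + i*j)^2 = (r^2 - i^2) + (2*r*i)*j
        Complex sq = { a.r * a.r - a.i * a.i, 2.0f * a.r * a.i };
        a.r = sq.r + c.r;
        a.i = sq.i + c.i;
        if (a.r * a.r + a.i * a.i > 1000.0f)
            return 0;
    }
    return 1;
}
```

The corner pixel (0, 0) maps to 1.5 + 1.5i, which escapes after a few iterations, so it is drawn black; because the iteration depends only on a squared, pixels that are mirror images through the centre get the same value.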

CUDA Study Notes 5: Vector Sum

#include <stdio.h>
#define N 10

__global__ void add( int *a, int *b, int *c)
{
	// Compute the data at this index
	int tid = blockIdx.x;
	if(tid < N)
		c[tid] = a[tid] + b[tid];
}

int main()
{
	int a[N], b[N], c[N];
	int *dev_a, *dev_b, *dev_c;

	// Allocate memory on the GPU
	cudaMalloc((void **) &dev_a, N * sizeof(int));
	cudaMalloc((void **) &dev_b, N * sizeof(int));
	cudaMalloc((void **) &dev_c, N * sizeof(int));

	// Fill arrays a and b on the CPU
	for (int i=0; i<N; i++)
	{
		a[i] = -i;
		b[i] = i * i;
	}

	// Copy the arrays from the host to the device
	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

	// Compute
	add<<<N,1>>>(dev_a, dev_b, dev_c);

	// Copy the result from the device to the host
	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

	// Print the results
	for (int i=0; i<N; i++)
		printf( "%d + %d = %d\n", a[i],b[i],c[i]);

	// Free the device memory
	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);

	return 0;
}
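  • Key points:
  1. In <<<N, 1>>>, N is the number of parallel thread blocks the device uses when executing the kernel; this collection of parallel blocks is called a grid;
  2. blockIdx is a built-in variable predefined by the CUDA runtime; it holds the index of the block that is currently executing the device code;
  3. blockIdx is a multi-dimensional index with components blockIdx.x and blockIdx.y (and blockIdx.z on hardware that supports three-dimensional grids), because multi-dimensional indices are more convenient than one-dimensional ones in many problems;
  4. Checking tid < N is a common way to prevent illegal memory accesses.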
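
Key point 4 can be illustrated with a host-side emulation of the <<<N, 1>>> launch in which more blocks are launched than there are elements. This is only a sketch: `add_one_block_per_element` is a hypothetical helper, and the loop variable plays the role of blockIdx.x.

```cpp
#include <cassert>
#include <vector>

// Emulates add<<<blocks, 1>>>: each "block" handles the element whose
// index equals its block index. The tid < n guard makes surplus blocks
// do nothing instead of writing past the end of the arrays.
std::vector<int> add_one_block_per_element(const std::vector<int>& a,
                                           const std::vector<int>& b,
                                           int blocks)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n, 0);

    for (int block_idx = 0; block_idx < blocks; ++block_idx) {
        int tid = block_idx;  // stands in for blockIdx.x
        if (tid < n)
            c[tid] = a[tid] + b[tid];
    }
    return c;
}
```

Launching 16 blocks over 10 elements leaves the extra 6 blocks idle; without the guard they would index past the end of a, b and c.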