CUDA issue in a simple program


I've spent so much time trying to find out what is going on. The problem is that I'm not able to invoke this simple kernel from my host code. I'm sure the error will be immediately obvious to some people, but I feel I'm wasting a lot of time for no good reason. So I'd really appreciate any help.

This is my .cpp code

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <windows.h>
#include <shrUtils.h>
#include <cutil_inline.h>
#include <cutil_gl_inline.h>
#include <cuda.h>


CUfunction reduce0;    //i've used many ways to declare my kernel function,but.....


int main( int argc , char *argv[] ){

    int i,N,sum;
    int *data;
    int *Md;
    srand ( time(NULL) );
    N=(int)pow((float)2,(float)atoi(argv[1]));
    data=(int *)malloc(N * sizeof(int));

    for (i=0;i<N;i++){
        data[i]=rand() % 10 + 1;    
    }
    cudaMalloc((void**) &Md, N );

    clock_t start = clock();

    dim3 dimBlock(512,0);
    dim3 dimGrid(1,1);

    reduce0<<< dimGrid,dimBlock >>>(Md,Md);    



    sum=0;
    for(i=0;i<N;i++){
        sum=sum+data[i];
    } 

    printf("Sum of the %d-array is %d \n", N , sum);  
    printf("Time elapsed: %f\n", ((double)clock() - start) / CLOCKS_PER_SEC);   

return 0;

}

and here is my .cu code

__global__ void reduce0(int *g_idata, int *g_odata){

    extern __shared__ int sdata[];

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];

    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if(tid % (2*s) == 0){
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if(tid == 0) g_odata[blockIdx.x] = sdata[0];
}

So my question is: what should I do to invoke the kernel? At compile time the "<<<" symbol isn't recognised, and as for reduce0(), it is only recognised if I declare it in the .cpp file! Please, someone help me finally get started with real CUDA things!


CUfunction is a driver API abstraction - not needed if you are going to use the language integration feature that enables the <<<>>> syntax of a kernel invocation.

If you don't have to use the driver API (and most people don't), just move your C++ code into the .cu file and invoke the kernel much as you are doing now.

The cudaMalloc() call allocates device memory that the CPU cannot read or write. You have to copy the input for the reduction into your device memory using cudaMemcpy(...,cudaMemcpyHostToDevice); then, after you are done processing, copy the output to host memory using cudaMemcpy(..., cudaMemcpyDeviceToHost);
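A minimal sketch of that allocate / copy / launch / copy-back sequence, assuming for simplicity that N is 512 so a single 512-thread block covers the input (the launch configuration, shared-memory size and the variable gpu_sum are illustrative, not taken from the question):

int *Md;
cudaMalloc((void**)&Md, N * sizeof(int));                        // allocate room for N ints on the device

cudaMemcpy(Md, data, N * sizeof(int), cudaMemcpyHostToDevice);   // copy the input to device memory

// one block of 512 threads, with dynamic shared memory for one int per thread
reduce0<<< 1, 512, 512 * sizeof(int) >>>(Md, Md);

int gpu_sum;
cudaMemcpy(&gpu_sum, Md, sizeof(int), cudaMemcpyDeviceToHost);   // copy the block's result back to the host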

P.S. That reduction kernel is very slow. I would recommend you open the reduction sample in the SDK and use one of the kernels from there.
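To give an idea of what the SDK kernels change, here is a sketch of the sequential-addressing variant from that sample, written from memory (the name reduce_seq is mine; it has the same launch signature and shared-memory requirement as reduce0):

__global__ void reduce_seq(int *g_idata, int *g_odata){

    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    sdata[tid] = g_idata[blockIdx.x*blockDim.x + threadIdx.x];
    __syncthreads();

    // halve the stride each step: no divergent (tid % (2*s)) branch and no shared-memory bank conflicts
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}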

Alternatively, use the Thrust library that will be included in CUDA 4.0. Thrust supports very fast and flexible reductions.
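For comparison, a sketch of the same sum done with Thrust; thrust::device_vector and thrust::reduce are standard parts of the library, the variable names are mine:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// copy the host array into device memory and sum it on the GPU
thrust::device_vector<int> d_data(data, data + N);
int gpu_sum = thrust::reduce(d_data.begin(), d_data.end());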


Your code invoking the kernel must be processed by the NVCC compiler (<<< is not valid C++). Typically that means putting it in the .cu file. You do not want to move all your .cpp code into the .cu file (as you asked in a comment), just the code invoking the kernel.

Change

CUfunction reduce0;

to

void reduce_kernel(int*g_idata, int*g_odata);

and replace these lines:

dim3 dimBlock(512,0);
dim3 dimGrid(1,1);

reduce0<<< dimGrid,dimBlock >>>(Md,Md);  

with:

reduce_kernel(Md, Md);

and add this to your .cu file:

void reduce_kernel(int*g_idata, int*g_odata)
{
    dim3 dimBlock(512, 1);   // block dimensions must be at least 1; (512,0) is an invalid configuration
    dim3 dimGrid(1, 1);

    // reduce0 declares extern __shared__, so pass the dynamic shared-memory size as the third launch argument
    reduce0<<< dimGrid, dimBlock, 512 * sizeof(int) >>>(g_idata, g_odata);
}

This is off the top of my head, so it might be slightly off, but you get the idea.
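One extra debugging aid that is not part of the answer above: after calling the wrapper you can ask the runtime whether the launch was rejected, for instance because of a bad block size. cudaThreadSynchronize, cudaGetLastError and cudaGetErrorString are standard runtime-API calls of that CUDA generation:

reduce_kernel(Md, Md);

cudaThreadSynchronize();                      // wait for the kernel to finish before checking and timing
cudaError_t err = cudaGetLastError();         // reports launch/configuration and execution errors
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));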




In addition to the above, I think I found an error in your cudaMalloc call: it allocates only N bytes, but you are storing N ints, so the buffer is too small. It should instead be:

cudaMalloc((void**) &Md, sizeof(int)*N);


If you are on a Windows machine, check this article on setting up Visual Studio 2010 for CUDA 3.2: http://www.codeproject.com/Tips/186655/CUDA-3-2-on-VS2010-in-9-steps.aspx
