Question about memory allocation in CUDA kernel

Hey there, I have an array of size SIZE*sizeof(double) on my host. I allocate a device pointer of the same size as the host array and copy the array to the device. Now I pass this device array dev_point to my kernel function. Each thread needs to modify some values of the passed array and then call another __device__ function with the new array values (different for each thread). Now I wonder how to do this. Before, I had a complete CPU version (serial code) of my program, and I simply created a new array each time like double new_array[ SIZE ], copied the data from the original array into it, modified it, and then deleted it again. But how do I do it in CUDA, since I can't allocate memory from within a kernel function? Is there any possibility to use the 'local' memory of the thread to store a new array in? Or do I have to allocate a big array of size SIZE * number_of_total_threads * sizeof(double) before calling my kernel function so that each thread can store its modified array in it? Thanks a lot in advance!

EDIT: FULL DESCRIPTION! Okay, here's a better problem description of my 'current' case: In my host program I have an array of doubles with, let's say, 300 values (between 100 and 400 depending on the user input; let's call that variable numValues, a size that depends on the program parameters and is not hardcoded into the program!). Now I want each kernel execution to take exactly this array (it is actually copied to the GPU and passed to the kernel function as a pointer) and change the value of the n-th element (n = unique identifier which goes from 0 to numValues), whereas all other array elements stay the same. The modification is a simple addition of a certain constant value, which is also passed to the program by the user. And then call a function which is defined like this: __device__ double thefunction(double *ary), passing the modified array.

So the first solution I thought about was the one I asked about here: give each thread (each kernel execution) its own array copy (I thought this could be done locally, but obviously it can't because numValues is only known at run time) and then let each thread modify value n and calculate thefunction with it.

Now I just came up with another idea while writing this: perhaps it would be better to have the array in constant or shared memory ONCE, so that each thread passes the unmodified array to thefunction but specifies, as additional parameters to thefunction, an index int idx of the element to modify and another parameter double *add with the value to add to the idx-th element. The only thing I wonder about then is: how do I add the value *add to the idx-th element without modifying the array ary passed to the function, because it is passed as a pointer and I don't want to modify the original one?

Thanks!

If you only need to modify some values, as you say, then at the level of the data required to compute a single result, it follows you should only need a local copy of some of the input data. That local copy should, ideally, be held in registers, which is the fastest place to store thread local data.
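For the case described in the question's edit, where each thread changes exactly one element, the "local copy" can shrink to a single register value. The following is a minimal sketch under assumptions: the body of thefunction is a placeholder (sum of squares), its extra parameters n, idx and add are hypothetical additions to the asker's signature (add is passed by value here rather than as double *add), and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the asker's thefunction. It evaluates the array
// as if element idx had been increased by add, without ever writing to ary.
__device__ double thefunction(const double *ary, int n, int idx, double add)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        // The "modified" value exists only in a register.
        double v = (i == idx) ? ary[i] + add : ary[i];
        sum += v * v;  // placeholder computation (sum of squares)
    }
    return sum;
}

// One thread per element: thread n perturbs element n only.
__global__ void perturb_kernel(const double *ary, int numValues, double add,
                               double *out)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < numValues)
        out[n] = thefunction(ary, numValues, n, add);
}

// Host helper (hypothetical name): copies the input to the device, launches
// the kernel, and returns one result per perturbed element.
std::vector<double> run_perturb(const std::vector<double> &host, double add)
{
    int n = static_cast<int>(host.size());
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    perturb_kernel<<<blocks, threads>>>(d_in, n, add, d_out);

    std::vector<double> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the input array is only read, every thread can share the same device copy, and no per-thread array is needed at all. Whether this works for the real thefunction depends on whether it can be rewritten to apply the change while reading.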

If there is a data dependency between the local modifications required for one result and the local modifications required for another, the usual solution is to group the computation of interdependent results into a single block and use shared memory to hold the modified data, which allows data exchange between threads within that block. If your algorithm is something like a recurrence relation where there is a sequential dependence on data from a previous calculation, you need a new algorithm, because those types of calculations cannot be easily executed in parallel in any useful way.

That is about as good an answer as you will get without specifics of the code and algorithms involved.

If an array will be used by a single GPU thread, you can allocate it in local memory or global memory. A local-memory array is declared inside the kernel function, and works exactly like declaring a local variable in C. You can only use local memory if SIZE is a compile-time integer constant, not a run-time value. Otherwise, you will have to use global memory. To use global memory, as you said, allocate a big array before launching the kernel. Each thread would compute the offset into the big array that it would use as its private array.
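The global-memory variant could be sketched as follows. This is an illustration under assumptions, not the asker's code: thefunction is a placeholder (a weighted sum), the names scratch_kernel and run_scratch are hypothetical, one thread is launched per array element as in the question's edit, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the asker's thefunction (placeholder: weighted sum).
__device__ double thefunction(const double *ary, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (i + 1) * ary[i];
    return sum;
}

__global__ void scratch_kernel(const double *ary, int numValues, double add,
                               double *scratch, double *out)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= numValues) return;

    // Thread n owns the slice [n * numValues, (n + 1) * numValues) of scratch.
    double *mine = scratch + (size_t)n * numValues;
    for (int i = 0; i < numValues; ++i)
        mine[i] = ary[i];   // private copy of the input
    mine[n] += add;         // modify only the n-th element
    out[n] = thefunction(mine, numValues);
}

// Host helper (hypothetical name): allocates one private slice per thread.
std::vector<double> run_scratch(const std::vector<double> &host, double add)
{
    int n = static_cast<int>(host.size());
    double *d_in = nullptr, *d_scratch = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_scratch, (size_t)n * n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scratch_kernel<<<blocks, threads>>>(d_in, n, add, d_scratch, d_out);

    std::vector<double> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_scratch);
    cudaFree(d_out);
    return result;
}
```

With numValues = 400 and one thread per element, the scratch buffer holds 400 * 400 doubles, about 1.28 MB. Every thread also writes and re-reads its whole slice in off-chip memory, which is the cost the next paragraph warns about.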

However, you should consider reorganizing the algorithm on the GPU. Unlike a CPU, where one thread can use a large cache, the on-chip memory on a GPU is shared by hundreds of threads. There is not enough on-chip memory for each thread to have a large private array, and off-chip memory is much slower, so a large private array per thread in off-chip memory is costly. (For example, a typical kernel on a G80 architecture would have 256 threads using 16KB of shared memory, which limits the maximum private array size to 16384 / 256 = 64 bytes per thread, i.e. eight doubles.) Is it possible to parallelize the computation of a private array?
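One possible reorganization, following the idea raised in the question itself, is to keep a single unmodified copy of the array per block in shared memory and let each thread apply its one-element change while reading. A rough sketch, assuming the array fits in shared memory (400 doubles are 3200 bytes) and using a placeholder body and a hypothetical extended signature for thefunction; error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the asker's thefunction: reads the array as if
// element idx were increased by add (placeholder: sum of squares).
__device__ double thefunction(const double *ary, int n, int idx, double add)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double v = (i == idx) ? ary[i] + add : ary[i];
        sum += v * v;
    }
    return sum;
}

__global__ void shared_kernel(const double *ary, int numValues, double add,
                              double *out)
{
    // Dynamic shared memory: the size is given at launch, so numValues
    // may be a run-time value.
    extern __shared__ double s_ary[];

    // Cooperative load: the threads of a block stride over the array.
    for (int i = threadIdx.x; i < numValues; i += blockDim.x)
        s_ary[i] = ary[i];
    __syncthreads();  // every thread reaches this before any early exit

    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < numValues)
        out[n] = thefunction(s_ary, numValues, n, add);
}

// Host helper (hypothetical name).
std::vector<double> run_shared(const std::vector<double> &host, double add)
{
    int n = static_cast<int>(host.size());
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    size_t sharedBytes = n * sizeof(double);  // third launch parameter
    shared_kernel<<<blocks, threads, sharedBytes>>>(d_in, n, add, d_out);

    std::vector<double> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Here the shared array is read-only after the load, so no thread needs a private copy and the per-thread memory budget stops being the limiting factor.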