CUDA algorithm structure

I would like to understand the general way of doing the following on a GPU using CUDA.

I have an algorithm that might look something like this:

void DoStuff(int[,] inputMatrix, int[,] outputMatrix)
{
    for (...) {          // outer loop, e.g. over rows
        for (...) {      // inner loop
            if (something) {
                DoStuffA(inputMatrix, a, b, c, outputMatrix);
            }
            else {
                DoStuffB(inputMatrix, a, b, c, outputMatrix);
            }
        }
    }
}

DoStuffA and DoStuffB are simple parallelizable functions (e.g. performing a matrix row operation) of the kind the CUDA examples cover plenty of.
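For concreteness, a row operation of that sort might look roughly like the CUDA kernel below; the name, signature, and placeholder arithmetic are purely illustrative, not taken from any sample:

// Illustrative only: one thread per element of row `row`, so a single
// kernel launch processes the whole row in parallel.
__global__ void RowOp(const int* in, int* out, int row, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        out[row * cols + col] = in[row * cols + col] + 1;  // placeholder op
}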

What I want to know is how to put the main algorithm "DoStuff" onto the GPU and then call DoStuffA and DoStuffB as and when I need to (with those calls executing in parallel). I.e. the outer loop part is single-threaded, but the inner calls are not.

The examples I have seen seem to be multithreaded from the get-go. I assume there is a way to just call a single GPU-based method from the outside world and have it control all of the parallel bits by itself?


It depends on how the data in the for loops interrelates, but roughly I would:

  1. Pack all input matrices into one block of memory
  2. Upload the input matrices
  3. Run the for loops on the CPU, calling a kernel for DoStuffA or DoStuffB as needed
  4. Download the output matrices in one block

This way, the biggest cost is the overhead of launching each kernel. If your input data is large, it won't be so bad.
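As a rough, self-contained sketch of those four steps in CUDA C++ (the kernel bodies, the branch condition, and the matrix dimensions are all assumed for illustration; only the overall structure matters):

#include <cuda_runtime.h>

// Stand-ins for DoStuffA/DoStuffB: each parallelizes one row operation.
__global__ void DoStuffA(const int* in, int* out, int row, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        out[row * cols + col] = in[row * cols + col] + 1;   // placeholder op
}

__global__ void DoStuffB(const int* in, int* out, int row, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        out[row * cols + col] = in[row * cols + col] * 2;   // placeholder op
}

void DoStuff(const int* hostIn, int* hostOut, int rows, int cols)
{
    int *devIn = nullptr, *devOut = nullptr;
    size_t bytes = (size_t)rows * cols * sizeof(int);

    // Steps 1-2: pack the input into one block and upload it once.
    cudaMalloc(&devIn, bytes);
    cudaMalloc(&devOut, bytes);
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (cols + threads - 1) / threads;

    // Step 3: the outer loop stays single-threaded on the CPU; each
    // iteration launches a kernel whose threads run in parallel.
    for (int row = 0; row < rows; ++row) {
        if (row % 2 == 0)   // stand-in for the "if (something)" test
            DoStuffA<<<blocks, threads>>>(devIn, devOut, row, cols);
        else
            DoStuffB<<<blocks, threads>>>(devIn, devOut, row, cols);
    }

    // Step 4: download the whole output in one transfer. cudaMemcpy on
    // the default stream waits for the queued kernels to finish first.
    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}

If the per-row work turns out to be small, batching several iterations into one kernel launch is a common way to cut down that launch overhead.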
