How to synchronize global memory between multiple kernel launches?

I want to launch the following kernel multiple times in a FOR loop (pseudo):

 __global__ void kernel(utilCuda::Matrix<PREC> t_dev) {   // t_dev is the input array in global memory

    __shared__ PREC tt[BLOCK_DIM];

    if (thid < m) {
       tt[thid] = t_dev.data[ii];          // MEM READ!
    }

    ... // MODIFY 

    __syncthreads();

    if (thid < m) {
        t_dev.data[ii] = tt[thid];    // MEM WRITE!
    }

    __threadfence(); // or __syncthreads();  //// NECESSARY!! but why?

}

Conceptually, I read values from t_dev, modify them, write them back out to global memory, and then launch the same kernel again!

Why do I apparently need the __threadfence() or __syncthreads()? Without it the results are wrong, because the memory writes are not finished when the same kernel starts again. That is what seems to happen here; my GTX 580 has device overlap enabled.

But why are global memory writes not finished when the next kernel starts? Is this because of device overlap, or is it always like that? I thought that when we launch kernel after kernel, memory reads/writes are finished before the next kernel runs... :-)

Thanks for your answers!

SOME CODE :

 for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){

      proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
              mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
              t_dev,
              kernelAIdx,
              pConvergedFlag_dev,
              m_absTOL,m_relTOL);


      proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
              t_dev,
              T_dev,
              x_new_dev,
              kernelAIdx
              );

        }

These are the two kernels in the loop; t_dev and x_new_dev are passed from Step A to Step B.

Kernel A looks as follows:

 template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
 __global__ void sorProxContactOrdered_1threads_StepA_kernel( 
  utilCuda::Matrix<PREC> mu_dev,
  utilCuda::Matrix<PREC> y_dev,
  utilCuda::Matrix<PREC> T_dev,  
  utilCuda::Matrix<PREC> x_old_dev,
  utilCuda::Matrix<PREC> d_dev, 
  utilCuda::Matrix<PREC> t_dev, 
  int kernelAIdx,
  int maxNContacts, 
  bool * convergedFlag_dev, 
  PREC _absTOL, PREC _relTOL){

 //__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY

 // Assumed 1 block, with THREADS_PER_BLOCK threads and a column-major matrix T_dev

     int thid = threadIdx.x;
     int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
     int i = kernelAIdx * BLOCK_DIM;
     int ii = i + thid;

     //First copy x_old_dev in shared
     __shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if it's within the limit!
     __shared__ PREC tt[BLOCK_DIM];

     if(thid < m){
        xx[thid] = x_old_dev.data[ii];
        tt[thid] = t_dev.data[ii];
     }
     __syncthreads();


     PREC absTOL = _absTOL;
     PREC relTOL = _relTOL;

     int jj;
     //PREC T_iijj;
     //Offset the T_dev_ptr to the start of the Block
     PREC * T_dev_ptr  = PtrElem_ColM(T_dev,i,i);
     PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
     __syncthreads();
     for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){

        //Select the number of threads we need!

        // Here we process one [m x PROX_PACKAGE_SIZE] Block

        // First  Normal Direction ==========================================================
        jj =  i  +  j_t;
        __syncthreads();

        if( ii == jj ){ // select thread on the diagonal ...

           PREC x_new_n = (d_dev.data[ii] + tt[thid]);

           //Prox Normal! 
           if(x_new_n <= 0.0){
              x_new_n = 0.0;
           }
          /* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
              *convergedFlag_dev = 0;
           }*/

           xx[thid] = x_new_n;
           tt[thid] = 0.0;
        }
        // all threads not on the diagonal fall into this sync!
        __syncthreads();


        // Select only m threads!
        if(thid < m){
           tt[thid] += T_dev_ptr[thid] * xx[j_t];
        }
        // ====================================================================================


        // we need to synchronize here because one thread finishes lambda_T2 using shared mem tt, which is updated by another thread!
        __syncthreads();



         // Second  Tangential Direction ==========================================================
        jj++;
        __syncthreads();
        if( ii == jj ){ // select thread on diagonal; one thread finishes the T1 and T2 directions.

           // Prox tangential
           PREC lambda_T1 =  (d_dev.data[ii] + tt[thid]);
           PREC lambda_T2 =  (d_dev.data[ii+1] + tt[thid+1]);
           PREC radius = (*mu_dev_ptr) * xx[thid-1];
           PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);

           if(absvalue > radius){
              lambda_T1   =  (lambda_T1  * radius ) / absvalue;
              lambda_T2   =  (lambda_T2  * radius ) / absvalue;
           }


           /*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
              *convergedFlag_dev = 0;
           }

           if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
              *convergedFlag_dev = 0;
           }*/

           //Write the two values back!
           xx[thid] = lambda_T1;
           tt[thid] = 0.0;
           xx[thid+1] = lambda_T2;
           tt[thid+1] = 0.0;
        }

        // all threads not on the diagonal fall into this sync!
        __syncthreads();


        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        __syncthreads();
        if(thid < m){
           tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
        }
        __syncthreads();
        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        __syncthreads();
        if(thid < m){
           tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
        }
        // ====================================================================================


        __syncthreads();
        // move T_dev_ptr 1 column
        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        // move mu_dev_ptr to the next contact
        __syncthreads();
        mu_dev_ptr = &mu_dev_ptr[1];
        __syncthreads();

     }
     __syncthreads();
     // Write back the results; we don't need to synchronize here,
     // but do it anyway to be safe for testing first!

     if(thid < m){
        y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
        t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
     }


     //__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORK!
 }

I compare the solution at the end with the CPU, and for now I have put __syncthreads() everywhere I can, just to be safe (this code does a Gauss-Seidel sweep). But it does not work at all without the __threadfence() at the END, or at the BEGINNING where it makes no sense...

Sorry for so much code, but perhaps you can guess where the problem comes from, because I am at my wits' end trying to explain why this happens. We checked the algorithm several times; there is no memory error (according to Nsight) or anything else, everything works fine... Kernel A is launched with ONE block only!


If you launch successive instances of the kernel into the same stream, each kernel launch is serialized with respect to the kernel instances before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution on kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
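
As a sketch of that guarantee (the kernels, names and sizes below are made-up placeholders, not the kernels from the question), two kernels issued into the same stream run in issue order, so the second one sees the global memory writes of the first:

 #include <cuda_runtime.h>

 // Hypothetical stand-in kernels that communicate through global memory.
 __global__ void stepA(int *buf) { buf[threadIdx.x] = threadIdx.x; }  // writes
 __global__ void stepB(int *buf) { buf[threadIdx.x] *= 10; }          // reads A's result

 int main() {
     int *buf = 0;
     cudaMalloc(&buf, 128 * sizeof(int));

     cudaStream_t s;
     cudaStreamCreate(&s);

     // Same stream: stepB is guaranteed not to start until stepA has
     // completed, including all of its global memory writes.
     stepA<<<1, 128, 0, s>>>(buf);
     stepB<<<1, 128, 0, s>>>(buf);

     cudaStreamSynchronize(s);   // wait on the host before using the result
     cudaStreamDestroy(s);
     cudaFree(buf);
     return 0;
 }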

Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about - __threadfence works only at the scope of all active threads and __syncthreads is an intra-block barrier operation. If you really want kernel to kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (cuda 4.0 and later), or the per-stream equivalent if you are using streams.
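
For instance, a minimal sketch of such a host-side barrier around a launch loop of the same shape as yours (the kernel names, arguments and sizes here are assumed placeholders, not your wrappers):

 #include <cuda_runtime.h>

 // Placeholder kernels standing in for Step A and Step B.
 __global__ void stepA_kernel(float *t) { t[threadIdx.x] += 1.0f; }
 __global__ void stepB_kernel(float *t) { t[threadIdx.x] *= 2.0f; }

 int main() {
     const int n = 256, loops = 10;
     float *t_dev = 0;
     cudaMalloc(&t_dev, n * sizeof(float));
     cudaMemset(t_dev, 0, n * sizeof(float));

     for (int kernelAIdx = 0; kernelAIdx < loops; ++kernelAIdx) {
         // Both launches go into the default stream, so B never starts
         // before A has finished, and the next A never starts before B.
         stepA_kernel<<<1, n>>>(t_dev);
         stepB_kernel<<<1, n>>>(t_dev);
     }

     // Host-side barrier: blocks the CPU until all queued work is done.
     cudaDeviceSynchronize();       // CUDA 4.0 and later
     // cudaThreadSynchronize();    // pre-4.0 equivalent
     // cudaStreamSynchronize(s);   // per-stream variant if you use streams

     cudaFree(t_dev);
     return 0;
 }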


While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.

Writes to global memory, with the exception of atomic functions, are not guaranteed to be immediately visible to other threads (from the same block or from different blocks). By putting __threadfence() there you halt the current thread until its writes are in fact visible to the whole device. This can matter in particular when global memory is cached (the Fermi series).
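
The situation where __threadfence() genuinely matters is inter-block communication within a single kernel launch. As a hedged illustration (loosely modeled on the well-known "last block finishes the reduction" pattern, not on your solver), the fence orders the partial-result write before the atomic "I am done" signal:

 // Assumes blockDim.x == 256 (a power of two); names are illustrative.
 __device__ unsigned int blocksDone = 0;

 __global__ void reduceSum(const float *in, float *partial, float *out, int n) {
     __shared__ float s[256];
     int tid = threadIdx.x;
     int idx = blockIdx.x * blockDim.x + tid;
     s[tid] = (idx < n) ? in[idx] : 0.0f;
     __syncthreads();

     // Standard shared-memory tree reduction within the block.
     for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
         if (tid < stride) s[tid] += s[tid + stride];
         __syncthreads();
     }

     if (tid == 0) partial[blockIdx.x] = s[0];

     // Make this block's write to partial[] visible to the whole device
     // *before* we announce completion via the atomic counter.
     __threadfence();

     __shared__ bool isLastBlock;
     if (tid == 0) {
         unsigned int done = atomicInc(&blocksDone, gridDim.x);
         isLastBlock = (done == gridDim.x - 1);
     }
     __syncthreads();

     // The last block to finish can now safely read every partial result.
     if (isLastBlock && tid == 0) {
         float total = 0.0f;
         for (int b = 0; b < gridDim.x; ++b) total += partial[b];
         *out = total;
         blocksDone = 0;   // reset for the next launch
     }
 }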

One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.

Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Note that synchronous memory operations (e.g. cudaMemcpy without the "Async" suffix) essentially behave like cudaThreadSynchronize too.
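
To make that concrete, a small assumed sketch (the kernel is a stand-in): the launches only enqueue work and return immediately, and either an explicit synchronize or a plain (non-Async) cudaMemcpy is what actually waits for the GPU:

 #include <cuda_runtime.h>
 #include <cstdio>

 __global__ void busyKernel(float *x) { x[threadIdx.x] += 1.0f; }

 int main() {
     const int n = 256;
     float *d = 0;
     cudaMalloc(&d, n * sizeof(float));
     cudaMemset(d, 0, n * sizeof(float));

     // These calls only enqueue work; the host continues immediately.
     for (int i = 0; i < 1000; ++i)
         busyKernel<<<1, n>>>(d);

     // Option 1: explicit host-side barrier.
     cudaDeviceSynchronize();

     // Option 2: a synchronous copy also waits for all preceding work
     // in the stream before copying, so h[] reflects every launch.
     float h[n];
     cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
     printf("h[0] = %f\n", h[0]);   // expected 1000.0

     cudaFree(d);
     return 0;
 }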

