CUDA: Shared memory allocation with overlapping borders


Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?

The simple example is string searching: say I want to dice up the input text and have each thread in each block search for a pattern starting from text[thread_id], but I want the data assigned to each block to overlap by the pattern length, so that matches falling across the border are still found.

I.e. the total shared memory allocated to each block is

(blocksize+patternlength)*sizeof(char)

I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.

UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).

Say I have a dataset QWERTYUIOP, and I want to search for a 3-character match, and I dice up the dataset (arbitrarily) into 4s for each thread block: QWER TYUI OPxx

This is simple enough to accomplish, but the algorithm fails if the 3-character match we are looking for is actually IOP.

In this case, what I want is for each block to have in shared memory:

QWERTY TYUIOP OPxxxx

i.e. each block gets assigned blocksize+patternlength-1 characters, so no border issues occur.

Hope that explains things better.
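To make the arithmetic concrete, here is a minimal host-side sketch of that chunking (blocksize, patternlength and numblocks are illustrative names, not part of the actual code):

#include <cstdio>

int main()
{
    const int blocksize = 4, patternlength = 3, numblocks = 3;
    for (int b = 0; b < numblocks; b++)
    {
        int start = b * blocksize;                // 0, 4, 8
        int len = blocksize + patternlength - 1;  // 6 characters per block
        printf("block %d gets text[%d..%d]\n", b, start, start + len - 1);
        // block 0: QWERTY, block 1: TYUIOP, block 2: OPxxxx
    }
    return 0;
}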

Since @jmilloy is being persistent... :P

//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (T[startIndex+i] != P[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}
//VERSION 2: Texture (texT and texP are texture references bound on the host)
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}
//Version 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    extern __shared__ char shaP[];
    for (int i=0; threadIdx.x+i < lenP; i+=blockDim.x){
        shaP[threadIdx.x+i] = tex1Dfetch(texP,threadIdx.x+i);
    }
    __syncthreads();

    //At this point shaP is populated with the pattern
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}

What I would like to do is put the text into shared memory chunks, as described in the rest of the question, instead of keeping it in texture memory as in the later versions.


I am not sure the question makes all that much sense. You can dynamically size a shared memory allocation at runtime like this:

__global__ void kernel()
{
    extern __shared__ int buffer[];
    ....
}

kernel<<< gridsize, blocksize, buffersize >>>();

but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap you want, to ensure that your pattern matching works as intended.
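For example, a minimal sketch of one such scheme (the kernel name and the text/lenT/lenP parameters are illustrative, not taken from the question's code); every thread copies its own character, and the strided loop picks up the lenP-1 overlap characters from the next block's chunk:

__global__ void loadWithOverlap(const char *text, int lenT, int lenP)
{
    extern __shared__ char s[];
    int base = blockIdx.x * blockDim.x;  // first character owned by this block
    // cooperatively copy blockDim.x + lenP - 1 characters, overlap included
    for (int i = threadIdx.x; i < blockDim.x + lenP - 1; i += blockDim.x)
    {
        if (base + i < lenT)
            s[i] = text[base + i];
    }
    __syncthreads();
    // ... each thread can now match against s[threadIdx.x .. threadIdx.x + lenP - 1] ...
}

launched with the overlap included in the dynamic allocation:

loadWithOverlap<<< gridsize, blocksize, (blocksize + lenP - 1) * sizeof(char) >>>(T, lenT, lenP);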


No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.

As far as I know, shared memory physically resides on the multiprocessors, and a thread can only access the shared memory of the multiprocessor it is running on, so this is a physical limitation. (I guess if two blocks reside on one multiprocessor, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block.)

Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.

I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is to start with a version that solves the problem without using any shared memory at all. To make that work you will have to solve your overlap problem, and then the shared memory implementation will be easy!


edit 2
after answer was marked as correct

__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    extern __shared__ char shared[];

    char* shaP = &shared[0];
    char* shaT = &shared[lenP];

    //copy pattern into shaP in parallel (assumes lenP <= blockDim.x)
    if(threadIdx.x < lenP)
        shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);

    //determine texT start and length for this block
    //(the kernel assumes gridDim.x * blockDim.x == lenT, so each block owns blockDim.x characters)
    int blockStartIndex = blockIdx.x * blockDim.x;
    int lenS = blockDim.x + lenP - 1;    //size of shaT: the block's chunk plus the overlap

    //copy text into shaT in parallel, including the lenP-1 overlap characters
    //from the next block's chunk (the text is assumed padded so these reads stay in bounds)
    shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
    if(threadIdx.x < lenP - 1)
        shaT[blockDim.x + threadIdx.x] = tex1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x);

    __syncthreads();

    //We have one copy of the pattern in shaP, shared by every thread in the block
    //We have the necessary portion of the text (with overlap) in shaT

    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}

key notes:

  • we only need one copy of the pattern in shared memory per block - they can all use it
  • shared memory needed per block is lenP + lenS bytes, where lenS is your blocksize + patternlength - 1 (see the launch sketch below)
  • the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
  • we can copy into shared memory in parallel (no need for for loops if you have enough threads)
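For completeness, a hedged sketch of how this kernel might be launched (the variable names are illustrative, and texT/texP are assumed to already be bound to the text and pattern on the host):

int blocksize = 256;                           // one thread per text position
int gridsize = lenT / blocksize;               // the kernel assumes gridsize * blocksize == lenT
int lenS = blocksize + lenP - 1;               // the block's chunk plus the overlap
size_t shbytes = (lenP + lenS) * sizeof(char); // dynamic shared memory: pattern + text chunk
gpuSearchTexSha<<< gridsize, blocksize, shbytes >>>(lenT, lenP, pFound);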


Overlapping shared memory is not good: threads would have to synchronize each time they want to access the same address in shared memory (although on architectures >= 2.0 this has been mitigated).

The simplest idea that comes to mind is to duplicate the portion of the text that you want to be overlapped.

Instead of reading from global memory in exact chunks:

AAAA BBBB CCCC DDDD EEEE

read with an overlap, duplicating the characters at each border:

AAAAB BBBBC CCCCD DDDDE EEEE
