
How to optimize Conway's game of life for CUDA?


I've written this CUDA kernel for Conway's game of life:

__global__ void gameOfLife(float* returnBuffer, int width, int height) {  
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;  
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;  
    float p = tex2D(inputTex, x, y);  
    float neighbors = 0;  
    neighbors += tex2D(inputTex, x+1, y);  
    neighbors += tex2D(inputTex, x-1, y);  
    neighbors += tex2D(inputTex, x, y+1);  
    neighbors += tex2D(inputTex, x, y-1);  
    neighbors += tex2D(inputTex, x+1, y+1);  
    neighbors += tex2D(inputTex, x-1, y-1);  
    neighbors += tex2D(inputTex, x-1, y+1);  
    neighbors += tex2D(inputTex, x+1, y-1);  
    __syncthreads();  
    float final = 0;  
    if(neighbors < 2) final = 0;  
    else if(neighbors > 3) final = 0;  
    else if(p != 0) final = 1;  
    else if(neighbors == 3) final = 1;  
    __syncthreads();  
    returnBuffer[x + y*width] = final;  
}

I am looking for errors/optimizations. Parallel programming is quite new to me and I am not sure I'm doing it right.

The rest is a memcpy from an input array to the 2D texture inputTex bound to a CUDA array. Output is memcpy-ed from global memory to host and then dealt with.
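For reference, the binding step is the standard cudaArray setup, roughly like this (a simplified sketch, not the exact code; variable names are placeholders):

// Sketch of the host-side texture setup (legacy texture-reference API).
texture<float, 2, cudaReadModeElementType> inputTex;

void uploadInput(const float* hostInput, int width, int height, cudaArray** cuArray)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(cuArray, &desc, width, height);
    cudaMemcpy2DToArray(*cuArray, 0, 0, hostInput,
                        width * sizeof(float),          // source pitch
                        width * sizeof(float), height,  // width in bytes, height in rows
                        cudaMemcpyHostToDevice);
    cudaBindTextureToArray(inputTex, *cuArray, desc);
}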

As you can see, a thread deals with a single pixel. I am unsure if that is the fastest way, as some sources suggest doing a row or more per thread. If I understand correctly, NVIDIA themselves say that the more threads, the better. I would love advice on this from someone with practical experience.


My two cents.

The whole thing looks likely to be bounded by the latency of communication between multiprocessors and the GPU memory. You have code that should take something like 30-50 clock ticks to execute on its own, and it generates at least 3 memory accesses which take 200+ clock ticks each if the requisite data is not in the cache.

Using texture memory is a good way to address that, but it is not necessarily the optimal way.

At the very least, try to do 4 pixels at a time (horizontally) per thread. Global memory can be accessed 128 bytes at a time (as long as you have a warp trying to access any byte in a 128-byte interval, you might as well pull in the whole cache line at almost no additional cost). Since a warp is 32 threads, having each thread work on 4 pixels should be efficient.

Furthermore, you want to have vertically-adjacent pixels worked on by the same multiprocessor. The reason is that adjacent rows share the same input data. If you have the pixel (x=0,y=0) worked on by one MP and the pixel (x=0,y=1) is worked on by a different MP, both MPs have to issue three global memory requests each. If they are both worked on by the same MP and the results are properly cached (implicitly or explicitly), you only need a total of four. This can be done by having each thread work on several vertical pixels, or by having blockDim.y>1.
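Something along these lines (a rough sketch, not the poster's code: it assumes a plain unsigned char grid in global memory rather than the float texture, anticipating the buffer-type point below; CELLS_PER_THREAD and the simplistic border handling are my own choices):

// Each thread handles 4 horizontally adjacent cells; launched with e.g.
// dim3 block(32, 8), so each 32-thread warp covers one contiguous 128-byte
// row segment and each block covers eight adjacent rows on the same MP.
#define CELLS_PER_THREAD 4

__global__ void lifeFourWide(const unsigned char* in, unsigned char* out,
                             int width, int height)
{
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * CELLS_PER_THREAD;
    int y  =  blockIdx.y * blockDim.y + threadIdx.y;
    if (y <= 0 || y >= height - 1) return;        // skip border rows for brevity

    for (int i = 0; i < CELLS_PER_THREAD; ++i) {
        int x = x0 + i;
        if (x <= 0 || x >= width - 1) continue;   // skip border columns too

        const unsigned char* up   = in + (y - 1) * width;
        const unsigned char* mid  = in +  y      * width;
        const unsigned char* down = in + (y + 1) * width;

        int n = up[x - 1]   + up[x]   + up[x + 1]
              + mid[x - 1]            + mid[x + 1]
              + down[x - 1] + down[x] + down[x + 1];

        out[y * width + x] = (n == 3) || (mid[x] && n == 2);
    }
}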

More generally, you'd probably want to have each 32-thread warp load as much memory as you have available on the MP (16-48 kb, or at least a 128x128 block), and then process all pixels within that window.

On devices of compute capability before 2.0, you'll want to use shared memory. On devices of compute capability 2.0 and 2.1, the caching capabilities are much improved, so global memory may be fine.
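For the shared-memory route, a minimal sketch might look like this (again assuming an unsigned char grid; TILE_X/TILE_Y, the launch config dim3 block(TILE_X, TILE_Y), and one cell per thread are my simplifications):

// Each block cooperatively loads its tile plus a one-cell halo into shared
// memory, then every thread updates its own cell from the cached copy.
#define TILE_X 32
#define TILE_Y 8

__global__ void lifeShared(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    __shared__ unsigned char tile[TILE_Y + 2][TILE_X + 2];

    int x = blockIdx.x * TILE_X + threadIdx.x;
    int y = blockIdx.y * TILE_Y + threadIdx.y;

    // Load the tile plus halo; cells outside the grid are treated as dead.
    for (int dy = threadIdx.y; dy < TILE_Y + 2; dy += TILE_Y)
        for (int dx = threadIdx.x; dx < TILE_X + 2; dx += TILE_X) {
            int gx = blockIdx.x * TILE_X + dx - 1;
            int gy = blockIdx.y * TILE_Y + dy - 1;
            tile[dy][dx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                         ? in[gy * width + gx] : 0;
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    int n = tile[ty - 1][tx - 1] + tile[ty - 1][tx] + tile[ty - 1][tx + 1]
          + tile[ty    ][tx - 1]                    + tile[ty    ][tx + 1]
          + tile[ty + 1][tx - 1] + tile[ty + 1][tx] + tile[ty + 1][tx + 1];

    out[y * width + x] = (n == 3) || (tile[ty][tx] && n == 2);
}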

Some nontrivial savings could be had by making sure that each warp only accesses two cache lines in each horizontal row of input pixels, instead of three, as would happen in a naive implementation that works on 4 pixels per thread, 32 threads per warp.

There's no good reason to use float as the buffer type. Not only do you end up using four times the memory bandwidth, but the code becomes unreliable and bug-prone. (For example, are you sure that if(neighbors == 3) works correctly, since you're comparing a float and an integer?) Use unsigned char. Better yet, use uint8_t and typedef it to mean unsigned char if it's not defined.
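A minimal sketch of that change (assuming <stdint.h> is available; the name cell_t is mine):

#include <stdint.h>            /* provides uint8_t on any modern toolchain   */
/* typedef unsigned char uint8_t;  -- fallback if <stdint.h> is unavailable  */

typedef uint8_t cell_t;        /* one byte per cell instead of a 4-byte float */

/* The update rule then becomes exact integer arithmetic: */
__host__ __device__ inline cell_t nextState(cell_t alive, int neighbors)
{
    return (cell_t)(neighbors == 3 || (alive && neighbors == 2));
}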

Finally, don't underestimate the value of experimenting. Quite often performance of CUDA code can't be easily explained by logic and you have to resort to tweaking parameters and seeing what happens.


TL;DR: see http://golly.sourceforge.net

The problem is that most CUDA implementations follow the brain-dead idea of manually counting the neighbors. This is so dead slow that any smart serial CPU implementation will outperform it.

The only sensible way to do GoL calculations is with a lookup table.
The currently fastest CPU implementations look up a square 4x4 = 16-bit block to get the future state of the 2x2 cells inside it.

In this setup the cells are laid out like so:

 01234567
0xxxxxxxx //byte0
1xxxxxxxx //byte1 
2  etc
3
4
5
6
7

Some bit-shifting is employed to fit a 4x4 block into a word, and that word is looked up in a lookup table. The lookup table holds words as well; this way 4 different versions of the outcome can be stored in the table, so you can minimize the amount of bit-shifting needed on the input and/or the output.
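For illustration, here is roughly how such a table could be precomputed (host-side sketch; the row-major bit layout with bit index row*4 + col is an assumption, and it is simplified to a single 4-bit result per entry rather than the pre-shifted variants described above):

#include <stdint.h>

/* Cell at (row, col) of a 4x4 block packed into 16 bits, row-major. */
static int cellAt(uint16_t block, int row, int col)
{
    return (block >> (row * 4 + col)) & 1;
}

/* For every possible 4x4 block, store the next state of its inner 2x2 cells
   (4 result bits): each inner cell has all eight neighbors inside the block. */
void buildLifeLut(uint8_t lut[65536])
{
    for (uint32_t b = 0; b < 65536; ++b) {
        uint8_t out = 0;
        for (int r = 1; r <= 2; ++r)
            for (int c = 1; c <= 2; ++c) {
                int n = 0;
                for (int dr = -1; dr <= 1; ++dr)
                    for (int dc = -1; dc <= 1; ++dc)
                        if (dr || dc)
                            n += cellAt((uint16_t)b, r + dr, c + dc);
                int next = (n == 3) || (cellAt((uint16_t)b, r, c) && n == 2);
                out |= (uint8_t)(next << ((r - 1) * 2 + (c - 1)));
            }
        lut[b] = out;
    }
}

A kernel (or CPU loop) then assembles the 16-bit index from neighboring bytes with a few shifts and masks, and the whole neighbor-counting code collapses into a single table read.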

In addition the different generations are staggered, so that you only have to look at 4 neighboring slabs, instead of 9. Like so:

AAAAAAAA 
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
AAAAAAAA   BBBBBBBB
           BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixel below and to the left of A.

This alone results in a 1000x+ speed-up compared to silly counting implementations.

Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.

And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time of normal implementations. This allows you to calculate generation 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately, HashLife requires recursion, and thus is difficult on CUDA.


Have a look at this thread, we made a lot of improvements over there...

http://forums.nvidia.com/index.php?showtopic=152757&st=60

