HELP! CUDA kernel will no longer launch after using too much memory


I'm writing a program that requires the following kernel launch:

dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);

I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.

Max threads per block: 1024
Max grid dimensions: 65535,65535,65535

Any suggestions appreciated, thanks in advance!


Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of resources, so the maximum possible number of threads cannot practically be launched by CUDA on your hardware.

You may have to make your CUDA code more efficient to be able to launch more threads. You could try splitting your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
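
As a quick test, you can shrink the launch configuration and check the error status after the launch; here is a minimal sketch, assuming the kernel and device pointers from the question are already set up (the reduced dimensions are arbitrary):

// assumes #include <cstdio> and the CUDA runtime headers
dim3 blocks(8, 8, 8);    // reduced grid, down from (16,16,16)
dim3 threads(16, 16);    // 256 threads per block instead of 1024
get_gaussian_responses<<<blocks, threads>>>(pDeviceIntegral, itgStepSize, pScaleSpace);

// The launch is asynchronous: cudaGetLastError catches configuration errors
// (invalid dimensions, too many resources requested), while the error code
// from cudaDeviceSynchronize catches failures during execution.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));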


If you compile your code like this:

nvcc -Xptxas="-v" [other compiler options]

the assembler will report the amount of registers, shared memory, and local memory that the kernel requires. This can be a useful diagnostic to see what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread stack and heap memory which a kernel can consume during execution.
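
A minimal sketch of querying and raising those limits (the sizes here are arbitrary; on newer toolkits the same calls are spelled cudaDeviceGetLimit/cudaDeviceSetLimit):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0, heapSize = 0;

    // Query the current per-thread stack size and the device malloc heap size.
    cudaThreadGetLimit(&stackSize, cudaLimitStackSize);
    cudaThreadGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("stack: %zu bytes per thread, heap: %zu bytes\n", stackSize, heapSize);

    // Raise the limits before the first kernel launch; the values are arbitrary.
    cudaThreadSetLimit(cudaLimitStackSize, 4096);
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    return 0;
}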

Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory accesses, including buffer overflows and illegal memory usage. It might be that your code is overflowing a buffer somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
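
Running it just means prefixing your executable with the tool (the program name here is a placeholder):

cuda-memcheck ./myprogram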


I got it! NVIDIA Nsight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.
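
If anyone hits something similar, it is worth confirming which toolkit the build is actually picking up, for example (on Windows; CUDA_INC_PATH is the variable Nsight modified above):

nvcc --version
echo %CUDA_INC_PATH%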

