I would like to take a picture of what's happening on my screen, but a screenshot won't capture it; the best description is snow.
One of my projects has a habit of randomly failing on a new iteration, and I always assumed it was a "You're using too much memory, fool!" error, so I was happy to restart, deal with it, and try to fix the problem.
Then I started actually monitoring the global memory allocated: it stays constant at around 70% free throughout execution, until the program suddenly dies on a fresh malloc.
To make matters more worrying, these Guru Meditations have started appearing regularly in my dmesg, all (that I've noticed) with the same address:
NVRM: Xid (0000:01:00): 13, 0008 00000000 000050c0 00000368 00000000 00000080
Any words from the wise on what the hell is going on? I'm still investigating issues with register and shared memory, but wanted to post this question in case anyone else has ideas.
If none of your CUDA memory allocations fail, then your problem isn't that you are out of memory (and even if it were, it could be due to fragmentation rather than 100%+ consumption).
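One cheap way to confirm that is to check the status of each allocation and query the free/total device memory right after it. This is a standalone sketch, not taken from your code; the buffer name and size are made up:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t nBytes = 64UL * 1024 * 1024;   // example size: 64 MB
        float *d_buf = NULL;
        cudaError_t err = cudaMalloc((void**)&d_buf, nBytes);

        size_t freeMem = 0, totalMem = 0;
        cudaMemGetInfo(&freeMem, &totalMem);  // free/total global memory in bytes

        if (err != cudaSuccess)
            fprintf(stderr, "cudaMalloc(%zu) failed: %s (free %zu of %zu bytes)\n",
                    nBytes, cudaGetErrorString(err), freeMem, totalMem);
        else
            printf("allocation ok, %zu of %zu bytes still free\n", freeMem, totalMem);

        cudaFree(d_buf);
        return 0;
    }

If every allocation reports cudaSuccess and cudaMemGetInfo still shows plenty of free memory right before the crash, an out-of-memory condition is unlikely to be the culprit.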
If you are getting an x-mas tree effect, then you probably have a kernel that is writing outside of its allocated memory. Check the indices of the pixels/array cells you are accessing and the offset calculation that maps them to positions in the output buffers.
You can also try using a 1D index when invoking the kernels, to make the calculations simpler (any multi-dimensional array can be modelled as one long 1D array).
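A minimal sketch of what that looks like in practice; the kernel and buffer names are illustrative, not from the original project. With a flattened 1D index, the out-of-bounds guard is a single comparison:

    // Each thread handles one element of a flattened (1D) buffer and bails out
    // if its index is past the end, so stray threads can never write outside
    // the allocation.
    __global__ void scalePixels(float *out, const float *in, int numPixels, float gain)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D index
        if (idx >= numPixels)                             // guard the tail block
            return;
        out[idx] = gain * in[idx];
    }

    // Launch with enough blocks to cover every element; the guard above
    // absorbs the excess threads in the last block.
    //   int threads = 256;
    //   int blocks  = (numPixels + threads - 1) / threads;
    //   scalePixels<<<blocks, threads>>>(d_out, d_in, numPixels, 1.5f);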
Please wrap all calls to the CUDA Runtime API with cudaSafeCall() and add a cudaCheckError() after every kernel invocation. These utility functions are exposed in cutil.h. This should help you catch any CUDA errors at the point where they actually happen, and their error messages should help your investigation.
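If you would rather not depend on cutil.h, macros along these lines do the same job. This is a hand-rolled sketch using the same names as above, not the cutil implementation:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Abort with file/line as soon as any runtime call reports an error.
    #define cudaSafeCall(call)                                            \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // Call right after a kernel launch; the synchronize surfaces errors
    // from the kernel itself rather than from a later, unrelated API call.
    #define cudaCheckError()                                              \
        do {                                                              \
            cudaSafeCall(cudaGetLastError());                             \
            cudaSafeCall(cudaDeviceSynchronize());                        \
        } while (0)

    // Usage sketch:
    //   cudaSafeCall(cudaMalloc((void**)&d_buf, nBytes));
    //   myKernel<<<blocks, threads>>>(d_buf);
    //   cudaCheckError();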