Context: CUDA 4.0, Linux 64bit, NVIDIA UNIX x86_64 Kernel Module 270.41.19, on a GeForce GTX 480.
I am trying to find a (device) memory leak in my program. I use the runtime API and cudaMemGetInfo(&free, &total) to measure device memory usage, and I notice a significant loss (in this case 31 MB) after kernel execution. The kernel code itself does not allocate any device memory, so I guess it's the kernel code that remains in device memory, even though I would have thought the kernel isn't that big. (Is there a way to determine the size of a kernel?)
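For reference, my measurement looks roughly like this (error checking omitted; `myKernel` is a placeholder for the real kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* stands in for the actual kernel */ }

int main()
{
    size_t free_before, free_after, total;

    cudaMemGetInfo(&free_before, &total);   // free device memory before the launch

    myKernel<<<1, 1>>>();
    cudaDeviceSynchronize();                // make sure the launch has completed

    cudaMemGetInfo(&free_after, &total);    // free device memory after the launch
    printf("lost %zu bytes\n", free_before - free_after);
    return 0;
}
```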
When is the kernel code loaded into device memory? I guess at execution of the host code line:
kernel<<<geom>>>(params);
Right? And does the code remain in device memory after the call? If so, can I explicitly unload the code?
What concerns me is device memory fragmentation. Think of a long sequence of alternating device memory allocations and kernel executions (different kernels). Then after a while device memory gets quite scarce. Even if you free some memory, the kernel code remains, leaving only the space between the kernels free for new allocations. This would result in huge memory fragmentation after a while. Is this the way CUDA was designed?
The memory allocation you are observing is used by the CUDA context. It doesn't only hold kernel code; it also holds any other static-scope device symbols, textures, per-thread scratch space for local memory, printf and heap, constant memory, as well as GPU memory required by the driver and CUDA runtime itself. Most of this memory is only ever allocated once, when a binary module is loaded or PTX code is JIT-compiled by the driver. It is probably best to think of it as a fixed overhead rather than a leak. There is a 2-million-instruction limit in PTX code, and current hardware uses 32-bit words for instructions, so the memory footprint of even the largest permissible kernel code is small compared to the other global memory overheads it requires.
In recent versions of CUDA there is a runtime API call, cudaDeviceSetLimit, which permits some control over the amount of scratch space a given context can consume. Be aware that it is possible to set the limits to values lower than the device code requires, in which case runtime execution failures can result.
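As a rough sketch, tightening those limits looks like this (the values here are purely illustrative, not recommendations, and setting them too low will break kernels that need the space):

```cuda
#include <cuda_runtime.h>

int main()
{
    // Shrink the per-thread local-memory stack and the device-side
    // printf FIFO and malloc heap for the current context. Each limit
    // must still be large enough for what the device code actually uses.
    cudaDeviceSetLimit(cudaLimitStackSize,      2048);     // bytes per thread
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // 1 MB FIFO
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 4 << 20);  // 4 MB heap

    // The driver may round a requested limit, so read back the value
    // that actually took effect.
    size_t stack_bytes;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    return 0;
}
```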