
Is using cudaHostAlloc good for my case?

I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.

Therefore, when a block finds the solution it should inform the CPU that the solution was found, so the CPU can print the solution provided by that block.

So what I am currently doing is the following:

__global__ void kernel(int *sol)
{
   // do some computations
   if (the block found a solution)       // pseudocode condition
        atomicExch(sol, blockIdx.x);     // record this block's ID atomically
}

Now on every call to the kernel I copy sol back to host memory and check its value. If it is set to 3, for example, I know that block 3 found the solution, so I know where the solution's data starts, and I copy the solution back to the host.
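
For reference, a minimal sketch of that host-side loop might look as follows. The launch configuration and the reset-to--1 convention are my assumptions, and kernel is the function above, taking an int* so the result is visible to the host:

const int num_blocks = 256;          // placeholder launch configuration
const int threads_per_block = 128;   // placeholder

int *d_sol;
cudaMalloc(&d_sol, sizeof(int));

int h_sol = -1;
while (h_sol < 0) {
    cudaMemset(d_sol, 0xFF, sizeof(int));            // reset flag to -1 (all bytes 0xFF)
    kernel<<<num_blocks, threads_per_block>>>(d_sol);
    // cudaMemcpy waits for the kernel, then brings back the single flag.
    cudaMemcpy(&h_sol, d_sol, sizeof(int), cudaMemcpyDeviceToHost);
}
// h_sol now holds blockIdx.x of the block that found the solution,
// which tells the host where the solution data starts.
cudaFree(d_sol);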

In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?


Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that whether you send 1 byte or 1 KB won't make much of a difference: in this case the problem is not bandwidth, but latency.

But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself, I wouldn't spend too much time on that single, small transfer.

Do note that if you choose to use mapped memory instead of cudaMemcpy, you will need to put an explicit cudaDeviceSynchronize (or cudaThreadSynchronize on older CUDA versions) barrier before reading the status, as opposed to the implicit barrier you get from cudaMemcpy. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
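
For illustration, here is a minimal sketch of that mapped-memory variant. The flag name, launch configuration, and reset-to--1 convention are assumptions, not taken from the question; kernel is assumed to take an int* as above:

int *h_sol, *d_sol;
cudaSetDeviceFlags(cudaDeviceMapHost);                    // must be set before the CUDA context is created
cudaHostAlloc(&h_sol, sizeof(int), cudaHostAllocMapped);  // pinned host memory, visible to the GPU
cudaHostGetDevicePointer(&d_sol, h_sol, 0);               // device-side alias of the same allocation

*h_sol = -1;                                              // -1 means "no solution yet"
while (*h_sol < 0) {
    kernel<<<num_blocks, threads_per_block>>>(d_sol);
    cudaDeviceSynchronize();   // explicit barrier: without it the host may read a stale value
}
printf("solution found by block %d\n", *h_sol);
cudaFreeHost(h_sol);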
