If I understand correctly, when you launch a CUDA kernel asynchronously, it may begin execution immediately, or it may wait for previous asynchronous calls (transfers, kernels, etc.) to complete first. (I also understand that kernels can run concurrently in some cases, but I want to ignore that for now.)
How can I find out the time between launching a kernel ("queuing") and when it actually begins execution? In fact, I really just want to know the average "queued time" across all launches in a single run of my program (generally in the tens or hundreds of thousands of kernel launches).
I can easily calculate the average execution time per kernel with events (~500 us). I tried to approximate the queue time myself: I logged the result of clock() every time a kernel was launched, with the idea that I could then determine how long the launch queue was when each kernel was launched. But clock() does not have high enough precision (0.01 s); sometimes as many as 60 kernels appear to be launched at a single instant, when of course in reality many are not.
Rather than clock(), use QueryPerformanceCounter (together with QueryPerformanceFrequency), which counts based on machine clock cycles and gives far finer resolution.
Code for QueryPerformanceCounter
Secondly, the profiling tool (Visual Profiler) only measures serial launches [see page 24] and [see post number 3].
Thus the best option is to (1) use QueryPerformanceCounter (or the Visual Profiler) to get an accurate measurement of a single launch, and (2) use QueryPerformanceCounter to time multiple launches and observe whether the timing results suggest that asynchronous launching took place.