CUDA Pinned memory for small data

I am running host-to-device bandwidth tests for different sizes of data, and have noticed an increased bandwidth when the host memory is pinned versus pageable. Following is my plot of bandwidth in MB/s vs. data transfer size in bytes. One can notice that for small amounts of data (<300 KB) pageable fares better than pinned... is this related to memory allocation by the OS? This bandwidth test program is from NVIDIA's CUDA SDK code samples (with slight modifications from my side), and I am testing on a Tesla C2050 using CUDA 4.0. The OS is 64-bit Linux.
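For reference, the only intended difference between the two modes in the SDK test is how the host buffer is allocated; roughly (variable names here are illustrative, not the SDK's):

// Pageable host memory: ordinary allocation; the driver must stage the data
// through an internal pinned buffer before the copy engine can DMA it.
float *h_pageable = (float *) malloc( memSize );

// Pinned (page-locked) host memory: the copy engine can DMA it directly,
// which is where the higher bandwidth for large transfers comes from.
float *h_pinned = NULL;
cudaHostAlloc( (void **) &h_pinned, memSize, cudaHostAllocDefault );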

The cudaMemcpy implementation has different code paths for different devices, source and destination locations, and data sizes, in order to try to achieve the best possible throughput.

The different rates you are seeing are probably due to the implementation switching as the array size changes.

For example, Fermi GPUs have both dedicated copy engines (which can run in parallel with kernels running on the SMs) and SMs that can access host memory directly over PCI-e. For smaller arrays, it may be more efficient for cudaMemcpy to be implemented as a kernel running on the SMs that reads host memory directly and stores the loaded data in device memory (or vice versa), so the driver may choose to do this. Or it may be more efficient to use the copy engine -- I'm not sure which it does in practice, but I think switching between them is the cause of the crossover you see in your graph.
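Here is a minimal sketch of the two mechanisms, made explicit (the buffer names, sizes, and kernel are mine for illustration; this is not what the driver does internally, just the two transfer paths written out by hand):

#include <cuda_runtime.h>

// (b) SM-driven copy: a kernel reads mapped (zero-copy) host memory over
// PCI-e and stores it into device memory.
__global__ void copyFromHost( const float *hostSrc, float *devDst, size_t n )
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        devDst[i] = hostSrc[i];   // each load crosses the PCI-e bus
}

int main( void )
{
    const size_t n = 1 << 16;
    float *h_pinned, *h_mapped, *d_buf, *d_mapped;

    cudaSetDeviceFlags( cudaDeviceMapHost );   // allow mapped host memory
    cudaHostAlloc( (void **) &h_pinned, n * sizeof(float), cudaHostAllocDefault );
    cudaHostAlloc( (void **) &h_mapped, n * sizeof(float), cudaHostAllocMapped );
    cudaMalloc( (void **) &d_buf, n * sizeof(float) );

    // (a) DMA through a dedicated copy engine.
    cudaMemcpy( d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice );

    // (b) The SMs pull the data across PCI-e themselves.
    cudaHostGetDevicePointer( (void **) &d_mapped, h_mapped, 0 );
    copyFromHost<<< (n + 255) / 256, 256 >>>( d_mapped, d_buf, n );
    cudaDeviceSynchronize();

    cudaFree( d_buf );
    cudaFreeHost( h_pinned );
    cudaFreeHost( h_mapped );
    return 0;
}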


It is possible that the test is cheating.

Here is one of the timed code sections:

cutilSafeCall( cudaEventRecord( start, 0 ) );
if( PINNED == memMode )
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
                                cudaMemcpyDeviceToHost, 0) );
    }
}
else
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                cudaMemcpyDeviceToHost) );
    }
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );

Note that the test uses different functions to perform the memcpy for the different kinds of memory. I think this is cheating, because the main difference between the modes should be how the memory is allocated: with cudaHostAlloc for pinned and with malloc for pageable.

Different memcpy functions can have different code paths for error checking and transfer setup.

So, try to modify the test to do the copy with cudaMemcpy() in both modes, e.g. by changing all the ifs after cudaEventRecord(...) to if( 0 && (PINNED == memMode) ).
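For illustration, the timed section would then become (only the if condition changes; the cudaMemcpyAsync branch stays in place but is never taken, so both modes go through cudaMemcpy and differ only in how h_odata was allocated):

cutilSafeCall( cudaEventRecord( start, 0 ) );
if( 0 && (PINNED == memMode) )   // never taken: force cudaMemcpy() in both modes
{
    /* original cudaMemcpyAsync() branch, now unused */
}
else
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                cudaMemcpyDeviceToHost) );
    }
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );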

