CUDA texture memory to bind a sub-portion of global memory

Posted on https://www.devze.com, 2023-02-20 15:05. Source: the web.
I am having a problem binding a sub-portion of global device memory to texture memory.

I have a large global device array filled with data as follows:

double *device_global;

cudaMalloc((void **)&device_global, sizeof(double)*N);

cudaMemcpy(device_global, host, sizeof(double)*N, cudaMemcpyHostToDevice);

I am running numerous kernels in a for loop.

Each kernel requires a small portion (int offset = 100) of device_global, which I am binding to a texture through:

cudaBindTexture(0, texRef, device_global, channelDesc, sizeof(double)*10);

However, the problem I am facing is that I cannot use pointer arithmetic to bind only a section of device_global, where the start of that section advances by an offset on each loop iteration.

I would like to do something like:

cudaBindTexture(0, texRef, device_global + offset * i, channelDesc, sizeof(double)*10);

It should be noted that the above approach does work if the offset is set to 0; with a nonzero offset, the pointer arithmetic somehow does not work.

Any help or other guidelines would be much appreciated.


It's bad practice to pass 0 or NULL as the first argument of cudaBindTexture. CUDA texture binding requires the bound pointer to be aligned. The required alignment is given by the cudaDeviceProp::textureAlignment device property.

cudaBindTexture can bind any device pointer to the texture. If the pointer is not aligned, the call writes, through its first argument, the offset in bytes from the nearest preceding aligned address. That offset must then be divided by the texel size and applied to the texture fetches in the kernel. If the pointer is not aligned and the first argument is NULL, the function call fails.

Binding should be done as:

size_t texture_offset = 0;
cudaBindTexture(&texture_offset, texRef, device_global + offset * i, channelDesc, sizeof(double)*10);
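
To make the offset arithmetic concrete, here is a minimal host-side sketch in plain C++. It only models the splitting of an address into an aligned base plus a remainder; it is not the runtime's actual implementation. The names TextureWindow and split_address are hypothetical, and the alignment value is assumed to be a power of two standing in for cudaDeviceProp::textureAlignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: an address split into an aligned base
// plus the remainder that a bind call would have to report.
struct TextureWindow {
    std::uintptr_t aligned_base;  // nearest preceding aligned address
    std::size_t byte_offset;      // distance from aligned_base in bytes
    std::size_t texel_offset;     // byte_offset expressed in elements
};

// `alignment` stands in for cudaDeviceProp::textureAlignment and is
// ASSUMED to be a power of two. `texel_size` is the element size in bytes.
inline TextureWindow split_address(std::uintptr_t address,
                                   std::size_t alignment,
                                   std::size_t texel_size) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(texel_size != 0);
    TextureWindow w;
    // The low bits of the address are the distance past the aligned base.
    w.byte_offset = static_cast<std::size_t>(address & (alignment - 1));
    w.aligned_base = address - w.byte_offset;
    // Kernels index in elements, so convert bytes to texels.
    w.texel_offset = w.byte_offset / texel_size;
    return w;
}
```

With an assumed alignment of 512 bytes, a pointer 800 bytes past an aligned base splits into a base 512 bytes further on plus 288 bytes, i.e. 36 doubles. A kernel reading element k of the window would then fetch index k + 36.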


The start address of the texture memory must be aligned. You can't bind an arbitrary portion of memory, only one that is properly aligned, because of how the high-performance texture hardware works internally.

One solution would be to use pitched memory and, instead of having a very small texture, have several big ones, each starting at an aligned row of the matrix.
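
As a rough illustration of the aligned-row idea, the host-side C++ sketch below pads each row to a multiple of an alignment so that every row begins on an aligned byte offset. The helper names padded_pitch and row_start are hypothetical, and the rounding rule is an assumption for illustration only; in CUDA, cudaMallocPitch chooses the pitch itself and returns it.

```cpp
#include <cassert>
#include <cstddef>

// Round a row width in bytes up to the next multiple of `alignment`.
// ASSUMPTION: this rounding is illustrative; the real pitch is whatever
// the allocator reports.
inline std::size_t padded_pitch(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment != 0);
    return (row_bytes + alignment - 1) / alignment * alignment;
}

// Byte offset of the start of row `row` from the start of the allocation.
inline std::size_t row_start(std::size_t row, std::size_t pitch) {
    return row * pitch;
}
```

For example, a row of 10 doubles (80 bytes) padded to an assumed 512-byte alignment occupies 512 bytes, so row 3 starts at byte 1536, which is itself a multiple of 512.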

I am guessing here, but I think that with a texture as small as sizeof(double)*10, setting up the binding takes longer than reading the data through it.

How big is the total matrix?


I don't believe it is possible to do what you are trying to do. I suspect there is some behind the scenes address translation that means that if the pointer you pass to the binding call isn't already known to the runtime memory manager and suitably aligned to a page boundary, it won't permit a texture to be bound to the address.

It might be better to bind the whole array to the texture and then pass an indexing offset into each kernel to be used in the texture fetch.
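
The indexing approach can be sketched on the host in plain C++ as follows. This is only a model: WholeArrayView and sum_window are hypothetical names, the vector stands in for a texture bound once over the whole array, and on the device the read would be a texture fetch inside a kernel rather than a vector access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a texture bound once over the entire array.
struct WholeArrayView {
    const std::vector<double>* data;
    // Models a single fetch at an absolute element index.
    double fetch(std::size_t index) const {
        assert(index < data->size());
        return (*data)[index];
    }
};

// Stand-in for one kernel launch: reads `count` elements starting at
// `base`. The base index is an ordinary argument, so nothing is rebound
// between iterations.
inline double sum_window(const WholeArrayView& view,
                         std::size_t base,
                         std::size_t count) {
    double total = 0.0;
    for (std::size_t k = 0; k < count; ++k) {
        total += view.fetch(base + k);
    }
    return total;
}
```

In the loop from the question, iteration i would pass offset * i as the base, so the alignment of the sub-portion never comes into play; only the start of the whole array has to be aligned.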
