CUDA extending std::vector to manage host and device data [closed]

Closed. This question is opinion-based. It is not currently accepting answers.

Closed 7 years ago.

I learned that std::vector is a nice wrapper around raw arrays in C++, so I started using it to manage host data in my CUDA app [1]. Since allocating and copying things by hand makes the code more complex and less readable, I thought about extending std::vector. Since I'm not very experienced, I would like to know what you think about it, in particular whether it's done correctly (e.g. the destructor of std::vector is called implicitly, right?) and whether you consider it a good idea.

I wrote a small example illustrating this:

#include <vector>
#include <cuda_runtime.h>

#include <cstdio>

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Wrapper around CUDA memory
template<class T>
class UniversalVector: public std::vector<T>
{
    T* devicePtr_;
    bool allocated;

public:

    // Constructor
    UniversalVector(unsigned int length)
        : std::vector<T>(length),
          devicePtr_(0),       // no device allocation yet
          allocated(false)
    {}

    // Destructor
    ~UniversalVector()
     {
        if(allocated)
            cudaFree(devicePtr_);
     }

    cudaError_t allocateDevice()
    {
        // Release any previous allocation with cudaFree, not free()
        if(allocated) cudaFree(devicePtr_);
        cudaError_t err =
            cudaMalloc((void**)&devicePtr_, sizeof(T) * this->size());
        allocated = true;
        return err;
    }

    cudaError_t loadToDevice()
    {
        return cudaMemcpy(devicePtr_, &(*this)[0], sizeof(T) * this->size(),
            cudaMemcpyHostToDevice);
    }

    cudaError_t loadFromDevice()
    {
        return cudaMemcpy(&(*this)[0], devicePtr_, sizeof(T) * this->size(),
            cudaMemcpyDeviceToHost);
    }

    // Accessors

    inline T* devicePtr() {
        return devicePtr_;
    }

};

__global__ void kernel(int* a)
{
    int i = threadIdx.x;
    printf("%i\n", a[i]);
}

int main()
{
    UniversalVector<int> vec(3);
    vec.at(0) = 1;
    vec.at(1) = 2;
    vec.at(2) = 3;

    vec.allocateDevice();
    vec.loadToDevice();

    kernel<<<1, 3>>>(vec.devicePtr());

    // Synchronize so the device-side printf output is flushed before exit
    cudaDeviceSynchronize();

    checkCUDAError("Error when running kernel");

    return 0;
}

[1] CUDA distinguishes between host and device memory, where host memory is the memory accessible by the CPU and device memory is the memory on the GPU. The programmer has to move data from the host to the device and back.


You might want to have a look at Thrust. It provides STL-like containers (host_vector and device_vector) for CUDA code.
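For illustration, here is a minimal sketch of how the example from the question might look with Thrust (the kernel is the same one as above; error handling omitted):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel(int* a)
{
    printf("%i\n", a[threadIdx.x]);
}

int main()
{
    thrust::host_vector<int> h(3);
    h[0] = 1; h[1] = 2; h[2] = 3;

    // The assignment performs the host-to-device copy
    thrust::device_vector<int> d = h;

    // Pass a raw device pointer to the kernel
    kernel<<<1, 3>>>(thrust::raw_pointer_cast(&d[0]));
    cudaDeviceSynchronize();

    return 0;
}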


The biggest problem I see with this is that it doesn't really help manage the GPU side of things very much, and it obfuscates a number of very important pieces of information in the process.

While the container class records whether the device pointer has been allocated, there is no way of knowing whether the contents of the host container have been copied to the GPU memory it holds, or whether the GPU memory has been copied back to the host. As a result you will have to call the loadToDevice() and loadFromDevice() methods every time you wish to use the container in either host or device code. That probably means unnecessary PCI-e memory transfers at least some of the time. And because you have chosen to wrap only the synchronous CUDA memory copy routines, there will be host blocking every time you do this.
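For comparison, a hedged sketch of the asynchronous variant using cudaMemcpyAsync with a stream (truly asynchronous copies also require pinned host memory, allocated with cudaMallocHost):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel(int* a)
{
    printf("%i\n", a[threadIdx.x]);
}

int main()
{
    const int n = 3;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host memory is required for the copy to be truly asynchronous
    int* host;
    cudaMallocHost((void**)&host, n * sizeof(int));
    for (int i = 0; i < n; ++i) host[i] = i + 1;

    int* device;
    cudaMalloc((void**)&device, n * sizeof(int));

    // Returns immediately; copy and kernel are queued on the stream
    cudaMemcpyAsync(device, host, n * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    kernel<<<1, n, 0, stream>>>(device);

    // Block only when the results are actually needed
    cudaStreamSynchronize(stream);

    cudaFree(device);
    cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return 0;
}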

Ultimately I don't see much net gain in this idea over a well-designed set of helper routines that abstract away the ugliest bits of the CUDA APIs and operate on standard STL types.


I'll extend David Rodríguez - dribeas's comment a bit:

The question of why you should prefer composition over inheritance (even though it requires additional façade work) has been asked and answered many times. A good answer is this one: Prefer composition over inheritance?

The determining factor is the interface: Do you want all or some methods of the underlying class?

In your case, the std::vector methods that modify the size of the vector, like resize, push_back, pop_back, erase, insert, etc., are likely to cause mayhem if called between loadToDevice and loadFromDevice.

In your question you state that you need a wrapper around raw arrays, and those are of fixed size! Therefore you might very well use a std::vector internally in your wrapper class (composition!), but you would need to hide all of its dynamic resizing functionality.
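A minimal sketch of what that composition could look like (DeviceBackedArray is a hypothetical name, and error handling is omitted for brevity):

#include <vector>
#include <cuda_runtime.h>

// The std::vector is a private member, so none of its resizing
// operations (resize, push_back, ...) leak into the interface.
template<class T>
class DeviceBackedArray
{
    std::vector<T> host_;
    T* device_;

    // Non-copyable: the compiler-generated copy would double-free device_
    DeviceBackedArray(const DeviceBackedArray&);
    DeviceBackedArray& operator=(const DeviceBackedArray&);

public:
    explicit DeviceBackedArray(unsigned int length)
        : host_(length), device_(0)
    {
        cudaMalloc((void**)&device_, sizeof(T) * host_.size());
    }

    ~DeviceBackedArray() { cudaFree(device_); }

    // Expose only the fixed-size part of std::vector's interface
    T& at(unsigned int i) { return host_.at(i); }
    unsigned int size() const { return host_.size(); }
    T* devicePtr() { return device_; }

    cudaError_t loadToDevice()
    {
        return cudaMemcpy(device_, &host_[0], sizeof(T) * host_.size(),
                          cudaMemcpyHostToDevice);
    }

    cudaError_t loadFromDevice()
    {
        return cudaMemcpy(&host_[0], device_, sizeof(T) * host_.size(),
                          cudaMemcpyDeviceToHost);
    }
};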


You are better off having functions like allocateDevice and loadToDevice as free functions rather than as members of a class inherited from std::vector. It could save you a great deal of work when integrating other libraries/classes with your code. Overall it looks good.
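For example, a hedged sketch of such free functions, templated over the element type (the names mirror the question's methods but are purely illustrative):

#include <vector>
#include <cuda_runtime.h>

// Allocate device memory sized to match an existing host vector
template<class T>
cudaError_t allocateDevice(T** devPtr, const std::vector<T>& v)
{
    return cudaMalloc((void**)devPtr, sizeof(T) * v.size());
}

// Copy the host vector's contents to a device allocation
template<class T>
cudaError_t loadToDevice(T* devPtr, const std::vector<T>& v)
{
    return cudaMemcpy(devPtr, &v[0], sizeof(T) * v.size(),
                      cudaMemcpyHostToDevice);
}

// Copy device data back into the host vector
template<class T>
cudaError_t loadFromDevice(std::vector<T>& v, const T* devPtr)
{
    return cudaMemcpy(&v[0], devPtr, sizeof(T) * v.size(),
                      cudaMemcpyDeviceToHost);
}

Because these helpers take a plain std::vector, they also work unchanged with vectors coming from any other library or class you need to integrate with.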
