
CUDA and MATLAB for loop optimization

Developer — https://www.devze.com 2023-01-29 13:50 Source: web

I'm going to attempt to optimize some code written in MATLAB, by using CUDA. I recently started programming CUDA, but I've got a general idea of how it works.

So, say I want to add two matrices together. In CUDA, I could write an algorithm that would utilize a thread to calculate the answer for each element in the result matrix. However, isn't this technique probably similar to what MATLAB already does? In that case, wouldn't the efficiency be independent of the technique and attributable only to the hardware level?
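For concreteness, here is a minimal sketch of the one-thread-per-element matrix addition the question describes. The kernel name, launch parameters, and variable names are illustrative, not taken from any MATLAB internals:

```cuda
// One thread computes one element of the result matrix.
// Matrices are stored flat; n = rows * cols total elements.
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard the last, partially filled block
        c[i] = a[i] + b[i];
}

// Example launch, covering all n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;  // round up
//   matAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```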


The technique might be similar, but remember that with CUDA you have hundreds of threads running simultaneously. If MATLAB is using threads and those threads are running on a quad-core CPU, you are only going to get 4 threads executed per clock cycle, while a couple of hundred threads might run on CUDA in that same clock cycle.

So to answer your question: YES, the efficiency in this example is independent of the technique and attributable only to the hardware.


The answer is unequivocally yes; the efficiency gains are at the hardware level. I don't know exactly how MATLAB works, but the advantage of CUDA is that multiple threads can be executed simultaneously, unlike in MATLAB.

On a side note, if your problem is small, or requires many read/write operations, CUDA will probably only be an additional headache.


CUDA has official support for MATLAB.

[need link]

You can make use of MEX files to run code on the GPU from MATLAB.

The bottleneck is the speed at which data is transferred from CPU RAM to the GPU. So if the transfers are minimized and done in large chunks, the speedup can be great.


For simple things, it's better to use the gpuArray support in the MATLAB Parallel Computing Toolbox. You can check it here: http://www.mathworks.de/de/help/distcomp/using-gpuarray.html

For things like adding gpuArrays, multiplications, mins, maxes, etc., the implementation they use tends to be fine. But I did find that for batch operations on many small matrices, like abs(y - H*x).^2, you're better off writing a small kernel that does it for you.
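As a sketch of the gpuArray workflow (assuming the Parallel Computing Toolbox and a supported GPU; the sizes and operations are illustrative), note that data crosses the CPU-GPU boundary only at `gpuArray` and `gather`, which is why chaining operations on the device is where the speedup comes from:

```matlab
% Copy inputs to the GPU once.
A = gpuArray(rand(4096));
B = gpuArray(rand(4096));

C = A + B;          % runs on the GPU, no host transfer
D = abs(C).^2;      % still on the GPU

result = gather(D); % single copy back to host RAM
```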

