Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I have read conflicting opinions, and I still haven't managed to write a benchmark to figure it out.
Only on devices with an architecture prior to Fermi, that is, with compute capability below 2.0, where the integer multiplier is 24 bits wide.
On CUDA devices with compute capability >= 2.0 the integer multiplier is 32 bits wide, so __umul24 will be slower instead of faster. The reason is that the 24-bit operation has to be emulated with 32-bit instructions.
The question now is: is it worth the effort for the speed gain? Probably not.
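Since the question mentions not being able to put a benchmark together, here is a rough sketch of one, not a definitive measurement. The names `MulChain`, `TimeVariantMs` and `BenchmarkMultiply` are made up for this example, and the launch sizes are arbitrary assumptions. It times a dependent chain of multiplies with CUDA events, once with the native `*` and once with `__umul24`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dependent chain of multiplies, so the work cannot be hoisted out of the loop.
// UseMul24 selects __umul24 or the native 32-bit multiply at compile time.
template <bool UseMul24>
__global__ void MulChain(unsigned int *out, int iters)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int x = tid + 1u;
    for (int i = 0; i < iters; ++i) {
        x &= 0xFFFFFFu;  // __umul24 only uses the low 24 bits of each operand
        x = (UseMul24 ? __umul24(x, x) : x * x) + (unsigned int)i;
    }
    out[tid] = x;  // store the result so the loop is not optimized away
}

// Times one launch of the chosen variant (after a warm-up launch), in milliseconds.
template <bool UseMul24>
float TimeVariantMs(unsigned int *d_out, int blocks, int threads, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    MulChain<UseMul24><<<blocks, threads>>>(d_out, iters);  // warm-up
    cudaEventRecord(start);
    MulChain<UseMul24><<<blocks, threads>>>(d_out, iters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical helper: fills both timings, returns false on a CUDA error.
bool BenchmarkMultiply(float *nativeMs, float *mul24Ms)
{
    const int blocks = 256, threads = 256, iters = 1 << 16;  // arbitrary sizes
    unsigned int *d_out = NULL;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(unsigned int)) != cudaSuccess)
        return false;

    *nativeMs = TimeVariantMs<false>(d_out, blocks, threads, iters);
    *mul24Ms  = TimeVariantMs<true>(d_out, blocks, threads, iters);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess) &&
              (cudaGetLastError() == cudaSuccess);
    cudaFree(d_out);
    return ok;
}
```

Call `BenchmarkMultiply` from `main` and print the two values with `printf`. Build with `nvcc` for the architecture you actually care about, and treat the numbers with caution: the compiler is free to transform either variant, so it is worth checking the generated code before drawing conclusions.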
Just wanted to chime in with a slightly different opinion than Ashwin/fabrizioM...
If you're just trying to teach yourself CUDA, their answer is probably more or less acceptable. But if you're actually trying to deploy a production-grade app in a commercial or research setting, that sort of attitude is generally not acceptable, unless you are absolutely sure that your end users' hardware (or yours, if you're the end user) is Fermi or later.
More likely, there are many users running CUDA on legacy machines who would benefit from functionality appropriate to their compute capability. And it's not as hard as Ashwin/fabrizioM make it out to be.
For example, in a code base I'm working on, I'm using:
//For devices prior to Fermi use __umul24; for Fermi and later use the
//native multiply. __umul24 only uses the low 24 bits of each operand.
__device__ inline unsigned int MultiplyFermi(unsigned int a, unsigned int b)
{ return a*b; }
__device__ inline unsigned int MultiplyAddFermi(unsigned int a, unsigned int b,
                                                unsigned int c)
{ return a*b+c; }
__device__ inline unsigned int MultiplyOld(unsigned int a, unsigned int b)
{ return __umul24(a,b); }
__device__ inline unsigned int MultiplyAddOld(unsigned int a, unsigned int b,
                                              unsigned int c)
{ return __umul24(a,b)+c; }
//Maximum Occupancy =
//16384
//Take the struct by reference so the caller actually sees the result.
void GetComputeCharacteristics(ComputeCapabilityLimits_t &MyCapability)
{
    cudaDeviceProp DeviceProperties;
    cudaGetDeviceProperties(&DeviceProperties, 0);
    //Encode major.minor as a double, e.g. 1.3 or 2.0.
    MyCapability.ComputeCapability =
        double(DeviceProperties.major) + double(DeviceProperties.minor)*0.1;
}
Now there IS a downside here. What is it?
Well, for any kernel in which you use a multiplication, you must have two different versions of the kernel.
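To make the "two versions" idea concrete, here is a minimal sketch of one way to get both versions from a single source, assuming templates are acceptable in your code. This is not the code from my project: `MultiplyAdd`, `ScaleOffsetKernel` and `ScaleOffset` are hypothetical names invented for the example. A `bool` template parameter selects the multiply, and the host picks the instantiation from the device's major compute capability.

```cuda
#include <cuda_runtime.h>

// One helper, two instantiations: __umul24 for pre-Fermi, native multiply otherwise.
// With UseMul24 == true, a and b must fit in 24 bits.
template <bool UseMul24>
__device__ inline unsigned int MultiplyAdd(unsigned int a, unsigned int b,
                                           unsigned int c)
{
    return (UseMul24 ? __umul24(a, b) : a * b) + c;
}

// Example kernel (hypothetical): data[i] = data[i] * scale + offset.
template <bool UseMul24>
__global__ void ScaleOffsetKernel(unsigned int *data, int n,
                                  unsigned int scale, unsigned int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = MultiplyAdd<UseMul24>(data[i], scale, offset);
}

// Host-side dispatch on device 0. Assumes n > 0 and d_data is a device pointer.
// Returns false if the device query or the kernel fails.
bool ScaleOffset(unsigned int *d_data, int n,
                 unsigned int scale, unsigned int offset)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    if (prop.major >= 2)   // Fermi or later: native 32-bit multiply
        ScaleOffsetKernel<false><<<blocks, threads>>>(d_data, n, scale, offset);
    else                   // compute capability 1.x: 24-bit multiply
        ScaleOffsetKernel<true><<<blocks, threads>>>(d_data, n, scale, offset);

    return cudaDeviceSynchronize() == cudaSuccess;
}
```

Both kernel versions still exist in the binary, so the downside above has not gone away. The template only keeps the two from drifting apart in the source.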
Is it worth it?
Well, considering that this is a trivial copy-and-paste job and you're gaining efficiency, yes, in my opinion. After all, CUDA isn't the easiest form of programming conceptually (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?
If performance is critical, it's negligent to code lazily and either abandon legacy devices or deliver less-than-optimal execution, unless you're absolutely confident you can drop legacy support for your deployment (allowing optimal execution).
For most, it makes sense to provide legacy support, given that it's not that hard once you realize how to do it. Be aware that this means you will also need to update your code to adjust to changes in future architectures.
Generally, you should note the latest compute capability the code was targeted at when it was written, and perhaps print some sort of warning to users if their device has a compute capability beyond what your latest implementation is optimized for.
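A minimal sketch of such a warning is below. The constants `kTunedMajor`/`kTunedMinor` and the function names are assumptions made up for illustration; set the target to whatever your code was last tuned for. It compares major and minor as integers rather than as a single floating-point value, which avoids rounding surprises.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: the newest compute capability this code was tuned for (2.0 here).
const int kTunedMajor = 2;
const int kTunedMinor = 0;

// Returns true when (major, minor) is newer than the tuned target.
bool IsNewerThanTuned(int major, int minor)
{
    return major > kTunedMajor ||
           (major == kTunedMajor && minor > kTunedMinor);
}

// Prints a warning to stderr if the given device is newer than the tuned target.
// Returns false if the device could not be queried.
bool WarnIfUntunedDevice(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;

    if (IsNewerThanTuned(prop.major, prop.minor)) {
        fprintf(stderr,
                "Warning: device %d (%s) has compute capability %d.%d, "
                "but this code was last tuned for %d.%d.\n",
                device, prop.name, prop.major, prop.minor,
                kTunedMajor, kTunedMinor);
    }
    return true;
}
```

Calling `WarnIfUntunedDevice(0)` once at startup, before launching any kernels, would be enough for a single-GPU program.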