Which algorithms benefit most from fused multiply add?

fma(a,b,c) is equivalent to a*b+c, except it doesn't round the intermediate result.

Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding?

It's not obvious, since the rounding after the multiplication (which FMA avoids) tends to be less problematic than the rounding after the addition (which it doesn't).
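
For concreteness, here is a minimal C99 sketch (my own illustration, not part of the original question) of what the single rounding buys: with c chosen as the negated rounded product, the two-step expression a*b+c cancels to zero, while fma returns the exact rounding error of a*b.

    /* Build with FP contraction disabled (e.g. -ffp-contract=off on GCC/Clang),
       otherwise the compiler may itself fuse a*b + c into an FMA. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 / 3.0, b = 3.0;
        double p = a * b;              /* rounded product */
        double c = -p;
        printf("a*b + c      = %g\n", a * b + c);    /* 0: the error was rounded away */
        printf("fma(a, b, c) = %g\n", fma(a, b, c)); /* tiny nonzero: exact error of a*b */
        return 0;
    }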


taw hit on one important example; more generally, FMA allows library writers to efficiently implement many other floating-point operations with correct rounding.

For example, a platform that has an FMA can use it to implement correctly rounded divide and square root (PPC and Itanium took this approach), which lets the FPU be basically a single-purpose FMA machine. Peter Tang and John Harrison (Intel), and Peter Markstein (HP) have some papers that explain this use if you're curious.
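
To make the shape of those algorithms concrete, here is a hedged sketch (mine, not taken from the papers above) of a single Newton-Raphson refinement step for division; a real correctly rounded divide is a carefully proven sequence of such steps, but the FMA-computed residual is the key ingredient.

    #include <math.h>

    /* q   : current approximation to a/b
       rcp : current approximation to 1/b */
    double refine_quotient(double a, double b, double q, double rcp) {
        double r = fma(-q, b, a);   /* residual a - q*b, computed with a single rounding */
        return fma(r, rcp, q);      /* corrected quotient q + r*(1/b) */
    }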

The example taw gave is more broadly useful than just in tracking error bounds. It allows you to represent the product of two floating point numbers as a sum of two floating point numbers without any rounding error; this is quite useful in implementing correctly-rounded floating-point library functions. Jean-Michel Muller's book or the papers on crlibm would be good starting places to learn more about these uses.

FMA is also broadly useful in argument reduction in math-library style routines for certain types of arguments; when one is doing argument reduction, the goal of the computation is often a term of the form (x - a*b), where (a*b) is very nearly equal to x itself; in particular, the result is often on the order of the rounding error in the (a*b) term, if this is computed without an FMA. I believe that Muller has also written some about this in his book.
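
A hedged sketch of that (x - a*b) pattern: here C stands in for a constant such as pi/2, and production libm code also splits the constant Cody-Waite style, so this only shows where the FMA goes.

    #include <math.h>

    double reduce(double x, double C) {
        double n = nearbyint(x / C);   /* which multiple of C to subtract */
        return fma(-n, C, x);          /* x - n*C with a single rounding */
    }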


The only thing I have found so far is "error-free transformations". For any floating-point numbers, the errors of a+b, a-b, and a*b are themselves exactly representable floating-point numbers (in round-to-nearest mode, assuming no overflow/underflow etc. etc.).

The addition (and obviously subtraction) error is easy to compute: if abs(a) >= abs(b), the error is exactly b-((a+b)-a) (2 flops, or 4-5 if we don't know which operand is bigger). The multiplication error is trivial to compute with fma - it is simply fma(a,b,-a*b). Without fma it takes 16 flops of rather nasty code, and fully generic emulation of a correctly rounded fma is even slower than that.
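
As a sketch (assuming round-to-nearest and no overflow/underflow), those two transformations look like this in C:

    #include <math.h>

    /* Requires |a| >= |b|.  Then s + e == a + b exactly, with s = fl(a + b). */
    static void fast_two_sum(double a, double b, double *s, double *e) {
        *s = a + b;
        *e = b - (*s - a);
    }

    /* p + e == a * b exactly, with p = fl(a * b). */
    static void two_prod(double a, double b, double *p, double *e) {
        *p = a * b;
        *e = fma(a, b, -*p);
    }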

An extra 16 flops of error tracking per flop of real computation would be huge overkill, but with just 1-5 pipeline-friendly flops it's quite reasonable. For many algorithms built on these transformations, that 50%-200% overhead of error tracking and compensation gives a result as accurate as if all calculations had been done with twice as many bits, avoiding ill-conditioning in many cases.

Interestingly, fma isn't ever used in these algorithms to compute results, just to find errors, because finding the error of an fma is as slow as finding the error of a multiplication was without fma.

Relevant keywords to search would be "compensated Horner scheme" and "compensated dot product", with Horner scheme benefiting a lot more.
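
A hedged sketch of a compensated Horner scheme in that spirit (my own illustration; the papers found under those keywords give the analyzed versions): each multiply and add error is captured with the error-free transformations above, and the error terms are themselves summed by an ordinary Horner recurrence.

    #include <math.h>

    /* coeffs[0] is the constant term, coeffs[n] the leading coefficient. */
    double compensated_horner(const double *coeffs, int n, double x) {
        double r = coeffs[n];   /* main Horner recurrence      */
        double c = 0.0;         /* accumulated correction term */
        for (int i = n - 1; i >= 0; --i) {
            /* two_prod: ph + ep == r * x exactly */
            double ph = r * x;
            double ep = fma(r, x, -ph);
            /* two_sum (branch-free): s + es == ph + coeffs[i] exactly */
            double s  = ph + coeffs[i];
            double bb = s - ph;
            double es = (ph - (s - bb)) + (coeffs[i] - bb);
            r = s;
            c = c * x + (ep + es);   /* Horner on the error terms */
        }
        return r + c;                /* compensated result */
    }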


The primary benefit of FMA is that it can be twice as fast. Rather than taking one cycle for the multiply and then one cycle for the add, the FPU can issue both operations in the same cycle. Obviously, most algorithms will benefit from faster operations.


Some examples: Vector dot products. Fourier transforms. Digital signal processing. Polynomials. All sorts of things.

It's a question of optimization and hardware exploitation more than anything else. A sum of products is a very common requirement in numerical methods, and this way you give an explicit instruction to the compiler to do the operation fast, and perhaps with a little more precision. Unless I'm mistaken, the compiler is free to replace a=b*c+d with an FMA instruction, but it's also free not to (unless the standard calls for rounding, but real-world compilers routinely violate standards in small ways).
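
For instance, writing the fused operation explicitly via C99's fma() makes it unconditional, instead of depending on whether the compiler contracts acc + x[i]*y[i] (governed by the FP_CONTRACT pragma and flags such as -ffp-contract on GCC/Clang). A small sketch of the sum-of-products pattern:

    #include <math.h>
    #include <stddef.h>

    double dot(const double *x, const double *y, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; ++i)
            acc = fma(x[i], y[i], acc);   /* one rounding per term */
        return acc;
    }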


Off the top of my head - matrix multiplication, Newton's method, polynomial evaluation, numerical methods


As the Wikipedia entry for FMA explains pretty well, the algorithms that involve an accumulation of products benefit most from using FMA:

A fast FMA can speed up and improve the accuracy of 
many computations that involve the accumulation of products:

 * Dot product
 * Matrix multiplication
 * Polynomial evaluation (e.g., with Horner's rule)
 * Newton's method for evaluating functions.