Why is the hypot() function so slow?


I did some testing with C++ hypot() and Java Math.hypot. They both seem to be significantly slower than sqrt(a*a + b*b). Is that because of better precision? What method does the hypot function use to calculate the hypotenuse? Surprisingly, I couldn't find any indication of poor performance in the documentation.


It's not a simple sqrt function. You should check this link for the implementation of the algorithm: http://www.koders.com/c/fid7D3C8841ADC384A5F8DE0D081C88331E3909BF3A.aspx

It has a while loop to check for convergence:

/* Slower but safer algorithm due to Moler and Morrison.  Never
   produces any intermediate result greater than roughly the
   larger of X and Y.  Should converge to machine-precision
   accuracy in 3 iterations.  */

double r = ratio*ratio, t, s, p = abig, q = asmall;

do {
    t = 4. + r;
    if (t == 4.)
        break;
    s = r / t;
    p += 2. * s * p;
    q *= s;
    r = (q / p) * (q / p);
} while (1);

EDIT (Update from J.M):

The original Moler-Morrison paper is C. Moler and D. Morrison, "Replacing Square Roots by Pythagorean Sums" (IBM Journal of Research and Development, 1983), and a nice follow-up is due to Dubrulle: A. A. Dubrulle, "A Class of Numerical Methods for the Computation of Pythagorean Sums" (same journal).
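
For reference, here is a minimal self-contained C sketch of the Moler-Morrison iteration. The function name mm_hypot and the argument normalization are my additions around the loop excerpted above; NaN and infinity handling are omitted, as in Delorie's original:

#include <math.h>

/* Sketch of the Moler-Morrison iteration (mm_hypot is a hypothetical
   name). The invariant p*p + q*q == x*x + y*y is maintained while p
   grows toward the result and q shrinks toward 0; convergence is
   cubic, reaching double precision in about 3 iterations. */
double mm_hypot (double x, double y)
{
    double abig = fabs (x), asmall = fabs (y);
    if (abig < asmall) { double tmp = abig; abig = asmall; asmall = tmp; }
    if (asmall == 0.0) return abig; /* also handles hypot(0,0) */

    double p = abig, q = asmall;
    double r = (q / p) * (q / p);
    do {
        double t = 4. + r;
        if (t == 4.)
            break;       /* r has become negligible: converged */
        double s = r / t;
        p += 2. * s * p; /* p grows toward sqrt(x*x + y*y) */
        q *= s;          /* q shrinks toward 0 */
        r = (q / p) * (q / p);
    } while (1);
    return p;
}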


Here is a faster implementation whose results are also closer to java.lang.Math.hypot. (NB: for Delorie's implementation, handling of NaN and +/-Infinity inputs would need to be added.)

// Powers of two stored directly as IEEE-754 bit patterns; they are used to
// scale the arguments so that x*x + y*y can neither overflow nor underflow.
private static final double TWO_POW_450 = Double.longBitsToDouble(0x5C10000000000000L);
private static final double TWO_POW_N450 = Double.longBitsToDouble(0x23D0000000000000L);
private static final double TWO_POW_750 = Double.longBitsToDouble(0x6ED0000000000000L);
private static final double TWO_POW_N750 = Double.longBitsToDouble(0x1110000000000000L);
public static double hypot(double x, double y) {
    x = Math.abs(x);
    y = Math.abs(y);
    if (y < x) {
        double a = x;
        x = y;
        y = a;
    } else if (!(y >= x)) { // Testing if we have some NaN.
        if ((x == Double.POSITIVE_INFINITY) || (y == Double.POSITIVE_INFINITY)) {
            return Double.POSITIVE_INFINITY;
        } else {
            return Double.NaN;
        }
    }
    if (y-x == y) { // x too small to subtract from y
        return y;
    } else {
        double factor;
        if (x > TWO_POW_450) { // 2^450 < x < y
            x *= TWO_POW_N750;
            y *= TWO_POW_N750;
            factor = TWO_POW_750;
        } else if (y < TWO_POW_N450) { // x < y < 2^-450
            x *= TWO_POW_750;
            y *= TWO_POW_750;
            factor = TWO_POW_N750;
        } else {
            factor = 1.0;
        }
        return factor * Math.sqrt(x*x+y*y);
    }
}


I found Math.hypot() to be abysmally slow. Using the same algorithm, I was able to code a fast Java version that produces identical results; it can be used as a drop-in replacement.

/**
 * <b>hypot</b>
 * @param x
 * @param y
 * @return sqrt(x*x + y*y) without intermediate overflow or underflow. 
 * @Note {@link Math#hypot} is unnecessarily slow.  This returns the identical result to 
 * Math.hypot with reasonable run times (~40 nsec vs. 800 nsec). 
 * <p>The logic for computing z is copied from "Freely Distributable Math Library" 
 * fdlibm's e_hypot.c. This minimizes rounding error to provide 1 ulp accuracy.
 */
public static double hypot(double x, double y) {

    if (Double.isInfinite(x) || Double.isInfinite(y)) return Double.POSITIVE_INFINITY;
    if (Double.isNaN(x) || Double.isNaN(y)) return Double.NaN;

    x = Math.abs(x);
    y = Math.abs(y);

    if (x < y) {
        double d = x;
        x = y;
        y = d;
    }

    int xi = Math.getExponent(x);
    int yi = Math.getExponent(y);

    if (xi > yi + 27) return x; // y negligible: (y/x)^2 < 2^-54, below 1 ulp of x

    int bias = 0;
    if (xi > 510 || xi < -511) {
        bias = xi;
        x = Math.scalb(x, -bias);
        y = Math.scalb(y, -bias);           
    }


    // translated from "Freely Distributable Math Library" e_hypot.c to minimize rounding errors
    double z = 0; 
    if (x > 2*y) { 
        double x1 = Double.longBitsToDouble(Double.doubleToLongBits(x) & 0xffffffff00000000L);
        double x2 = x - x1;
        z = Math.sqrt(x1*x1 + (y*y + x2*(x+x1)));
    } else {
        double t = 2 * x;
        double t1 = Double.longBitsToDouble(Double.doubleToLongBits(t) & 0xffffffff00000000L);
        double t2 = t - t1;
        double y1 = Double.longBitsToDouble(Double.doubleToLongBits(y) & 0xffffffff00000000L);
        double y2 = y - y1;
        double x_y = x - y;
        z = Math.sqrt(t1*y1 + (x_y*x_y + (t1*y2 + t2*y))); // Note: 2*x*y <= x*x + y*y
    }

    if (bias == 0) {
        return z; 
    } else {
        return Math.scalb(z, bias);
    }
}


hypot() incurs overhead for avoiding overflow and underflow in intermediate computations, when compared to the naive implementation sqrt(a*a+b*b). This involves scaling operations. In older implementations, the scaling may use division, which is itself a rather slow operation. Even where scaling is accomplished by direct exponent manipulation, it could be rather slow on older processor architectures as transferring data between integer ALUs and floating-point units was rather slow (e.g. involving a round-trip through memory). Data-driven branches are also common to various scaling approaches; these are hard to predict.
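
As a minimal illustration of the overflow issue (a hypothetical test program; any arguments around 1e200 will do):

#include <math.h>
#include <stdio.h>

int main (void)
{
    double a = 1e200, b = 1e200;
    /* a*a overflows to infinity, although the true result ~1.414e200
       is comfortably within double range */
    printf ("naive: %g\n", sqrt (a * a + b * b)); /* prints: inf */
    printf ("hypot: %g\n", hypot (a, b));         /* prints: ~1.41421e+200 */
    return 0;
}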

A frequent goal of math library designers is to achieve faithful rounding for simple math functions like hypot(), that is, a maximum error of less than 1 ulp. This improvement in accuracy over the naive solution means that intermediate computations must be carried out with higher than native precision. A classical method is to split operands into "high" and "low" portions and simulate the extended precision. This adds overhead for the splitting and increases the number of floating-point operations. Lastly, the ISO C99 specification for hypot() (later adopted into the C++ standard) prescribes handling of NaNs and infinities that does not naturally fall out of a straightforward computation.
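
As a sketch of the classical splitting technique (the helper name split_hi is mine; the same masking of the low 32 bits appears in the Java port of fdlibm above): zeroing the low half of the encoding yields a "head" x1 whose square is exact in double precision, and the exact identity x*x = x1*x1 + x2*(x + x1), with x2 = x - x1, then allows the sum of squares to be accumulated with reduced rounding error.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* split_hi is a hypothetical helper: it clears the low 32 bits of the
   encoding, leaving a head with at most 21 significand bits, so that
   x1*x1 (at most 42 significand bits) is exactly representable. */
static double split_hi (double x)
{
    uint64_t bits;
    memcpy (&bits, &x, sizeof bits);
    bits &= 0xffffffff00000000ULL;
    memcpy (&x, &bits, sizeof bits);
    return x;
}

int main (void)
{
    double x  = 1.2345678901234567;
    double x1 = split_hi (x); /* head */
    double x2 = x - x1;       /* tail; exact, since x1 is a prefix of x */
    /* (x1+x2)^2 = x1*x1 + x2*(2*x1 + x2) = x1*x1 + x2*(x + x1) */
    printf ("x*x direct: %.17g\n", x * x);
    printf ("x*x split : %.17g\n", x1 * x1 + x2 * (x + x1));
    return 0;
}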

A representative example of older best practices is __ieee754_hypot() in FDLIBM. While it claims a maximum error of 1 ulp, quick testing against an arbitrary-precision library shows that it does not actually achieve that goal: errors of up to 1.16 ulp can be observed (e.g. for the arguments 0x1.74e729a505778p-1013, 0x0.017e77f6d0e0dp-1022). An even older design for a faithfully-rounded hypot(), by renowned floating-point expert William Kahan, does hold its error bound: across 50 billion random trials, the maximum observed error was 0.9917 ulp (for the argument pair 0x1.6d3e085a0fc89p+481, 0x1.62248c23285f1p+481):

#include <math.h>

/* Algorithm from William M. Kahan, "Interval Arithmetic Options in the
   Proposed IEEE Floating Point Arithmetic Standard." In: Karl L. E. Nickel
   (ed.), Interval Mathematics 1980, New York: Academic Press 1980, pp. 99-128
*/
double kahan_hypot (double a, double b)
{
    const double sqrt2      = 1.4142135623730951e+00; // 0x1.6a09e667f3bcdp+00
    const double sqrt2p1_hi = 2.4142135623730949e+00; // 0x1.3504f333f9de6p+01
    const double sqrt2p1_lo = 1.2537167179050217e-16; // 0x1.21165f626cdd5p-53
    double fa, fb, mn, mx, res, r, s, t;

    fa = fabs (a);
    fb = fabs (b);
    mx = fmax (fa, fb);
    mn = fmin (fa, fb);

    r = mx - mn;
    if (r <= mn) {
        r = r / mn;
        s = r * (r + 2.0);
        r = ((s / (sqrt2 + sqrt (2.0 + s)) + r) + sqrt2p1_lo) + sqrt2p1_hi;
    } else {
        r = mx / mn;
        r = r + sqrt (r * r + 1.0);
    }
    res = (mn / r) + mx;

    /* fixup hypot(0,0) */
    if (mx == 0) res = mx;

    /* check for special cases: infinity and NaN */
    t = a + b;
    if (!(fabs (t) <= INFINITY)) res = t; // isnan(t)
    if (mx == INFINITY) res = mx; // isinf(mx)
    return res;
}
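
A quick way to exercise this is a driver along the following lines (hypothetical; it reuses the worst-case argument pair quoted above and prints hex floats for easy ulp comparison against the library's hypot()):

#include <math.h>
#include <stdio.h>

double kahan_hypot (double a, double b); /* as defined above */

int main (void)
{
    double a = 0x1.6d3e085a0fc89p+481;
    double b = 0x1.62248c23285f1p+481;
    printf ("kahan_hypot: %a\n", kahan_hypot (a, b));
    printf ("hypot      : %a\n", hypot (a, b));
    return 0;
}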

Since this question was asked, two advances in processor architectures have enabled more efficient hypot() implementations. One is the introduction of the fused multiply-add (FMA) operation, which internally uses the full product for the addition and applies only a single rounding at the end. This often alleviates a need for simulating slightly higher precision in intermediate computation, and by combining two operations into one instruction also improves performance. The other architectural advance is allowing both floating-point and integer operations to operate on the same set of registers, making bit manipulation of floating-point operands cheap.

As a consequence, in modern high-performance math libraries hypot() typically has about half the throughput of sqrt(), as can be seen in the performance data Intel publishes for MKL, for example.

For one's own experiments, it is best to start with a single-precision implementation of hypot(), as this allows testing across a larger portion of the space of possible argument combinations. It also allows exploring trade-offs between speed and accuracy when using the low-precision reciprocal square root approximations provided by many modern processor architectures. Exemplary code for one such exploration is shown below.

Note that the requirements on NaN handling imposed by ISO C99 (and by extension, C++) on fmin() and fmax() do not always match up with the functionality of floating-point minimum / maximum machine instructions, making these functions slower than they could be. Since the code below handles NaNs separately, it should be possible to use such machine instructions directly. The code should be compiled with full optimization but also strict IEEE-754 compliance.

Based on testing with 50 billion random argument pairs, the maximum error observed with the code variants below is:

(USE_BEEBE == 1): 1.51 ulp
(USE_BEEBE == 0 && USE_SQRT == 1): 1.02 ulp
(USE_BEEBE == 0 && USE_SQRT == 0 && USE_2ND_ITERATION == 0): 2.77 ulp
(USE_BEEBE == 0 && USE_SQRT == 0 && USE_2ND_ITERATION == 1): 1.02 ulp

#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>
#include "immintrin.h"

uint32_t __float_as_uint32 (float a);
float __uint32_as_float (uint32_t a);
float sse_rsqrtf (float a);

float my_hypotf (float a, float b)
{
    float fa, fb, mn, mx, res, r, t;

    fa = fabsf (a);
    fb = fabsf (b);
    mx = fmaxf (fa, fb);
    mn = fminf (fa, fb);

#if USE_BEEBE
    /* Nelson H. F. Beebe, "The Mathematical Function Computation Handbook."
       Springer 2017
    */
    float s, c;

    r = mn / mx;
    t = fmaf (r, r, 1.0f);
    s = sqrtf (t);
    c = fmaf (-s, s, t) / (s + s);
    res = fmaf (mx, s, mx * c);
    if (mx == 0) res = mx; // fixup hypot(0,0)

#else // USE_BEEBE

    float scale_in, scale_out, s, v, w;
    uint32_t expo;

    /* compute scale factors */
    expo = __float_as_uint32 (mx) & 0xfc000000;
    scale_in = __uint32_as_float (0x7e000000 - expo);
    scale_out = __uint32_as_float (expo + 0x01000000);
    
    /* scale mx towards unity */
    mn = mn * scale_in;
    mx = mx * scale_in;

    /* mx in [2**-23, 2**6) */
    s = fmaf (mx, mx, mn * mn); // 0.75 ulp
#if USE_SQRT
    w = sqrtf (s);
#else // USE_SQRT
    r = sse_rsqrtf (s);
    /* use A. Schoenhage's coupled iteration for the square root */
    v = r * 0.5f;
    w = r * s;
    w = fmaf (fmaf (w, -w, s), v, w);
#if USE_2ND_ITERATION
    v = fmaf (fmaf (r, -w, 1), v, v);
    w = fmaf (fmaf (w, -w, s), v, w);
#endif // USE_2ND_ITERATION
    if (mx == 0) w = mx; // fixup hypot(0,0)
#endif // USE_SQRT

    /* reverse previous scaling */
    res = w * scale_out;
#endif // USE_BEEBE

    /* check for special cases: infinity and NaN */
    t = a + b;
    if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
    if (mx == INFINITY) res = mx; // isinf(mx)
    return res;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float sse_rsqrtf (float a)
{
    __m128 b, t;
    float res;
    int old_mxcsr;
    old_mxcsr = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE (_MM_FLUSH_ZERO_OFF); 
    _MM_SET_ROUNDING_MODE (_MM_ROUND_NEAREST);
    b = _mm_set_ss (a);
    t = _mm_rsqrt_ss (b);
    _mm_store_ss (&res, t);
    _mm_setcsr (old_mxcsr);
    return res;
}

A simple FMA-based double-precision implementation of hypot() looks like so:

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double __uint64_as_double (uint64_t a)
{
    double r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double my_hypot (double a, double b)
{
    double fa, fb, mn, mx, scale_in, scale_out, r, s, t;
    uint64_t expo;

    fa = fabs (a);
    fb = fabs (b);
    mx = fmax (fa, fb);
    mn = fmin (fa, fb);

    /* compute scale factors */
    expo = __double_as_uint64 (mx) & 0xff80000000000000ULL;
    scale_in = __uint64_as_double (0x7fc0000000000000ULL - expo);
    scale_out = __uint64_as_double (expo + 0x0020000000000000ULL);
    
    /* scale mx towards unity */
    mn = mn * scale_in;
    mx = mx * scale_in;

    /* mx in [2**-52, 2**6) */
    s = fma (mx, mx, mn * mn); // 0.75 ulp
    r = sqrt (s);

    /* reverse previous scaling */
    r = r * scale_out;

    /* check for special cases: infinity and NaN */
    t = a + b;
    if (!(fabs (t) <= INFINITY)) r = t; // isnan(t)
    if (mx == INFINITY) r = mx; // isinf(mx)

    return r;
}


http://steve.hollasch.net/cgindex/math/pythag-root.txt

suggests that the faster implementation using sqrt() converges quadratically, vs. cubically for Moler & Morrison, with about the same overflow characteristics.
