开发者

Is there a floating point value of x, for which x-x == 0 is false?

开发者 https://www.devze.com 2022-12-28 05:19 出处:网络
In most cases, I understand that a floating point comparison test should be implemented using over a range of values (abs(x-y) < epsilon), but does self subtraction imply that the result will be ze

In most cases, I understand that a floating point comparison test should be implemented using over a range of values (abs(x-y) < epsilon), but does self subtraction imply that the result will be zero?

// can the assertion be triggered?
float x = //?;
assert( x-x == 0 )

My guess is that nan/inf might be special cases, but I'm more interested in what happens for simple v开发者_如何学JAVAalues.

edit:

I'm happy to pick an answer if someone can cite a reference (IEEE floating point standard)?


As you hinted, inf - inf is NaN, which is not equal to zero. Similarly, NaN - NaN is NaN. It is true, however, that for any finite floating-point number x, x - x == 0.0 (depending on the rounding mode, the result of x - x might be negative zero, but negative zero compares equal to 0.0 in floating-point arithmetic).

Edit: it's a little tricky to give a clear standards reference, because this is an emergent property of the rules set forth in the IEEE-754 standard. Specifically, it follows from the requirement that operations defined in Clause 5 be correctly rounded. Subtraction is such an operation (Section 5.4.1 "Arithmetic operations"), and the correctly-rounded result of x - x is a zero of the appropriate sign (Section 6.3, paragraph 3):

When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0.

So the result of x - x must be +/- 0, and therefore must compare equal to 0.0 (Section 5.11, paragraph 2):

Comparisons shall ignore the sign of zero.

Further Edit: That's not to say that a buggy compiler couldn't cause that assert to fire. Your question is ambiguous; there is no finite floating point number x such that x - x == 0 is false. However, that's not what the code that you posted checks; it checks whether or not a certain expression in a C-style language can evaluate to a non-zero value; in particular, on certain platforms, with certain (ill-conceived) compiler optimizations, the two instances of the variable x in that expression might have different values, causing the assertion to fail (especially if x is the result of some computation, instead of a constant, representable value). This is a bug in the numerics model on those platforms, but that doesn't mean that it can't happen.


If the representation is transformed (for example from 64-bit memory format to 80-bit internal register format on x86) I would expect that the assertion could possibly fire in some circumstances.


Yes, apart from the special cases x-x will always be 0. But x*(1/x) will not always be 1 ;-)


Yes, self-subtraction should always result in zero, except for special cases.

The problem occurs where you add, subtract, multiply or divide before a comparison where the exponent and mantissa are adjusted. When the exponents are the same, the mantissas are subtracted, and if they're the same, everything ends up at zero.

http://grouper.ieee.org/groups/754/


My answer on the main question: "Is there a floating point value of x, for which x-x == 0 is false?" is: at least floating point implementation on Intel processors makes NO arithmetic underflow in "+" and "-" operations and so you will be not able to find an x for which x-x == 0 is false. The same is true for all processors which supports IEEE 754-2008 (see references bellow).

My short answer on another your question: if (x-y == 0) is exactly so safe as if (x == y), so assert(x-x == 0) is OK, because no arithmetic underflow will be produced in x-x or (x-y).

The reason is following. A float/double number will be hold in memory in the form mantissa and binary exponent. In the standard case mantissa is normalized: it is >= 0.5 and < 1. In <float.h> you can find some constants from IEEE floating point standard. Interesting now for us are only following

#define DBL_MIN         2.2250738585072014e-308 /* min positive value */
#define DBL_MIN_10_EXP  (-307)                  /* min decimal exponent */
#define DBL_MIN_EXP     (-1021)                 /* min binary exponent */

But not everybody knows, that you can have double numbers less then DBL_MIN. If you makes arithmetical operations with numbers under DBL_MIN, this number will be NOT normalized and so you works with this numbers like with integers (operation with mantissa only) without any "round errors".

Remark: I personally try not use words "round errors", because there are no errors in arithmetical computer operations. These operation are only not the same as +,-,* and / operations with the same computer numbers like floating number. There are deterministic operations on the subset of floating point numbers which can be saved in the form (mantissa,exponent) with well defined number of bits for each. Such subset of floats we can named as computer floating number. So the result of classical floating point operation will be projected back to the computer floating number set. Such projecting operation is deterministic, and have a lot of features like if x1 >= x2 then x1*y >= x2*y.

Sorry for the long remark and back to our subject.

To show exactly what we have if we operate with numbers less then DBL_MIN I wrote a small program in C:


#include <stdio.h>
#include <float.h>
#include <math.h>

void DumpDouble(double d)
{
    unsigned char *b = (unsigned char *)&d;
    int i;

    for (i=1; i<=sizeof(d); i++) {
        printf ("%02X", b[sizeof(d)-i]);
    }
    printf ("\n");
}

int main()
{
    double x, m, y, z;
    int exp;

    printf ("DBL_MAX=%.16e\n", DBL_MAX);
    printf ("DBL_MAX in binary form: ");
    DumpDouble(DBL_MAX);

    printf ("DBL_MIN=%.16e\n", DBL_MIN);
    printf ("DBL_MIN in binary form: ");
    DumpDouble(DBL_MIN);

    // Breaks the floating point number x into its binary significand
    // (a floating point value between 0.5(included) and 1.0(excluded))
    // and an integral exponent for 2
    x = DBL_MIN;
    m = frexp (x, &exp);
    printf ("DBL_MIN has mantissa=%.16e and exponent=%d\n", m, exp);
    printf ("mantissa of DBL_MIN in binary form: ");
    DumpDouble(m);

    // ldexp() returns the resulting floating point value from
    // multiplying x (the significand) by 2
    // raised to the power of exp (the exponent).
    x = ldexp (0.5, DBL_MIN_EXP);   // -1021
    printf ("the number (x) constructed from mantissa 0.5 and exponent=DBL_MIN_EXP (%d) in binary form: ", DBL_MIN_EXP);
    DumpDouble(x);

    y = ldexp (0.5000000000000001, DBL_MIN_EXP);
    m = frexp (y, &exp);
    printf ("the number (y) constructed from mantissa 0.5000000000000001 and exponent=DBL_MIN_EXP (%d) in binary form: ", DBL_MIN_EXP);
    DumpDouble(y);
    printf ("mantissa of this number saved as double will be displayed by printf(%%.16e) as %.16e and exponent=%d\n", m, exp);

    y = ldexp ((1 + DBL_EPSILON)/2, DBL_MIN_EXP);
    m = frexp (y, &exp);
    printf ("the number (y) constructed from mantissa (1+DBL_EPSILON)/2 and exponent=DBL_MIN_EXP (%d) in binary form: ", DBL_MIN_EXP);
    DumpDouble(y);
    printf ("mantissa of this number saved as double will be displayed by printf(%%.16e) as %.16e and exponent=%d\n", m, exp);

    z = y - x;
    m = frexp (z, &exp);
    printf ("z=y-x in binary form: ");
    DumpDouble(z);
    printf ("z will be displayed by printf(%%.16e) as %.16e\n", z);
    printf ("z has mantissa=%.16e and exponent=%d\n", m, exp);

    if (x == y)
        printf ("\"if (x == y)\" say x == y\n");
    else
        printf ("\"if (x == y)\" say x != y\n");

    if ((x-y) == 0)
        printf ("\"if ((x-y) == 0)\" say \"(x-y) == 0\"\n");
    else
        printf ("\"if ((x-y) == 0)\" say \"(x-y) != 0\"\n");
}

This code produced following output:

DBL_MAX=1.7976931348623157e+308
DBL_MAX in binary form: 7FEFFFFFFFFFFFFF
DBL_MIN=2.2250738585072014e-308
DBL_MIN in binary form: 0010000000000000
DBL_MIN has mantissa=5.0000000000000000e-001 and exponent=-1021
mantissa of DBL_MIN in binary form: 3FE0000000000000
the number (x) constructed from mantissa 0.5 and exponent=DBL_MIN_EXP (-1021) in binary form: 0010000000000000
the number (y) constructed from mantissa 0.5000000000000001 and exponent=DBL_MIN_EXP (-1021) in binary form: 0010000000000001
mantissa of this number saved as double will be displayed by printf(%.16e) as 5.0000000000000011e-001 and exponent=-1021
the number (y) constructed from mantissa (1+DBL_EPSILON)/2 and exponent=DBL_MIN_EXP (-1021) in binary form: 0010000000000001
mantissa of this number saved as double will be displayed by printf(%.16e) as 5.0000000000000011e-001 and exponent=-1021
z=y-x in binary form: 0000000000000001
z will be displayed by printf(%.16e) as 4.9406564584124654e-324
z has mantissa=5.0000000000000000e-001 and exponent=-1073
"if (x == y)" say x != y
"if ((x-y) == 0)" say "(x-y) != 0"

So we can see, that if we work with numbers less then DBL_MIN, they will be not normalized (see 0000000000000001). We are working with these numbers like with integers and without any "errors". Thus if we assign y=x then if (x-y == 0) is exactly so safe as if (x == y), and assert(x-x == 0) works OK. In this example, z = 0.5 * 2 ^(-1073) = 1 * 2 ^(-1072). This number is really the smallest number which can we have saved in double. All arithmetical operation with numbers less DBL_MIN works like with integer multiplied with 2 ^(-1072).

So I has no underflow problems on my Windows 7 computer with Intel processor. If somebody have another processor it would be interesting to compare our results.

Have somebody an idea how one can produce arithmetic underflow with - or + operations? My experiments looks like so, that it is impossible.

EDITED: I modified the code a little for better readability of the code and the messages.

ADDED LINKS: My experiments shows, that http://grouper.ieee.org/groups/754/faq.html#underflow is absolutely correct on my Intel Core 2 CPU. The way how it will be calculated produce no underflow in "+" and "-" floating point operations. My results are independent on Strict (/fp:strict) or Precise (/fp:precise) Microsoft Visual C compiler switches (see http://msdn.microsoft.com/en-us/library/e7s85ffb%28VS.80%29.aspx and http://msdn.microsoft.com/en-us/library/Aa289157)

ONE MORE (PROBABLY THE LAST ONE) LINK AND MY FINAL REMARK: I found a good reference http://en.wikipedia.org/wiki/Subnormal_numbers, where is described the same what I written before. Including of denormal numbers or denormalized numbers (now often called subnormal numbers for example in In IEEE 754-2008) follow to following statment:

“Denormal numbers provide the guarantee that addition and subtraction of floating-point numbers never underflows; two nearby floating-point numbers always have a representable non-zero difference. Without gradual underflow, the subtraction a−b can underflow and produce zero even though the values are not equal.”

So all my results must be correct on any processor which supports IEEE 754-2008.


Regarding what Mark says -- check out this link http://www.parashift.com/c++-faq-lite/newbie.html#faq-29.18 . (Not sure if it applies to your situation though.)

0

精彩评论

暂无评论...
验证码 换一张
取 消