C++ Float Division and Precision

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is, when I do this math in C++ (GCC) the result I get is 0.998047, which is a rounded value. I'd prefer to just get the truncated value of 0.998046. How can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

A float is stored not as a decimal number but as a binary one. If you divide a number by a power of 2, such as 512, the result will almost always be exact. The precision of a float is not simply 7 digits; it is really 24 bits (23 stored bits plus one implicit leading bit).

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

I also know that the precision of floats is 7 digits.

No. The most common floating point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility of using truncation instead of rounding. One way would be to ask for one more digit and ignore it.

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.

Your question is not unique, it has been answered numerous times before. This is not a simple topic and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512 you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, that is, the implicit bit i (set to 1) multiplied by 2^0, which equals 1. Since 511/512 is less than 1, you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get 9 ones, 111111111, or in floating point with the implicit bit, i.11111111, which we can divide by 512 and put together with the exponent of -1, giving i.1111111100... * 2^-1.

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've represented the ninth bit (the eighth explicit one). Sum them up and you get 0.998046875. From the i.11111111 we find that this number represents 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get a fraction, (2^24 - 1)/2^24, with 24 binary and 24 decimal digits of precision. Given appropriate printf formatting, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2, which comes out to 7.875. i11 is the integer part and 111 the fraction part (111/1000 in binary, or 7/8ths). 6 binary digits of precision and 4 decimal.

Thinking decimal when doing floating-point is utterly detrimental to understanding it. Free yourself!

That 'rounded' value is most likely what is displayed through some output method rather than what is actually stored. Check the actual value in your debugger.

With iostream and stdio, you can specify the precision of the output. If you specify 7 significant digits, convert it to a string, then truncate the string before display you will get the output without rounding.

Can't think of one reason why you would want to do that, however, and given the subsequent explanation of the application, you'd be better off using double precision, though that will most likely simply shove the problems somewhere else.

If you are just interested in the value, you could use double and then multiply the result by 10^6 and floor it. Divide again by 10^6 and you will get the truncated value.