Please take a look at the following content:
I understand how to convert a double to a binary based on IEEE 754. But I don't understand what the formula is used for.
Can anyone give me an example when we use the above formula, please?
Thanks a lot.
The formula that is highlighted in red can be used to calculate the real number that a 64-bit value represents when treated as an IEEE 754 double. It's only useful if you want to manually calculate the conversion from binary to the base-10 real number that it represents, such as when verifying the correctness of a C library's implementation of printf.
For example, using the formula on 0x3fd5555555555555, x is found to be exactly 0.333333333333333314829616256247390992939472198486328125. That is the real number that 0x3fd5555555555555 represents.
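#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Reinterpret the 64-bit pattern as a double through a union. */
    union {
        double d;
        unsigned long long ull;
    } u;

    /* ULL suffix: the constant needs a 64-bit unsigned type. */
    u.ull = 0x3fd5555555555555ULL;
    /* Print 55 digits after the decimal point. */
    printf("%.55f\n", u.d);
    return EXIT_SUCCESS;
}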
http://codepad.org/kSithgZQ
EDIT: As Olof commented, an IEEE 754 double exactly represents the value x in the equation, but not all real numbers are exactly representable. In fact, only a finite number of reals such as 0.5, 0.125, and 0.333333333333333314829616256247390992939472198486328125 are exactly representable, while the vast majority (uncountably many) including 1/3, 0.1, 0.4, and π are not.
The key to knowing whether a real is exactly representable as an IEEE 754 double is to calculate the real number's binary representation and write it in scientific notation (e.g. b1.001×2^-1 for 0.5625). If the number of binary digits to the right of the binary point, excluding trailing zeroes, is less than or equal to 52 and the exponent is between -1022 and +1023, inclusive, then the number is exactly representable.
Let's go through a couple of examples. Note that it helps to have an arbitrary-precision calculator on hand. I will use ARIBAS.
The number 1/64 is 0.015625 in decimal. To calculate its binary representation, we can use ARIBAS' decode_float function:
==> set_floatprec(double_float).
-: 64
==> 1/64.
-: 0.0156250000000000000
==> set_printbase(2).
-: 0y10
==> decode_float(1/64).
-: (0y10000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000, -0y1000101)
==> set_printbase(10).
-: 10
==> -0y1000101.
-: -69
Thus 1/64 = b0.000001, or b1.0×2^-6 in scientific notation.
1/64 is exactly-representable.
The number 1/10 = 0.1 in decimal. To calculate its binary representation:
==> set_printbase(2).
-: 0y10
==> decode_float(1/10).
-: (0y11001100_11001100_11001100_11001100_11001100_11001100_11001100_11001100, -0y1000011)
==> set_printbase(10).
-: 10
==> -0y1000011.
-: -67
So 1/10 = 0.1 = b0.000(1100), where the digits in parentheses repeat forever, or b1.100(1100)×2^-4 in scientific notation.
1/10 is not exactly-representable.
The formula converts the binary representation into a number!
You only need it if you are implementing a floating-point unit.