Some floating point precision and numeric limits question

Developer https://www.devze.com 2023-03-05 17:01 Source: Internet
I know that there are tons of questions like this one, but I couldn't find my answers. Please read before voting to close (:

  • According to PC ASM:
The numeric coprocessor has eight floating point registers. 
Each register holds 80 bits of data. 
Floating point numbers are always stored as 80-bit 
extended precision numbers in these registers.

How is that possible when sizeof reports something different? For example, on the x64 architecture, sizeof(double) is 8, which is far from 80 bits.

  • Why does std::numeric_limits< long double >::max() give me 1.18973e+4932?! This is a huuuuuuuuuuge number. If this is not the way to get the maximum of a floating point type, then why does it compile at all and, even more, why does it return a value?

  • What does this mean:

Double precision magnitudes can range from approximately 10^−308 to 10^308 

These are huge numbers. How can you store them in 8 bytes, or even in 16 bytes (which is extended precision and is only 128 bits)?

Obviously, I'm missing something. Actually, obviously, a lot of things.


1) sizeof is the size in memory, not in a register. sizeof is in bytes, so 8 bytes = 64 bits. When doubles are calculated in the FPU registers (on this architecture), they get an extra 16 bits for more precise intermediate calculations. When the value is copied back to memory, the extra 16 bits are lost.

2) Why do you think long double doesn't go up to 1.18973e+4932?

3) Why can't you store 10^308 in 8 bytes? I only need 13 bits: 4 to store the 10, and 9 to store the 308.
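Point 1 can be checked from code. The following is a minimal sketch; the helper names storage_bits and significand_bits are made up for illustration and are not part of any library. It separates the size a type occupies in memory from the number of significand bits it carries.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <limits>    // std::numeric_limits

// Hypothetical helper: bits of memory that a value of type T occupies.
// This is what sizeof measures; it says nothing about FPU registers.
template <typename T>
constexpr std::size_t storage_bits() {
    return sizeof(T) * CHAR_BIT;
}

// Hypothetical helper: bits of significand that T carries,
// counting the implicit leading bit where the format has one.
template <typename T>
constexpr int significand_bits() {
    return std::numeric_limits<T>::digits;
}
```

On a typical x86-64 Linux build with GCC, storage_bits<long double>() is 128 while significand_bits<long double>() is 64, because the 80-bit value is padded to 16 bytes in memory. Other compilers and platforms may report different numbers.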


  1. A double is not an Intel coprocessor 80-bit floating point number, it is an IEEE 754 64-bit floating point number. With sizeof(double) you will get the size of the latter.

  2. This is the correct way to get the maximum value for long double, so your question is pointless.

  3. You are probably missing that floating point numbers are not exact numbers. A double holding 10^308 doesn't store 308 digits, only about 15 to 17 significant digits.
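The number of significant decimal digits a type carries can be queried rather than memorized. This is a small sketch, assuming an IEEE 754 platform; the two function names are made up for illustration.

```cpp
#include <limits>  // std::numeric_limits

// Hypothetical helper: decimal digits that are guaranteed to survive
// a text -> T -> text round trip.
template <typename T>
constexpr int digits_preserved() {
    return std::numeric_limits<T>::digits10;
}

// Hypothetical helper: decimal digits needed so that a
// T -> text -> T round trip gives back exactly the same value.
template <typename T>
constexpr int digits_to_round_trip() {
    return std::numeric_limits<T>::max_digits10;
}
```

For an IEEE 754 double these are 15 and 17, no matter how large the exponent of the stored value is.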


The space the FPU uses internally and the space used in memory to represent a double are two different things. IEEE 754 (which probably most architectures use) specifies 32-bit single precision and 64-bit double precision numbers, which is why sizeof(double) gives you 8 bytes. The x87 FPU on Intel x86 does floating point math internally using 80 bits.

std::numeric_limits< long double >::max() is giving you the correct maximum value for long double, which is typically 80 bits wide on x86. If you want the maximum value for a 64-bit double, you should use double as the template parameter.
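A short sketch of the point about the template parameter, with helper names invented for this example: the same query gives a different answer for each floating point type.

```cpp
#include <limits>  // std::numeric_limits

// Hypothetical helper: largest finite value of type T.
template <typename T>
constexpr T largest_finite() {
    return std::numeric_limits<T>::max();
}

// Hypothetical helper: largest n such that 10^n is a finite value of T.
template <typename T>
constexpr int largest_power_of_ten() {
    return std::numeric_limits<T>::max_exponent10;
}
```

largest_power_of_ten<double>() is 308 on an IEEE 754 platform. For long double the result depends on the compiler and target; 4932 is what the value quoted in the question corresponds to.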

For the question about ranges, why do you think you can't store them in 8 bytes? They do in fact fit, and what you're missing is that at the extremes of the range there are numbers that can't be represented (for example, with the exponent nearing 308, there are many, many integers that can't be represented at all).
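The gaps between representable values can be observed directly. This is a minimal sketch, assuming IEEE 754 double arithmetic with round-to-nearest; the function and constant names are made up for illustration.

```cpp
// Hypothetical helper: true when x + 1 is exactly representable,
// that is, when neighbouring integers near x are still distinct doubles.
constexpr bool adding_one_is_exact(double x) {
    return (x + 1.0) - x == 1.0;
}

// 2^52: from here up to 2^53, neighbouring doubles are exactly 1 apart.
constexpr double two_to_52 = 4503599627370496.0;

// 2^53: from here on, neighbouring doubles are at least 2 apart,
// so 2^53 + 1 cannot be stored and rounds back to 2^53.
constexpr double two_to_53 = 9007199254740992.0;
```

adding_one_is_exact(two_to_52) holds, while adding_one_is_exact(two_to_53) and adding_one_is_exact(1e308) do not: near 10^308 the distance between neighbouring doubles is itself an astronomically large number.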

See also http://floating-point-gui.de/ for info about floating point.


Floating point numbers on computers are represented according to IEEE 754-2008.

It defines several formats, amongst which
binary32 = Single precision,
binary64 = Double precision and
binary128 = Quadruple precision are the most common.
http://en.wikipedia.org/wiki/IEEE_754-2008#Basic_formats

Double precision numbers have 52 bits for the fraction, which gives the precision, 11 bits for the exponent, which gives the magnitude of the number, and 1 sign bit.
So doubles are 1.xxx (52 binary digits) * 2 ^ exponent, where the 11-bit exponent field is stored with a bias of 1023 and the largest exponent of a finite value is 1023.

And 2^1024 ≈ 1.79 * 10^308,
which is the upper bound for what you can store in a double (the largest finite double lies just below it).
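The arithmetic above can be replayed at compile time. This sketch assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits); the helpers square, pow2 and largest_double_from_fields are written for this example only.

```cpp
#include <limits>  // std::numeric_limits

constexpr double square(double x) { return x * x; }

// Hypothetical helper: 2^n computed by repeated squaring, which keeps the
// recursion shallow enough for compile-time evaluation.
constexpr double pow2(int n) {
    return n < 0       ? 1.0 / pow2(-n)
         : n == 0      ? 1.0
         : n % 2 == 0  ? square(pow2(n / 2))
                       : 2.0 * pow2(n - 1);
}

// Largest finite double rebuilt from its fields:
// all 52 fraction bits set gives a significand of 2 - 2^-52,
// and the largest exponent of a finite value is 1023.
constexpr double largest_double_from_fields() {
    return (2.0 - pow2(-52)) * pow2(1023);
}
```

The result compares equal to std::numeric_limits<double>::max(). Note that std::numeric_limits<double>::max_exponent reports 1024 rather than 1023, because C++ defines it for a significand in [0.5, 1) instead of [1, 2).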

Quadruple precision numbers have 112 bits for the fraction and 15 bits for the exponent, so the largest exponent of a finite value is 16383 and all finite values lie below 2^16384.

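As 2^16384 ≈ 1.18 * 10^4932, you see that your C++ test was perfectly correct. That range belongs to any format with a 15-bit exponent: on x86 and x64, long double is typically the 80-bit x87 extended format rather than true quadruple precision, and the two share the same exponent range. A plain double is still stored as a 64-bit double precision number.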
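Which format a given compiler actually uses for long double can be read off its limits. The following is a rough sketch; the enum and the function are hypothetical names, and only three common formats are recognized.

```cpp
#include <limits>  // std::numeric_limits

// Hypothetical classification of the long double format in use.
enum class LongDoubleFormat { Binary64, X87Extended80, Binary128, Other };

// Distinguishes formats by significand width and exponent range.
// 80-bit extended and binary128 share max_exponent == 16384,
// so only the significand width (64 vs 113 bits) tells them apart.
constexpr LongDoubleFormat classify_long_double() {
    using L = std::numeric_limits<long double>;
    return (L::digits == 53 && L::max_exponent == 1024)
               ? LongDoubleFormat::Binary64
         : (L::digits == 64 && L::max_exponent == 16384)
               ? LongDoubleFormat::X87Extended80
         : (L::digits == 113 && L::max_exponent == 16384)
               ? LongDoubleFormat::Binary128
               : LongDoubleFormat::Other;
}
```

On x86-64 Linux with GCC this is expected to yield X87Extended80, while MSVC treats long double as a 64-bit double and would yield Binary64.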
