Integers and float precision

This is more of a numerical analysis rather than programming question, but I suppose some of you will be able to answer it.

In the sum of two floats, is there any precision lost? Why?

In the sum of a float and an integer, is there any precision lost? Why?

Thanks.

In the sum of two floats, is there any precision lost?

If both floats have differing magnitude and both are using the complete precision range (of about 7 decimal digits) then yes, you will see some loss in the last places.

Why?

This is because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values have different exponents and you add them, the smaller value is left with fewer digits in the mantissa (because it has to adapt to the larger exponent):

PS> [float]([float]0.0000001 + [float]1)
1
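The same absorption effect can be sketched in C++, the language used for other snippets in this thread. This is only an illustration under the assumption that `float` is an IEEE 754 32-bit type; the helper name `add_floats` is made up here, and the `volatile` local is just a precaution so the sum is rounded to `float` even if the compiler evaluates the expression in higher precision.

```cpp
#include <cfloat>
#include <cstdio>

// Hypothetical helper: adds two floats and forces the result to be
// rounded to float by storing it in a volatile float variable.
float add_floats(float a, float b) {
    volatile float sum = a + b;
    return sum;
}

// Prints what happens when an addend smaller than half the spacing
// between floats near 1.0 (FLT_EPSILON / 2, about 6e-8) is added to 1.0.
void show_absorption() {
    const float big = 1.0f;
    const float tiny = 1e-8f;
    std::printf("FLT_EPSILON = %.9g\n", FLT_EPSILON);
    std::printf("1 + 1e-8f   = %.9g\n", add_floats(big, tiny));
    std::printf("unchanged?  = %s\n",
                add_floats(big, tiny) == big ? "yes" : "no");
}
```

Calling `show_absorption()` from `main` prints the values; an addend as large as `FLT_EPSILON` itself would not be absorbed, because `1.0f + FLT_EPSILON` is the next representable float above 1.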

In the sum of a float and an integer, is there any precision lost?

Yes. A normal 32-bit integer can represent values exactly that do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this only applies to numbers that are large enough, i.e. those that need more than 24 bits.

Why?

Because float has 24 bits of precision and (32-bit) integers have 32. A float will still retain the magnitude and most of the significant digits, but the last places may differ:

PS> [float]2100000050 + [float]100
2100000100
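The 24-bit limit can also be checked directly by converting an integer to float and back. The sketch below assumes an IEEE 754 32-bit `float` with the default round-to-nearest mode; `round_trip_through_float` is a hypothetical helper name, and the result is widened to 64 bits so that a float rounded up to 2^31 cannot overflow the return type.

```cpp
#include <cstdint>

// Hypothetical helper: converts a 32-bit integer to float and back.
// Integers that need more than 24 significant bits may come back changed.
std::int64_t round_trip_through_float(std::int32_t value) {
    volatile float as_float = static_cast<float>(value);
    return static_cast<std::int64_t>(as_float);
}
```

Under those assumptions, 16777216 (2^24) survives the round trip, 16777217 comes back as 16777216, and 2100000050 comes back as 2100000000, because floats of that magnitude are spaced 128 apart.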

The precision depends on the magnitude of the original numbers. In floating point, the computer stores a number such as 312 in a form of scientific notation (real hardware uses base 2; decimal is used here for readability):

3.12000000000 * 10 ^ 2

The number of digits in the left-hand part (the mantissa) is fixed. The exponent also has an upper and a lower bound. Together these allow the format to represent very large or very small numbers.

If you try to add two numbers which are the same in magnitude, the result should remain the same in precision, because the decimal point doesn't have to move:

312.0 + 643.0 <==>

3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2

If you tried to add a very big and a very small number, you would lose precision because the result must be squeezed into the above format. Consider 312 + 12300000000000 (that is, 1.23 * 10 ^ 13). First you have to scale the smaller number to line up with the bigger one, which already drops its last two digits, then add:

1.23000000000 * 10 ^ 13 +
0.00000000003 * 10 ^ 13
-----------------------
1.23000000003 * 10 ^ 13 <-- precision lost here!

Floating point can handle very large, or very small numbers. But it can't represent both at the same time.
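One way to see this trade-off on real hardware is to measure the gap between neighbouring floats at different magnitudes. The following is a rough sketch assuming an IEEE 754 32-bit `float`; `gap_above` and `is_absorbed` are names invented for this example, and the values 312 and 1e15 are picked only for illustration.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next representable
// float above it.
float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// True when adding `small` to `big` leaves `big` unchanged.
// The volatile store forces the sum to be rounded to float.
bool is_absorbed(float big, float small) {
    volatile float sum = big + small;
    return sum == big;
}
```

Near 312 the gap is a tiny fraction (2^-15), while near 1e15 it is 2^26 = 67108864, so `is_absorbed(1e15f, 312.0f)` is expected to be true: 312 is far smaller than half the gap at that magnitude.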

As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.

When adding two floating point numbers, there is generally some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and the reasons in detail, and also how to calculate an upper bound on the error, and how to reason about the precision of more complex calculations.
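As a small illustration of such a bound: with round-to-nearest, the relative error of a single float addition is at most half of `FLT_EPSILON` (2^-24 for IEEE 754 single precision). The sketch below measures that error by redoing the sum in double. It assumes that the double sum is exact, which holds when the two floats have similar magnitude; `relative_error_of_float_sum` is a hypothetical name.

```cpp
#include <cfloat>
#include <cmath>

// Relative error of the float sum a + b, measured against the same sum
// carried out in double. Assumes the double sum is exact (true when a
// and b are of similar magnitude) and that the sum is nonzero.
double relative_error_of_float_sum(float a, float b) {
    volatile float rounded = a + b;  // force rounding to float
    const double reference = static_cast<double>(a) + static_cast<double>(b);
    return std::fabs(static_cast<double>(rounded) - reference) /
           std::fabs(reference);
}
```

For example, `relative_error_of_float_sum(0.1f, 0.2f)` should not exceed `FLT_EPSILON / 2.0`, and for `0.5f` and `0.75f` the error is zero because 1.25 is exactly representable.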

When adding a float to an integer, the integer is first converted to a float by C++, so two floats are being added and error is introduced for the same reasons as above.

The precision available for a float is limited, so of course there is always the risk that any given operation drops precision.

The answer for both your questions is "yes".

For instance, you will have problems if you add a very large float to a very small one.

You will also have problems if you add an integer to a float and the integer uses more bits than the float has available for its mantissa.

The short answer: a computer represents a float with a limited number of bits, which is often done with mantissa and exponent, so only a few bytes are used for the significant digits, and the others are used to represent the position of the decimal point.

If you were to try to add (say) 10^23 and 7, the float would not be able to represent the result accurately. A similar argument applies when adding a float and an integer: the integer is promoted to a float first.

In the sum of two floats, is there any precision lost? In the sum of a float and an integer, is there any precision lost? Why?

Not always. If the sum is representable at the precision you ask for, you won't get any precision loss.

Examples:

0.5 + 0.75 => no precision loss
x * 0.5 => no precision loss (unless x is so small that the result underflows)

In the general case, one adds floats in slightly different ranges, so there is a precision loss whose exact size depends on the rounding mode. In other words, if you're adding numbers with totally different ranges, expect precision problems.
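These exact cases can be checked with a few lines of C++. This is a sketch under the assumption of IEEE 754 single precision with subnormal (denormal) support enabled; the function names are invented for this example.

```cpp
#include <limits>

// 0.5, 0.75 and 1.25 are all exactly representable in binary,
// so this sum involves no rounding at all.
bool sum_is_exact_example() {
    volatile float sum = 0.5f + 0.75f;
    return sum == 1.25f;
}

// Halving only changes the exponent, so doubling the result restores x
// unless the halved value underflows.
bool halving_is_exact(float x) {
    volatile float half = x * 0.5f;
    volatile float restored = half * 2.0f;
    return restored == x;
}

// The smallest positive subnormal float has no representable half:
// the halved value rounds to zero, so this is expected to be false.
bool smallest_subnormal_survives_halving() {
    return halving_is_exact(std::numeric_limits<float>::denorm_min());
}
```

The last function covers the exception mentioned above: for ordinary values such as `3.14159f` halving is exact, but at the very bottom of the range a bit is lost.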

Denormals are there to give extra precision for values very close to zero, at the expense of CPU time.

Depending on how your compiler handles floating-point computation, results can vary.

With strict IEEE semantics, adding two 32-bit floats should not give better accuracy than 32 bits. In practice it may require more instructions to ensure that, so you shouldn't rely on accurate and repeatable results with floating point.

In both cases yes:

assert( 1E+36f + 1.0f == 1E+36f );
assert( 1E+36f + 1 == 1E+36f );

The case float + int is the same as float + float, because a standard conversion is applied to the int. In the case of float + float, this is implementation dependent, because an implementation may choose to do the addition at double precision. There may be some loss when you store the result, of course.

In both cases, the answer is "yes". When adding an int to a float, the integer is converted to floating point representation before the addition takes place anyway.

To understand why, I suggest you read this gem: What Every Computer Scientist Should Know About Floating-Point Arithmetic.