size of exponent and fraction in float256

Developer https://www.devze.com 2023-03-28 14:49 Source: the web
You'd better look at the table to understand what I want:

╔════════╦════════╦════════════╦════════════╗
║  name  ║  sign  ║  exponent  ║  fraction  ║
╠════════╬════════╬════════════╬════════════╣
║float16 ║    1   ║      5     ║     10     ║
╠════════╬════════╬════════════╬════════════╣
║float32 ║    1   ║      8     ║     23     ║
╠════════╬════════╬════════════╬════════════╣
║float64 ║    1   ║     11     ║     52     ║
╠════════╬════════╬════════════╬════════════╣
║float128║    1   ║     15     ║    112     ║
╠════════╬════════╬════════════╬════════════╣
║float256║    1   ║    ????    ║    ????    ║
╠════════╬════════╬════════════╬════════════╣
║float512║    1   ║    ????    ║    ????    ║
╚════════╩════════╩════════════╩════════════╝

My question is how to calculate the number of bits for the exponent and the fraction, given the total number of bits, such as 256, 512 or 1024.


Early drafts of IEEE-754 (2008) defined guidelines for what the widths of the exponent and significand fields of arbitrary-width floats "should" be. This was not a hard requirement, but merely recommended practice. It was deemed to be too cumbersome for the minimal benefit provided, so it was dropped from the standard altogether, and replaced with:

Language standards should define mechanisms supporting extendable precision for each supported radix. Language standards supporting extendable precision shall permit users to specify p and emax. Language standards shall also allow the specification of an extendable precision by specifying p alone; in this case emax shall be defined by the language standard to be at least 1000×p when p is ≥ 237 bits in a binary format or p is ≥ 51 digits in a decimal format.

(3.7 Extended and extendable precisions, p14).

That said, the standard still defines (without requiring) "interchange formats" of every multiple-of-32-bits size larger than 128 in the tables in clause 3.6 (p13). Specifically, the binary format of width k has an exponent field of round(4*log2(k)) - 13 bits. For the specific case of k=256, this gives:

exponent: round(4*log2(256)) - 13 = 32 - 13 = 19
significand field (fraction): 256 - 1 - 19 = 236 (precision p = 237 once the implicit leading bit is counted)

For a 384-bit wide format that followed this formula, the exponent width would be:

round(4*log2(384)) - 13 = round(34.339850002884624) - 13 = 34 - 13 = 21 bits
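
The same arithmetic can be sketched in Python for any width. This is only an illustration of the formula quoted above; the function name `interchange_widths` is made up for this example, and the guard on k reflects the "multiple of 32 bits, 128 or larger" condition mentioned earlier.

```python
import math

def interchange_widths(k):
    """Return (sign, exponent, fraction) bit widths for a k-bit binary
    interchange format, using exponent = round(4*log2(k)) - 13."""
    if k < 128 or k % 32 != 0:
        raise ValueError("formula applies to k >= 128 and a multiple of 32")
    exponent = round(4 * math.log2(k)) - 13
    fraction = k - 1 - exponent  # whatever is left after sign and exponent
    return 1, exponent, fraction

for k in (128, 256, 384, 512, 1024):
    sign, exponent, fraction = interchange_widths(k)
    print(f"float{k}: sign={sign} exponent={exponent} fraction={fraction}")
```

For k=128 this reproduces the 15/112 split from the table in the question, and for k=256 it gives the 19/236 split computed above.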

Please be aware that there are lots of packages out there for arbitrary-precision floating-point arithmetic that do not adhere to these guidelines. This is only the definition of the "binary256 interchange format", not what any given implementation necessarily uses.


There is no 256-bit double in the IEEE 754-2008 floating-point standard.

The numbers of bits in the formats are not calculated; they are chosen arbitrarily to give a specific precision and range. If you want to create your own 256-bit floating-point number format, you can just pick the sizes that give you the precision and range that you want.
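
To see what a given choice of sizes buys you, here is a rough Python sketch. It assumes an IEEE-style layout (biased exponent, implicit leading significand bit); the function name `describe_format` is hypothetical, and the decimal figures are approximations, not values taken from any standard.

```python
import math

def describe_format(exponent_bits, fraction_bits):
    """Return (p, emax, decimal_digits, max_decimal_exponent) for an
    IEEE-style binary format with the given field widths (assumed layout)."""
    p = fraction_bits + 1                # precision, counting the implicit bit
    emax = 2 ** (exponent_bits - 1) - 1  # largest unbiased exponent
    # Decimal digits that always survive a round trip through the format.
    decimal_digits = math.floor((p - 1) * math.log10(2))
    # Largest finite value is just below 2**(emax + 1).
    max_decimal_exponent = math.floor((emax + 1) * math.log10(2))
    return p, emax, decimal_digits, max_decimal_exponent

print(describe_format(8, 23))    # float32 field widths
print(describe_format(11, 52))   # float64 field widths
print(describe_format(19, 236))  # one possible 256-bit split
```

Widening the exponent field extends the range, widening the fraction field adds digits of precision, and the two must fit together with the sign bit in the total width.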


The values in your table are from the IEEE 754-2008 standard, which only goes up to 128 bits. If you have hardware or software implementing floating point with even more bits, you need to consult its documentation.
