How many digits can float8, float16, float32, float64, and float128 contain?

Asked by: mathguy  Asked: 6/9/2019  Last edited by: mathguy  Updated: 11/14/2023  Views: 26973


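Numpy's dtype documentation only shows "x bits exponent, y bits mantissa" for each float type, but I couldn't translate that into exactly how many digits there are before/after the decimal point. Is there a simple formula/table to look it up in?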

python numpy floating-point precision


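3 upvotes campovski 6/9/2019
Here is a note on what the exponent and mantissa do in decimal; in binary everything is the same, except that the base is 2 rather than 10. I think you can figure it out from there, since you are a "mathguy". (Hint: convert the upper and lower bounds to decimal representation and see how many digits you get.)
3 upvotes Mark Dickinson 6/9/2019
This isn't a stupid question at all, but the answer is complicated and depends on how you're going to use the information. For example, the IEEE 754 binary64 type can faithfully represent any not-too-large, not-too-small decimal value with 15 or fewer significant digits, but faithfully representing a binary64 value in decimal requires 17 decimal digits. Arguments can be made for various different values in the range 15-17.
2 upvotes Paul Panzer 6/9/2019
np.finfo should give you everything you need to know.
4 upvotes Eric Postpischil 6/10/2019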


19 upvotes Netch 6/9/2019 #1


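  1. Given a value in decimal representation, how many decimal digits are guaranteed to be preserved if it is converted from decimal to the selected binary format and back (with default rounding).

  2. Given a value in binary format, how many decimal digits are needed so that converting the value to decimal and back to the original binary format (again, with default rounding) leaves the original value unchanged.

In both cases, the decimal representation is treated as independent of the exponent, without leading and trailing zeros (for example, 0.0123e4, 1.23e2, 1.2300e2, 123, 123.0 and 123000.000e-3 are all 3 digits).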
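The digit-counting convention above can be sketched in Python. This is only an illustration: the helper name significant_digits is made up for this example, and treating zero as having no significant digits is an assumption, not something the answer states.

```python
from decimal import Decimal

def significant_digits(text):
    """Count the digits of a decimal literal, ignoring the exponent
    and any leading or trailing zeros (hypothetical helper)."""
    d = Decimal(text)
    if d.is_zero():
        return 0  # assumption: zero carries no significant digits
    # normalize() strips trailing zeros; leading zeros are never stored
    return len(d.normalize().as_tuple().digits)

examples = ["0.0123e4", "1.23e2", "1.2300e2", "123", "123.0", "123000.000e-3"]
for s in examples:
    print(s, significant_digits(s))  # each line ends with 3
```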

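For 32-bit binary float, these two sizes are 6 and 9 decimal digits, respectively. In C, these are FLT_DIG and FLT_DECIMAL_DIG from <float.h>. (It is odd that a 32-bit float preserves 7 decimal digits in most cases, but there are exceptions.) In C++, look at std::numeric_limits<float>::digits10 and std::numeric_limits<float>::max_digits10, respectively.

For 64-bit binary float, they are 15 and 17 (DBL_DIG and DBL_DECIMAL_DIG in C, and std::numeric_limits<double>::{digits10, max_digits10} in C++, respectively).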
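Both kinds of round trip can be tried directly with Python's built-in float, which is a 64-bit binary float on common platforms. The sketch below is illustrative only; the two helper names are invented for this example, and the decimal check assumes the input string is already written the way the "g" format would print it.

```python
import sys

def survives_decimal_roundtrip(text, digits):
    """decimal -> binary64 -> decimal, printed with `digits` significant digits."""
    return format(float(text), f".{digits}g") == text

def survives_binary_roundtrip(x, digits):
    """binary64 -> decimal with `digits` significant digits -> binary64."""
    return float(format(x, f".{digits}g")) == x

# Case 1: any 15-digit decimal survives; 16 digits are not guaranteed.
print(survives_decimal_roundtrip("0.123456789012345", 15))  # True
print(survives_decimal_roundtrip("9007199254740993", 16))   # False (2**53 + 1)

# Case 2: 15 digits may lose the value; 17 digits always restore it.
x = 0.1 + 0.2
print(survives_binary_roundtrip(x, 15))  # False
print(survives_binary_roundtrip(x, 17))  # True

print(sys.float_info.dig)  # Python's counterpart of DBL_DIG
```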

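The general formulas for them (thx2 @MarkDickinson), where p is the binary precision, i.e. the number of significand bits including the implicit leading bit:

  • ${format}_DIG (digits10): floor((p-1)*log10(2))
  • ${format}_DECIMAL_DIG (max_digits10): ceil(1+p*log10(2))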
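As a quick check, the two formulas can be evaluated for the usual precisions. This is a minimal sketch: the function names simply mirror the C++ member names, and the precision table (11, 24, 53, 64 and 113 bits, counting the leading bit) is supplied here for illustration.

```python
import math

def digits10(p):
    """Decimal digits guaranteed to survive decimal -> binary -> decimal."""
    return math.floor((p - 1) * math.log10(2))

def max_digits10(p):
    """Decimal digits needed for binary -> decimal -> binary to be exact."""
    return math.ceil(1 + p * math.log10(2))

# p = significand bits including the leading bit
precisions = {
    "binary16": 11,
    "binary32": 24,
    "binary64": 53,
    "x87 extended (80-bit)": 64,
    "binary128": 113,
}
for name, p in precisions.items():
    print(f"{name:22s} p={p:3d}  digits10={digits10(p):2d}  max_digits10={max_digits10(p):2d}")
```

For binary32 and binary64 this gives 6/9 and 15/17, matching the FLT_* and DBL_* constants mentioned above.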


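Also, the C++ numeric limits page has a note with some mathematical explanation:

The standard 32-bit IEEE 754 floating-point type has a 24-bit fractional part (23 bits written, one implied), which may suggest that it can represent 7-digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform, and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the round trip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down gives the value 6.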
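The quoted counterexample can be reproduced without NumPy. The sketch below uses the standard struct module to round a value to 32-bit float; the helper name to_float32 is made up for this example.

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 7 significant digits: does not survive the round trip
seven = "8.589973e+09"
stored = to_float32(float(seven))
print(seven, "->", format(stored, ".6e"))   # 8.589973e+09 -> 8.589974e+09

# 6 significant digits: survives, as digits10 == 6 guarantees
six = "8.58997e+09"
print(six, "->", format(to_float32(float(six)), ".5e"))
```

Near 8.59e9 adjacent 32-bit floats are 1024 apart, which is coarser than the seventh decimal digit there, so the nearest float prints as a different 7-digit decimal.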

Look in the comments for the values for 16-bit and 128-bit floats (but see below for what the 128-bit float really is).


@PaulPanzer suggested numpy.finfo. It gives the first of these values ({format}_DIG); maybe it is the thing you are looking for:

>>> numpy.finfo(numpy.float16).precision
3
>>> numpy.finfo(numpy.float32).precision
6
>>> numpy.finfo(numpy.float64).precision
15
>>> numpy.finfo(numpy.float128).precision
18

but on most systems (mine was Ubuntu 18.04 on x86-64) the value is confusing for float128: it is really the value for the 80-bit x86 "extended" float with a 64-bit significand. A real IEEE754 float128 has 112 significand bits, so the real value would be around 33, but numpy presents another type under this name. See here for details: in general, float128 is an illusion in numpy.
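To see what a given platform actually provides, numpy.finfo can be asked for the storage size and significand width as well. This is a sketch that assumes NumPy is installed; it uses np.longdouble rather than np.float128 because the latter name does not exist on every platform, and the longdouble row differs between systems.

```python
import numpy as np

for t in (np.float16, np.float32, np.float64, np.longdouble):
    info = np.finfo(t)
    # nmant counts the stored fraction bits; add 1 for the leading bit
    print(f"{np.dtype(t).name:10s} storage={info.bits:3d} bits  "
          f"significand={info.nmant + 1:3d} bits  precision={info.precision}")
```

If the last row reports a 64-bit significand inside 128 bits of storage, the type is the padded 80-bit x86 extended format described above and not IEEE 754 binary128.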

UPD3: you mentioned float8 - there is no such type in the IEEE754 set. One could imagine such a type for some utterly specific purposes, but its range would be too narrow for any universal usage.


1 upvote Mark Dickinson 6/9/2019
The C standard has the relevant formulas in it: see e.g., C11. Those formulas, for a given binary precision p, are floor((p-1)*log10(2)) and 1 + ceil(p * log10(2)). For IEEE 754 binary16 the relevant numbers are 3 and 5; for IEEE 754 binary128 they're 33 and 36.
1 upvote Mark Dickinson 6/9/2019
The general condition for precision-p base-B floating-point to round-trip through precision-q base-D floating-point (assuming that B and D aren't powers of a common base) is that B**p <= D**(q-1); both of the formulas in the C standard can be derived from this.
0 upvotes Paul Panzer 6/10/2019
"but, on my system ... the value is wrong for float128" Nope, what's "wrong" here is the name float128, which is in fact a float80; the 128 refers to the fact that it is "padded" to 128 bits for the sake of alignment - fell for that myself
0 upvotes Netch 6/10/2019
@PaulPanzer thanks, integrated this into the reply. As for "float128 is in fact a float80", that is the kind of thing that confuses anybody not already familiar with these specifics, so it is worth repeating regularly.
0 upvotes Netch 6/10/2019
@MarkDickinson thanks, integrated this into the reply.
13 upvotes SREERAG R NANDAN 9/3/2021 #2

To keep it simple.

Normally as the magnitude of the value increases or decreases, the number of decimal digits of precision increases or decreases respectively


Data-Type | Precision
float16   | 3 to 4
float32   | 6 to 9
float64   | 15 to 17
float128  | 18 to 34

If you understood, don't forget to upvote the answer.

Bitwise properties:

float16 : 1 sign bit, 5 exponent bits, and 10-bit significand (fractional part).

float32 : 1 sign bit, 8 exponent bits, and 23-bit significand (fractional part).

float64 : 1 sign bit, 11 exponent bits, and 52 fraction bits.

float128 : 1 sign bit, 15 exponent bits, and 112 fraction bits.
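The field layout can be inspected from Python for float64, since the built-in float uses that format on common platforms. This is a small sketch using only the standard struct module; the helper name float64_fields is invented for this example.

```python
import struct

def float64_fields(x):
    """Split a binary64 value into (sign, biased exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 stored fraction bits
    return sign, exponent, fraction

print(float64_fields(1.0))    # (0, 1023, 0)
print(float64_fields(-2.5))   # (1, 1024, 1125899906842624), fraction == 2**50
```

The stored fraction has 52 bits, and the implicit leading 1 brings the precision to 53 bits, which is where the 15 to 17 decimal digits for float64 come from.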


0 upvotes Pascal Cuoq 5/16/2023
I don't believe that float128 has only 18 decimal digits of precision.
0 upvotes SREERAG R NANDAN 5/17/2023
The answer has been edited, thank you for pointing that out; it now includes both the lower and the upper limit.