提问人:chux - Reinstate Monica 提问时间:2/1/2023 更新时间:2/2/2023 访问量:135
为什么加法误差接近“DBL_MAX”和无穷大?
Why addition error near `DBL_MAX` and infinity?
问:
在四舍五入模式的实验中,( 和 )显然无法形成预期的无穷大之和,而是在将 和 的 1 ULP 相加时形成。FE_DOWNWARD
FE_TOWARDZERO
DBL_MAX
DBL_MAX
DBL_MAX
下面是演示意外求和的测试代码。在不同的回合模式下,它会增加接近 0.5 ULP 和 1.0 ULP 的值。和 中没有发现任何问题。DBL_MAX
FE_TONEAREST
FE_UPWARD
问题:
- 你同意这是一个错误吗?
- 代码在另一台机器上形成正确答案吗?
- 可悲的是,这是在最近报道的另一个近乎
DBL_MAX
问题之后发生的,所以也许我的数学库低于标准。要求提供关于如何报告的建议。
编译器说明: 调用:
Cygwin C 编译器 gcc -std=c11 -O0 -
g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF“rand_i.d” -MT“rand_i.o” -o “rand_i.o” “../rand_i.c“
COLLECT_GCC=gcc 目标:x86_64-pc-cygwin gcc
版本 11.3.0 (GCC) GNU C11 (GCC) 版本 11.3.0 (x86_64-pc-cygwin
)
由 GNU C 版本 11.3.0、GMP 版本 6.2.1、MPFR 版本 4.1.0、MPC 版本 1.2.1、isl 版本 isl-0.25-GMP 编译
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2)/4;
int P = (LDBL_MANT_DIG + 2)/4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf("mode: Addendum SUM (double) SUM (long double)\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
double sum = max + delta[d];
printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
}
puts("");
}
}
输出:注意 4 行意外行。
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double)
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024
答:
检查和处理溢出分两个阶段进行。请注意,实际的软件/电路可能使用不同的策略,但结果就像这是程序一样。
- 计算的无限精度结果是四舍五入的,其规则基于当前的四舍五入模式,但对指数没有限制。在这个阶段,没有“无穷大”这样的东西,但数字可以变得像他们需要的那样大。
- 如果结果超出可表示范围,则以基于舍入模式的方式将其“更正”为在可表示范围内。对于(“正常”模式),数字被校正为无穷大。但是,对于 ,它被更正为 。(对于其他舍入模式,它取决于符号:从零四舍五入导致无穷大,向零四舍五入导致 。
FE_NEAREST
FE_TOWARDZERO
+/-DBL_MAX
+/-DBL_MAX
因此,给定模式的溢出规则让人想起该模式的舍入规则,但实际上并不相同。可以说是很奇怪的,因为它在所有超出范围的情况下都像(仅限 IEEE-754)从零拉开距离的舍入模式,而不是注意到无穷大比 .但其他模式的基本行为是,当其首选方向(给定符号)趋向于零时输出,当其首选方向远离零时输出无穷大。FE_NEAREST
DBL_MAX
+/-DBL_MAX
另请注意,当阶段 1 的结果超出可表示范围的值时,即使阶段 2 的结果为 。溢出并不表示“我做了一个无穷大”,而是“我必须做第 2 阶段”。DBL_MAX
评论
DBL_MAX + ulp
0x1.0000000000000000p+1024
FE_TOWARDZERO
FE_NEAREST
DBL_MAX
double
FE_NEAREST
@Sneftel使用 2 阶段舍入模型的想法很有用,并且很好地解释了为什么我的预期结果是错误的并且代码是正确的。
下面是发布溢出标志的增强代码,以帮助说明模型应用程序,以供参考。
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
const char* exceptstr(fexcept_t *flag) {
static char buf[100];
buf[0] = 0;
if (*flag & FE_DIVBYZERO)
strcat(buf, STRINGIFY(FE_DIVBYZERO));
if (*flag & FE_INEXACT)
strcat(buf, STRINGIFY(FE_INEXACT));
if (*flag & FE_INVALID)
strcat(buf, STRINGIFY(FE_INVALID));
if (*flag & FE_OVERFLOW)
strcat(buf, STRINGIFY(FE_OVERFLOW));
if (*flag & FE_UNDERFLOW)
strcat(buf, STRINGIFY(FE_UNDERFLOW));
return buf;
}
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2) / 4;
int P = (LDBL_MANT_DIG + 2) / 4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf(
"mode: Addendum SUM (double) SUM (long double) FE\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
if (feclearexcept(FE_ALL_EXCEPT)) {
perror("feclearexcept()");
return -1;
}
///////////////////////////////
double sum = max + delta[d];
///////////////////////////////
fexcept_t flag;
if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
perror("fegetexceptflag()");
return -1;
}
printf(
"mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max + delta[d],
exceptstr(&flag));
}
puts("");
}
}
输出(看向最右边)
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double) FE
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024 FE_INEXACTFE_OVERFLOW
上一个:For循环返回错误结果
评论
double
long double
LDBL_MAX :0xf.fffffffffffffff0p+16380 1.18973149535723176502e+4932
long double
0xf.fffffffffffffff0p+16380 == 0x1.fffffffffffffffep+16383
DBL_MAX + DBL_MAX
FE_DOWNWARD