为什么加法误差接近“DBL_MAX”和无穷大?

Why addition error near `DBL_MAX` and infinity?

提问人:chux - Reinstate Monica 提问时间:2/1/2023 更新时间:2/2/2023 访问量:135

问:

在四舍五入模式的实验中,( 和 )显然无法形成预期的无穷大之和,而是在将 和 的 1 ULP 相加时形成。FE_DOWNWARDFE_TOWARDZERODBL_MAXDBL_MAXDBL_MAX

下面是演示意外求和的测试代码。在不同的回合模式下,它会增加接近 0.5 ULP 和 1.0 ULP 的值。和 中没有发现任何问题。DBL_MAXFE_TONEARESTFE_UPWARD

问题:

  1. 你同意这是一个错误吗?
  2. 代码在另一台机器上形成正确答案吗?
  3. 可悲的是,这是在最近报道的另一个近乎DBL_MAX问题之后发生的,所以也许我的数学库低于标准。要求提供关于如何报告的建议。

编译器说明: 调用:
Cygwin C 编译器 gcc -std=c11 -O0 -
g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF“rand_i.d” -MT“rand_i.o” -o “rand_i.o” “../rand_i.c“
COLLECT_GCC=gcc 目标:x86_64-pc-cygwin gcc
版本 11.3.0 (GCC) GNU C11 (GCC) 版本 11.3.0 (x86_64-pc-cygwin


由 GNU C 版本 11.3.0、GMP 版本 6.2.1、MPFR 版本 4.1.0、MPC 版本 1.2.1、isl 版本 isl-0.25-GMP 编译


#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
      STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG + 2)/4;
  int P = (LDBL_MANT_DIG + 2)/4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf("mode:               Addendum                          SUM (double)                 SUM (long double)\n");
  for (int i = 0; i < n; i++) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
        nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
        max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
        "ulp-", "ulp", "ulp+"};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d++) {
      double sum = max + delta[d];
      printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
          modes[i],  deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
    }
    puts("");
  }
}

输出:注意 4 行意外行

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX    :0x1.fffffffffffffp+1023  1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971   1.9958403095347198e+292

mode:               Addendum                          SUM (double)                 SUM (long double)
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_DOWNWARD    ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000000p+1024

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_TOWARDZERO  ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp+969   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff802p+1023
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000002p+1024
c 浮点 无穷大

评论

0赞 dbush 2/1/2023
CentOS 7 上的 gcc 4.8.5 为值生成相同的输出,但有所不同。doublelong double
0赞 dbush 2/1/2023
LDBL_MAX :0xf.fffffffffffffff0p+16380 1.18973149535723176502e+4932
0赞 chux - Reinstate Monica 2/1/2023
@dbush值是相同的,只是在打印中发生了变化。 仍然感谢您的见解。long double0xf.fffffffffffffff0p+16380 == 0x1.fffffffffffffffep+16383
0赞 Mark Dickinson 2/1/2023
我相信这是预期的行为 - 如果您评估(例如)使用舍入方向,您应该会看到相同的行为。它与溢出异常的默认异常处理有关。IEEE 754 的相关部分是 7.4 (b):“roundTowardZero 将所有溢出带到格式的最大有限数,并带有中间结果的符号”和 (c):“roundTowardNegative 将正溢出到格式的最大有限数,并将负溢出带到 -inf”。DBL_MAX + DBL_MAXFE_DOWNWARD
0赞 Mark Dickinson 2/1/2023
我可以在很久以后(下班后)写出一个正确的答案,但埃里克·波斯特皮希尔可能会打败我。:-)

答:

2赞 Sneftel 2/1/2023 #1

检查和处理溢出分两个阶段进行。请注意,实际的软件/电路可能使用不同的策略,但结果就像这是程序一样。

  1. 计算的无限精度结果是四舍五入的,其规则基于当前的四舍五入模式,但对指数没有限制。在这个阶段,没有“无穷大”这样的东西,但数字可以变得像他们需要的那样大。
  2. 如果结果超出可表示范围,则以基于舍入模式的方式将其“更正”为在可表示范围内。对于(“正常”模式),数字被校正为无穷大。但是,对于 ,它被更正为 。(对于其他舍入模式,它取决于符号:从零四舍五入导致无穷大,向零四舍五入导致 。FE_NEARESTFE_TOWARDZERO+/-DBL_MAX+/-DBL_MAX

因此,给定模式的溢出规则让人想起该模式的舍入规则,但实际上并不相同。可以说是很奇怪的,因为它在所有超出范围的情况下都像(仅限 IEEE-754)从零拉开距离的舍入模式,而不是注意到无穷大比 .但其他模式的基本行为是,当其首选方向(给定符号)趋向于零时输出,当其首选方向远离零时输出无穷大。FE_NEARESTDBL_MAX+/-DBL_MAX

另请注意,当阶段 1 的结果超出可表示范围的值时,即使阶段 2 的结果为 。溢出并不表示“我做了一个无穷大”,而是“我必须做第 2 阶段”。DBL_MAX

评论

0赞 chux - Reinstate Monica 2/1/2023
同意#1。在“对指数没有限制”的情况下,确切的总和是 - 所以舍入模式还不是那么重要。这是#2,我需要更关注....嗯DBL_MAX + ulp0x1.0000000000000000p+1024FE_TOWARDZERO
0赞 Sneftel 2/1/2023
@chux-ReinstateMonica 如果你忘记了行为,把它想象成“无穷大是向上的,是向下的”,这是最简单的。这是两个最接近的选择。FE_NEARESTDBL_MAX
0赞 chux - Reinstate Monica 2/1/2023
光线开始照进来,但两个步骤,涉及四舍五入模式,似乎不对劲。难道计算不是说它有无限的精度,对指数没有限制吗?如果结果不能完全表示为 ,则舍入模式应用一次?double
0赞 Sneftel 2/1/2023
@chux-ReinstateMonica 当然,但这并不能描述异常引发行为,也没有真正描述无穷大的来源(特别是对于)。但它绝对适用于其他舍入模式的值行为。FE_NEAREST
0赞 chux - Reinstate Monica #2

@Sneftel使用 2 阶段舍入模型的想法很有用,并且很好地解释了为什么我的预期结果是错误的并且代码是正确的。

下面是发布溢出标志的增强代码,以帮助说明模型应用程序,以供参考。

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

const char* exceptstr(fexcept_t *flag) {
  static char buf[100];
  buf[0] = 0;
  if (*flag & FE_DIVBYZERO)
    strcat(buf, STRINGIFY(FE_DIVBYZERO));
  if (*flag & FE_INEXACT)
    strcat(buf, STRINGIFY(FE_INEXACT));
  if (*flag & FE_INVALID)
    strcat(buf, STRINGIFY(FE_INVALID));
  if (*flag & FE_OVERFLOW)
    strcat(buf, STRINGIFY(FE_OVERFLOW));
  if (*flag & FE_UNDERFLOW)
    strcat(buf, STRINGIFY(FE_UNDERFLOW));
  return buf;
}

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
  STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG + 2) / 4;
  int P = (LDBL_MANT_DIG + 2) / 4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
  LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf(
      "mode:               Addendum                          SUM (double)                 SUM (long double)          FE\n");
  for (int i = 0; i < n; i++) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
    nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
    max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
        "ulp-", "ulp", "ulp+"};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d++) {
      if (feclearexcept(FE_ALL_EXCEPT)) {
        perror("feclearexcept()");
        return -1;
      }
      ///////////////////////////////
      double sum = max + delta[d];
      ///////////////////////////////
      fexcept_t flag;
      if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
        perror("fegetexceptflag()");
        return -1;
      }

      printf(
          "mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
          modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max + delta[d],
          exceptstr(&flag));
    }
    puts("");
  }
}

输出(看向最右边)

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX    :0x1.fffffffffffffp+1023  1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971   1.9958403095347198e+292

mode:               Addendum                          SUM (double)                 SUM (long double)          FE
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD    ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO  ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp+969   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff802p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000002p+1024 FE_INEXACTFE_OVERFLOW