两种乘积之和和差的精确浮点计算-解网

问：

两个乘积的差值和两个乘积之和是在各种常见计算中发现的两个基元。diff_of_products （a，b，c，d）：= ab - cd 和 sum_of_products（a，b，c，d）：= ab + cd 是密切相关的伴随函数，它们仅通过某些操作数的符号而有所不同。使用这些原语的示例如下：

计算 x = （a + i b）和 y = （c + i d）的复数乘法：

x*y = diff_of_products （a， c， b， d） + i sum_of_products （a， d， b， c）

2x2 矩阵行列式的计算：diff_of_products （a， d， b， c）：

| a  b |
| c  d |

在直角三角形计算对面导管与相邻导管的长度时：diff_of_products （h， h， a， a）ha

计算具有正判别的二次方程的两个实解：

q = -（b + copysign （sqrt （diff_of_products （b， b， 4a， c））， b）） / 2 x₀ = q / a

x₁ = c / q

3D 叉积 a = b ⨯ c 的计算：

a x = diff_of_products （b y， c z， b z， c y） a y = diff_of_products （b z， c x， b x， c z）
a _z = diff_of_products （b _x， c y， b _y， c_x）
)

当使用 IEEE-754 二进制浮点格式进行计算时，除了潜在的溢出和下溢的明显问题外，当两个乘积的大小相似但 sum_of_products（）的符号相反或 diff_of_products（）的符号相同时，任一函数的幼稚实现都可能遭受灾难性的抵消。

仅关注精度方面，如何在 IEEE-754 二进制算术的背景下稳健地实现这些功能？可以假设融合乘加运算的可用性，因为大多数现代处理器架构都支持此操作，并且通过标准函数在许多编程语言中公开。在不失去通用性的情况下，讨论可以限制为单精度（IEEE-754）格式，以便于阐述和测试。binary32

算法点浮点精度

答：

5赞 njuffa 8/31/2020 #1

融合乘法加法（FMA）运算在提供减法抵消保护方面的效用源于全双宽乘积在最终加法中的参与。据我所知，关于它用于准确和稳健地计算二次方程解的第一个公开记录是著名浮点专家威廉·卡汉（William Kahan）的两组非正式笔记：

威廉·卡汉（William Kahan），“Matlab的损失是没有人的收益”。1998 年 8 月，2004 年 7 月修订（在线）
William Kahan，“关于没有超精确算术的浮点计算的成本”。2004年11月（在线）)

Higham 关于数值计算的标准著作是我第一次遇到 Kahan 的算法应用于 2x2 矩阵行列式的计算（第 65 页）：

Nicholas J. Higham，“数值算法的准确性和稳定性”，SIAM 1996

三位英特尔研究人员在英特尔首款支持 FMA 的 CPU 安腾处理器的背景下发布了一种不同的计算 ab+cd 算法，该算法也基于 FMA（第 273 页）：

Marius Cornea、John Harrison 和 Ping Tak Peter Tang：“基于 Itanium 的系统上的科学计算”。英特尔出版社 2002

近年来，法国研究人员的四篇论文详细研究了这两种算法，并通过数学证明提供了误差范围。对于二进制浮点运算，在中间计算中没有上溢或下溢的情况下，Kahan算法和Cornea-Harrison-Tang（CHT）算法的最大相对误差均为单位舍入的两倍，即2u。对于 IEEE-754 或单精度，此误差范围为 ^2-23，对于 IEEE-754 或双精度，此误差范围为 ^2-52。binary32binary64

此外，还表明，对于二进制浮点运算，Kahan算法中的误差最多为1.5 ulps。从文献中，我不知道 CHT 算法的等效结果，即经过验证的 ulp 误差范围。我自己使用以下代码的实验表明误差范围为 1.25 ulp。

Sylvie Boldo，“Kahan's algorithm for a correct discriminant computation at last formal proved”，IEEE Transactions on Computers，第 58 卷，第 2 期，2009 年 2 月，第 220-225 页（在线）)

Claude-Pierre Jeannerod、Nicolas Louvet 和 Jean-Michel Muller，“进一步分析 Kahan's Algorithm for the Accurate Computation of 2x2 Determinants”，《计算数学》，第 82 卷，第 284 期，2013 年 10 月，第 2245-2264 页（在线）)

Jean-Michel Muller，“On the Error of Computing ab+cd using Cornea， Harrison and Tang's Method”，ACM Transactions on Mathematical Software，第 41 卷，第 2 期，2015 年 1 月，第 7 条（在线）)

Claude-Pierre Jeannerod，“角膜-哈里森-唐方法的基数无关误差分析”，ACM Transactions on Mathematical Software 第 42 卷第 3 期，2016 年 5 月，第 19 条（在线）)

Kahan 算法需要 4 个浮点运算，其中 2 个是 FMA，而 CHT 算法需要 7 个浮点运算，其中 2 个是 FMA。我构建了下面的测试框架，以探索可能存在哪些其他权衡。我通过实验证实了文献中关于两种算法的相对误差和 Kahan 算法的 ulp 误差的界限。我的实验表明，CHT 算法提供了 1.25 ulp 的较小 ulp 误差范围，但它也以大约 Kahan 算法的两倍率产生错误舍入结果。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

#define TEST_SUM  (0)  // function under test. 0: a*b-c*d; 1: a*b+c*d 
#define USE_CHT   (0)  // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang

/*
  Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
    float w = d * c;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, -w);
    return f + e;
}

/*
  Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b; 
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1); 
    float e2 = fmaf (c, -d, p2);
    float r = p1 - p2; 
    float e = e1 + e2;
    return r + e;
}

/*
  Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
  Harrison and Tang's Method", ACM Transactions on Mathematical Software, 
  Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, w);
    return f - e;
}

/*
  Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b; 
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1); 
    float e2 = fmaf (c, d, -p2);
    float r = p1 + p2; 
    float e = e1 + e2;
    return r + e;
}

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

typedef struct {
    double y;
    double x;
} dbldbl;

dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.x = tail;
    z.y = head;
    return z;
}

dbldbl add_dbldbl (dbldbl a, dbldbl b) {
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y + b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) + (b.y - t2);
    t4 = a.x + b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) + (b.x - t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y - b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) - (b.y + t2);
    t4 = a.x - b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) - (b.x + t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl t, z;
    t.y = a.y * b.y;
    t.x = fma (a.y, b.y, -t.y);
    t.x = fma (a.x, b.x, t.x);
    t.x = fma (a.y, b.x, t.x);
    t.x = fma (a.x, b.y, t.x);
    z.y = t.y + t.x;
    z.x = (t.y - z.y) + t.x;
    return z;
}

double prod_diff_ref (float a, float b, float c, float d)
{
    dbldbl t = sub_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

double prod_sum_ref (float a, float b, float c, float d)
{
    dbldbl t = add_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)__float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
       is < 2^-126, we need to denormalize because of float's limited exponent
       range.
    */
    expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
            0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    } else {
        j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void)
{
    const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
    const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
    const uint64_t N = 1ULL << 38;
    double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
    uint64_t count = 0LL, incorrectly_rounded = 0LL;
    uint32_t ai, bi, ci, di;
    float af, bf, cf, df, resf;

#if TEST_SUM
    printf ("testing a*b+c*d ");
#else
    printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
    printf ("using Cornea-Harrison-Tang algorithm\n");
#else
    printf ("using Kahan algorithm\n");
#endif

    do {
        do {
            ai = KISS;
            af = __uint32_as_float (ai);
        } while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
        do {
            bi = KISS;
            bf = __uint32_as_float (bi);
        } while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
        do {
            ci = KISS;
            cf = __uint32_as_float (ci);
        } while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
        do {
            di = KISS;
            df = __uint32_as_float (di);
        } while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
        count++;
#if TEST_SUM        
#if USE_CHT
        resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
        resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
        ulp = floatUlpErr (resf, ref);
        incorrectly_rounded += ulp > 0.5;
        relerr = fabs ((resf - ref) / ref);
        if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
            maxulp = ulp;
            maxrelerr = relerr;
            printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
                    count, incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
        }
    } while (count <= N);

    return EXIT_SUCCESS;
}

上一个：有人可以解释我 Python Division [重复]

下一个：如何在 C++ 中的数组中存储浮点值

两种乘积之和和差的精确浮点计算

Accurate floating-point computation of the sum and difference of two products

评论