Differing Floating Point Calculation Results between x86_64 and ARMv8.2-A

Asked by Marty on 9/24/2020 · Last edited by Marty · Updated 9/27/2020 · Viewed 2,107 times

Q:

I have compiled the same Fortran library and code on aarch64 and x86_64. It is a model that runs an algorithm over n-dimensional arrays/matrices. The ARM CPU is an Amazon Graviton2. The AMD and Intel options in AWS produce identical results to each other when the code is compiled and run for x86_64.

I am using gcc / g++ / gfortran / mpich (all version 8.3.0, from the main Debian Buster repositories) with the following flags:

-O2 -ftree-vectorize -funroll-loops -w -ffree-form -ffree-line-length-none -fconvert=big-endian -frecord-marker=4

Everything compiles and runs fine; however, I noticed that the results in the model's output differ slightly. It appears to be a precision or rounding issue, since most values are identical between the outputs. But scattered (seemingly at random) throughout the output are values where the code compiled for one arch appears to round down or truncate while the other arch rounds up.

The output is stored as NetCDF (using NetCDF-Fortran version 4.5.3). The md5sums of the output files are identical across x86_64 CPUs, but differ on aarch64.

Any ideas why this is happening? Or any flags I can use during compilation to ensure I get identical results across architectures?

The values I am looking at have a precision of 5 decimal places, i.e. 123.12345.

Here is a snippet of a diff of the two outputs; you can see that most values are identical, but some appear to be rounded differently (I marked the differing values with **):

  657c657
  <     18.83633, 18.83212, 18.82778, **18.82337**, 18.81886, 18.81425, 18.80956, 
  ---
  >     18.83633, 18.83212, 18.82778, **18.82336**, 18.81886, 18.81425, 18.80956, 
  1151c1151
  <     17.35448, 17.37331, 17.39206, 17.41071, 17.42931, **17.4478**, 17.46622, 
  ---
  >     17.35448, 17.37331, 17.39206, 17.41071, 17.42931, **17.44779**, 17.46622, 
  1711c1711
  <     19.77562, 19.77532, 19.77493, 19.77445, 19.77386, 19.77319, **19.77241**, 
  ---
  >     19.77562, 19.77532, 19.77493, 19.77445, 19.77386, 19.77319, **19.77242**, 
  2130c2130
  <     20.06532, 20.06839, **20.07135**, 20.07423, 20.07702, 20.0797, 20.0823, 
  ---
  >     20.06532, 20.06839, **20.07136**, 20.07423, 20.07702, 20.0797, 20.0823, 
  2140c2140
  <     20.04788, 20.04424, 20.04047, **20.03661**, 20.03268, 20.02863, 20.02448, 
  ---
  >     20.04788, 20.04424, 20.04047, **20.03662**, 20.03268, 20.02863, 20.02448, 
  2600c2600
  <     11.54104, 11.57732, 11.61352, 11.6497, 11.68579, **11.72186**, 11.75784, 
  ---
  >     11.54104, 11.57732, 11.61352, 11.6497, 11.68579, **11.72185**, 11.75784,
floating-point fortran arm precision

Comments

6 votes · chux - Reinstate Monica · 9/24/2020
"a precision of 5 decimal places, i.e. 123.12345": for floating point, this is better described as 8 significant decimal digits of precision, or better yet as 24 binary digits of precision. That is about what one should expect of binary32.
3 votes · Frant · 9/24/2020
@Marty: Floating point is not my area of expertise, so I do not feel I can provide an answer. I will just say that even though x86_64 and ARMv8-A implement the same standard, floating-point results can differ (see this article), or the compilers may behave differently in some architecture-dependent way. You may want to double-check for any architecture-specific rounding/display options.
6 votes · Vladimir F Героям слава · 9/24/2020
This is the expected precision for floating-point numbers (single precision, 32-bit, 4 bytes). The CPUs are different, the assembly is different, the optimizations may differ. Which specific operations differ? What is their source code? How is the code compiled? Which flags?
2 votes · Vladimir F Героям слава · 9/24/2020
Also note that values which look identical may simply differ in further decimal places that you do not print.
2 votes · Frant · 9/24/2020
@Marty: It would be interesting to display the hardware-generated binary values in hexadecimal rather than decimal, and do the comparison again. Not sure how easy that is in Fortran compared to C.

A:

3 votes · Andreas H. · 9/24/2020 · #1

If the code uses only basic arithmetic operations such as +, -, * and sqrt, and the compiler is in IEEE 754 conformance mode, the output should be bit-identical regardless of which CPU is used. This IEEE 754 conformance mode is usually the default.

Otherwise, the problem could be caused by a compiler or CPU bug.

Options such as -ffast-math put the compiler in a non-IEEE-754-conformant mode. It then optimizes the code using mathematical equivalences that are not necessarily numerically equivalent (for example ((a*a)*a)*a -> (a*a)*(a*a)). If that is the case, and the compiler optimizes the ARM code differently from the x86_64 code, that could be an explanation.

Also, if the code uses functions such as sin, cos, exp, atan2 and the like, the output will only be bit-identical if the exact same run-time library is used. This is because these functions are not correctly rounded, and their results typically carry a tiny error (which may be amplified in the calculation and show up in the way you observe). It might also be the case that x86_64 uses special CPU instructions for these functions while ARM uses a software implementation, or vice versa. Note that even when these functions are implemented in the CPU/FPU, they are not correctly rounded either, and very likely different algorithms are used.

TL/DR: check the compiler flags for -ffast-math, or try adding -fno-fast-math at the end of the options.

EDIT: As @Rob mentioned in the comments, another flag that could be added is -ffp-contract=off. In gcc it defaults to 'fast' (independently of -ffast-math), which may generate FMA instructions even when not explicitly requested. This also breaks IEEE 754 conformance.

Comments

1 vote · evets · 9/24/2020
You forgot that the OP is outputting the data, so there are the IO routines to convert binary to decimal. IEEE-754 discusses conversion, but it is unclear whether OP has a strictly conforming IEEE-754 IO library.
1 vote · Ian Bush · 9/24/2020
Also, if the code is threaded or multi-process, something like a reduction can cause this.
3 votes · Rob · 9/24/2020
Some recent work on a project of mine, porting the code to ARM, showed we needed to set -ffp-contract=off to get better agreement between x86 and ARM.
0 votes · Marty · 9/24/2020
@evets the diff output was just to demonstrate that there is a rounding difference between archs; I updated the question to note that the NetCDF binary output itself is indeed different.
6 votes · evets · 9/25/2020
gfortran understands -ffast-math. gfortran can use any option that gcc supports with a few exceptions (e.g., those marked as C/C++ only or Ada only).