SIGFPE - erroneous arithmetic operation - in MPI_Init() in Fortran

Asked by: H. Weirauch  Asked: 6/20/2023  Last edited by: Vladimir F Героям слава  Updated: 6/20/2023  Views: 104

Q:

When compiled with the gfortran flag -ffpe-trap, an MPI-parallel Fortran 2008 code crashes with a floating-point exception.

Consider the following MWE Fortran program:

program mwe
  use mpi_f08
  integer :: ierror
  call MPI_Init(ierror)
  print*,"MPI_Init returned", ierror
end program

It is saved as mwe.F90 and accompanied by this CMake configuration:

cmake_minimum_required(VERSION 3.16)

project(mpimwe
    DESCRIPTION "Minimal Working Example for Fortran MPI with SIGFPE safeguards"
    LANGUAGES Fortran)

find_package(MPI COMPONENTS Fortran REQUIRED)
string(APPEND CMAKE_Fortran_FLAGS " -ffpe-trap=invalid,zero,overflow")

set(exec "mwe")
add_executable(${exec} ${exec}.F90)
target_link_libraries(${exec} ${MPI_Fortran_LIBRARIES})

target_include_directories(${exec} PRIVATE ${MPI_Fortran_MODULE_DIR})

Note the compiler flag -ffpe-trap. The gfortran man page recommends using this flag:

  -ffpe-trap=list
      Specify a list of floating point exception traps to enable.  On most systems, if a
      floating point exception occurs and the trap for that exception is enabled, a SIGFPE
      signal will be sent and the program being aborted, producing a core file useful for
      debugging.
 [...]
      The first three exceptions (invalid, zero, and overflow) often indicate serious errors,
      and unless the program has provisions for dealing with these exceptions, enabling traps
      for these three exceptions is probably a good idea.

Machine 1 (personal PC): gfortran 10.3.0, Open MPI 4.0.3

Compiling the code works. Running it with mpiexec -np <N> works for N=1..4. Running it with N>4, or without mpiexec, does not work and produces the following error:

$ ./mwe

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7ff5f4673d21 in ???
#1  0x7ff5f4672ef5 in ???
#2  0x7ff5f44a408f in ???
    at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7ff5f1f565d3 in ???
#4  0x7ff5f1f0f402 in ???
#5  0x7ff5f1eecf9e in ???
#6  0x7ff5f245c465 in ???
#7  0x7ff5f3f67020 in ???
#8  0x7ff5f3f5a478 in ???
#9  0x7ff5f40e8fcf in ???
#10  0x7ff5f3feae54 in ???
#11  0x7ff5f3e7eef2 in ???
#12  0x7ff5f40212fb in ???
#13  0x7ff5f43af322 in ???
#14  0x7ff5f4353072 in ???
#15  0x7ff5f444aa4b in ???
#16  0x7ff5f4937901 in ???
#17  0x557cd23a41df in ???
#18  0x557cd23a43ce in ???
#19  0x7ff5f4485082 in __libc_start_main
    at ../csu/libc-start.c:308
#20  0x557cd23a410d in ???
#21  0xffffffffffffffff in ???
Floating point exception

Machine 2 (HPC cluster): gfortran 12.2.0, Open MPI 4.1.4, Slurm 22.05.6

Compiling the code works. Running the code with or without mpiexec works for all N. Submitting it to the Slurm queue reproduces the SIGFPE:

$ ./mwe
 MPI_Init returned           0
$ srun ./mwe
srun: job ... queued and waiting for resources
srun: job ... has been allocated resources

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x152893b5451f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x152892295723 in ???
#2  0x15289226da4e in ???
#3  0x1528924340c5 in ???
#4  0x152893315992 in ???
#5  0x1528933013d8 in ???
#6  0x15289387d164 in ???
#7  0x15289395e0e6 in ???
#8  0x152893963165 in ???
#9  0x1528939d63bd in ???
#10  0x1528943f98f5 in ???
#11  0x152894225d6c in ???
#12  0x1528945026c7 in ???
#13  0x15289454aa5c in ???
#14  0x4011cc in ???
#15  0x4013aa in ???
#16  0x152893b3bd8f in __libc_start_call_main
        at ../sysdeps/nptl/libc_start_call_main.h:58
#17  0x152893b3be3f in __libc_start_main_impl
        at ../csu/libc-start.c:392
#18  0x4010f4 in ???
#19  0xffffffffffffffff in ???
srun: error: worker_node: task 0: Floating point exception

In all failing cases, using -ffpe-trap=overflow alone is not a problem, but adding -ffpe-trap=invalid or -ffpe-trap=zero triggers the SIGFPE.
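One way to cross-check which exception flags are actually raised inside MPI_Init is to build without any -ffpe-trap option and query the IEEE flags afterwards. The sketch below is only an illustration and not part of the original post: the program name mwe_flags and the variable names are hypothetical. It assumes a gfortran that ships the intrinsic ieee_exceptions module (GCC 5 or later) and that the exception is raised on the calling thread, which may not hold if the MPI library does the offending arithmetic in a helper thread.

```fortran
program mwe_flags
  ! Hypothetical diagnostic: build WITHOUT -ffpe-trap so nothing halts,
  ! then report which IEEE exception flags are set after MPI_Init.
  use, intrinsic :: ieee_exceptions
  use mpi_f08
  implicit none
  integer :: ierror
  logical :: raised_invalid, raised_zero, raised_overflow

  ! Start from a clean state so only flags raised by MPI_Init are reported.
  call ieee_set_flag(ieee_invalid, .false.)
  call ieee_set_flag(ieee_divide_by_zero, .false.)
  call ieee_set_flag(ieee_overflow, .false.)

  call MPI_Init(ierror)

  ! Query the flags immediately after the call under suspicion.
  call ieee_get_flag(ieee_invalid, raised_invalid)
  call ieee_get_flag(ieee_divide_by_zero, raised_zero)
  call ieee_get_flag(ieee_overflow, raised_overflow)

  print *, "MPI_Init returned", ierror
  print *, "invalid raised:        ", raised_invalid
  print *, "divide_by_zero raised: ", raised_zero
  print *, "overflow raised:       ", raised_overflow

  call MPI_Finalize(ierror)
end program mwe_flags
```

If the observation above carries over, one would expect the invalid or divide_by_zero flag to be reported as set and overflow as clear, but that is a guess until the program has been run on the affected machines.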


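Expected behavior: I would like to avoid that the very first MPI command, MPI_Init, already triggers the compiler's floating-point exception safeguards. Since I have no control over what happens inside the MPI infrastructure*, this makes the -ffpe-trap flags useless for MPI-parallel code.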

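*: This is only my guess. The root cause is either somewhere in Open MPI, or it is a combination of several bugs in different code bases (Open MPI, Slurm; the role of the compiler and the system libraries is unclear to me).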
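A possible workaround, sketched here under assumptions and not verified on the setups described, is to keep -ffpe-trap for the application code but switch halting off only around MPI_Init using the standard ieee_exceptions module. The program name mwe_guarded and the variable names are hypothetical. The sketch assumes that gfortran's ieee_set_halting_mode controls the same traps that -ffpe-trap enables, and it does not cover exceptions raised by later MPI calls.

```fortran
program mwe_guarded
  ! Hypothetical workaround sketch: compile with
  ! -ffpe-trap=invalid,zero,overflow as before, but suspend halting
  ! for "invalid" and "divide by zero" while MPI_Init runs.
  use, intrinsic :: ieee_exceptions
  use mpi_f08
  implicit none
  integer :: ierror
  logical :: halt_invalid, halt_zero

  ! Remember the halting modes that the compiler flag established.
  call ieee_get_halting_mode(ieee_invalid, halt_invalid)
  call ieee_get_halting_mode(ieee_divide_by_zero, halt_zero)

  ! Disable halting for the two exceptions that trigger the SIGFPE.
  call ieee_set_halting_mode(ieee_invalid, .false.)
  call ieee_set_halting_mode(ieee_divide_by_zero, .false.)

  call MPI_Init(ierror)

  ! Clear flags raised inside MPI_Init before halting is re-enabled,
  ! so a stale flag cannot cause a trap later.
  call ieee_set_flag(ieee_invalid, .false.)
  call ieee_set_flag(ieee_divide_by_zero, .false.)

  ! Restore the original halting modes for the application code.
  call ieee_set_halting_mode(ieee_invalid, halt_invalid)
  call ieee_set_halting_mode(ieee_divide_by_zero, halt_zero)

  print *, "MPI_Init returned", ierror
  call MPI_Finalize(ierror)
end program mwe_guarded
```

This keeps the safeguards active for the user's own arithmetic while treating the library call as a region where exceptions are tolerated. It hides the symptom and does not explain why MPI_Init raises the exception in the first place.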

Fortran MPI Gfortran slurm

Comments

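0 upvotes Vladimir F Героям слава 6/20/2023
Is Open MPI compiled for that particular version of GCC (gfortran)? Is mpiexec from the same Open MPI as the one used for compiling (not only the same version number, but also the same compiler)?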
0 upvotes Vladimir F Героям слава 6/20/2023
Also, what is the exact mpif90 (or equivalent) command issued by CMake? Run make verbosely, e.g. make VERBOSE=1; see bytefreaks.net/programming-2/make-building-with-cmake-verbose
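0 upvotes Vladimir F Героям слава 6/20/2023
Or actually just build the test code directly on the command line with mpif90 and avoid CMake completely. It seems to be an unnecessary layer that obscures things.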
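0 upvotes H. Weirauch 6/20/2023
I can say with certainty that the MWE and Open MPI were compiled with the same gfortran on the cluster. For my PC, I have to rely on the OS packager (Ubuntu) not having messed this up.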
0 upvotes H. Weirauch 6/20/2023
With X* being path specifications in the local file system, the call is /X1/gcc-12.2.0/gcc-12.2.0-X2/bin/gfortran -ffpe-trap=invalid,zero,overflow CMakeFiles/mwe.dir/mwe.F90.o -o mwe -Wl,-rpath,/X1/gcc-12.2.0/openmpi-4.1.4-X3/lib /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_usempif08.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_usempi_ignore_tkr.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_mpifh.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi.so

A: No answers yet