Asked by: H. Weirauch · Asked: 6/20/2023 · Last edited by: Vladimir F Героям слава, H. Weirauch · Updated: 6/20/2023 · Views: 104
SIGFPE - erroneous arithmetic operation - in MPI_Init() in Fortran
Q:
An MPI-parallel Fortran 2008 code crashes with a floating-point exception when it is compiled with the gfortran -ffpe-trap flag.
Consider the following MWE (minimal working example) Fortran program:
program mwe
use mpi_f08
integer :: ierror
call MPI_Init(ierror)
print*,"MPI_Init returned", ierror
end program
Save it as mwe.F90, together with this CMake configuration:
cmake_minimum_required(VERSION 3.16)
project(mpimwe
DESCRIPTION "Minimal Working Example for Fortran MPI with SIGFPE safeguards"
LANGUAGES Fortran)
find_package(MPI COMPONENTS Fortran REQUIRED)
string(APPEND CMAKE_Fortran_FLAGS " -ffpe-trap=invalid,zero,overflow")
set(exec "mwe")
add_executable(${exec} ${exec}.F90)
target_link_libraries(${exec} ${MPI_Fortran_LIBRARIES})
target_include_directories(${exec} PRIVATE ${MPI_Fortran_MODULE_DIR})
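For reference, a typical way to configure, build, and run this would be roughly the following (an assumption on my part: an out-of-source build directory named build, four ranks, and Open MPI's mpif90 wrapper on the PATH for the CMake-free variant):
$ cmake -S . -B build
$ cmake --build build
$ mpiexec -np 4 ./build/mwe
or, bypassing CMake entirely:
$ mpif90 -ffpe-trap=invalid,zero,overflow mwe.F90 -o mwe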
Note the compiler flag -ffpe-trap. The gfortran man page recommends this flag:
-ffpe-trap=list Specify a list of floating point exception traps to enable. On most systems, if a floating point exception occurs and the trap for that exception is enabled, a SIGFPE signal will be sent and the program being aborted, producing a core file useful for debugging. [...] The first three exceptions (invalid, zero, and overflow) often indicate serious errors, and unless the program has provisions for dealing with these exceptions, enabling traps for these three exceptions is probably a good idea.
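To make concrete what the flag does, here is a small, hypothetical demonstration that is not part of the question's code. Compiled without optimization and with -ffpe-trap=zero, the division below should raise SIGFPE at runtime; without the flag it quietly evaluates to +Infinity.
program trapdemo
  implicit none
  real :: x, y
  x = 1.0
  y = 0.0
  ! with -ffpe-trap=zero this division is expected to trigger SIGFPE;
  ! without the flag it prints Infinity
  print *, x / y
end program trapdemo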
Machine 1 (personal PC): gfortran 10.3.0, Open MPI 4.0.3
Compiling the code works. Running it with mpiexec -np <N> works for N=1..4. Running with N>4, or running without mpiexec at all, does not work and produces the following error:
$ ./mwe
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7ff5f4673d21 in ???
#1 0x7ff5f4672ef5 in ???
#2 0x7ff5f44a408f in ???
at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3 0x7ff5f1f565d3 in ???
#4 0x7ff5f1f0f402 in ???
#5 0x7ff5f1eecf9e in ???
#6 0x7ff5f245c465 in ???
#7 0x7ff5f3f67020 in ???
#8 0x7ff5f3f5a478 in ???
#9 0x7ff5f40e8fcf in ???
#10 0x7ff5f3feae54 in ???
#11 0x7ff5f3e7eef2 in ???
#12 0x7ff5f40212fb in ???
#13 0x7ff5f43af322 in ???
#14 0x7ff5f4353072 in ???
#15 0x7ff5f444aa4b in ???
#16 0x7ff5f4937901 in ???
#17 0x557cd23a41df in ???
#18 0x557cd23a43ce in ???
#19 0x7ff5f4485082 in __libc_start_main
at ../csu/libc-start.c:308
#20 0x557cd23a410d in ???
#21 0xffffffffffffffff in ???
Floating point exception
Machine 2 (HPC cluster): gfortran 12.2.0, Open MPI 4.1.4, Slurm 22.05.6
Compiling the code works. Running it with or without mpiexec works for all N. Submitting it to the Slurm queue reproduces the SIGFPE:
$ ./mwe
MPI_Init returned 0
$ srun ./mwe
srun: job ... queued and waiting for resources
srun: job ... has been allocated resources
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x152893b5451f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1 0x152892295723 in ???
#2 0x15289226da4e in ???
#3 0x1528924340c5 in ???
#4 0x152893315992 in ???
#5 0x1528933013d8 in ???
#6 0x15289387d164 in ???
#7 0x15289395e0e6 in ???
#8 0x152893963165 in ???
#9 0x1528939d63bd in ???
#10 0x1528943f98f5 in ???
#11 0x152894225d6c in ???
#12 0x1528945026c7 in ???
#13 0x15289454aa5c in ???
#14 0x4011cc in ???
#15 0x4013aa in ???
#16 0x152893b3bd8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#17 0x152893b3be3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#18 0x4010f4 in ???
#19 0xffffffffffffffff in ???
srun: error: worker_node: task 0: Floating point exception
In all failing cases, -ffpe-trap=overflow alone is not a problem, but adding -ffpe-trap=invalid or -ffpe-trap=zero triggers the SIGFPE.
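One hypothetical way to narrow this down without trapping at all is to compile without -ffpe-trap and merely query which IEEE exception flags MPI_Init leaves raised. This is only a sketch assuming gfortran's intrinsic ieee_exceptions module; the program name flagcheck is made up for illustration.
program flagcheck
  use mpi_f08
  use, intrinsic :: ieee_exceptions
  implicit none
  integer :: ierror
  logical :: raised
  ! clear all exception flags before MPI starts
  call ieee_set_flag(ieee_all, .false.)
  call MPI_Init(ierror)
  ! report which flags MPI_Init left set
  call ieee_get_flag(ieee_invalid, raised)
  print *, "IEEE_INVALID raised during MPI_Init:", raised
  call ieee_get_flag(ieee_divide_by_zero, raised)
  print *, "IEEE_DIVIDE_BY_ZERO raised during MPI_Init:", raised
  call ieee_get_flag(ieee_overflow, raised)
  print *, "IEEE_OVERFLOW raised during MPI_Init:", raised
  call MPI_Finalize(ierror)
end program flagcheck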
Expected behavior: I want to avoid the very first MPI call, MPI_Init, already tripping the compiler's floating-point exception safeguards. Since I have no control over what happens inside the MPI infrastructure*, the -ffpe-trap flags are effectively useless for MPI-parallel code.
*: This is only my guess; the root cause is either somewhere in Open MPI, or it is several bugs spread across different code bases (Open MPI, Slurm; the role of the compiler and system libraries is unclear to me).
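One conceivable workaround, sketched below under two assumptions (that the exceptions are raised inside the MPI library itself, and that gfortran's IEEE_EXCEPTIONS halting control can override the traps installed by -ffpe-trap at runtime), is to suspend halting only for the duration of MPI_Init and then restore it for the user code. The program name mwe_guarded is made up for illustration; any genuine floating-point error occurring inside MPI_Init would of course go unnoticed with this approach.
program mwe_guarded
  use mpi_f08
  use, intrinsic :: ieee_exceptions
  implicit none
  integer :: ierror
  logical :: halt_invalid, halt_zero, halt_overflow

  ! remember the halting modes installed by -ffpe-trap at program start
  call ieee_get_halting_mode(ieee_invalid, halt_invalid)
  call ieee_get_halting_mode(ieee_divide_by_zero, halt_zero)
  call ieee_get_halting_mode(ieee_overflow, halt_overflow)

  ! suspend halting while the MPI library initializes
  call ieee_set_halting_mode(ieee_all, .false.)
  call MPI_Init(ierror)
  print *, "MPI_Init returned", ierror

  ! restore the original safeguards for the user code
  call ieee_set_halting_mode(ieee_invalid, halt_invalid)
  call ieee_set_halting_mode(ieee_divide_by_zero, halt_zero)
  call ieee_set_halting_mode(ieee_overflow, halt_overflow)

  call MPI_Finalize(ierror)
end program mwe_guarded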
A: No answers yet
Comments
The link command reported by make VERBOSE=1, with site-specific path components anonymized as X1, X2, and X3:
/X1/gcc-12.2.0/gcc-12.2.0-X2/bin/gfortran -ffpe-trap=invalid,zero,overflow CMakeFiles/mwe.dir/mwe.F90.o -o mwe -Wl,-rpath,/X1/gcc-12.2.0/openmpi-4.1.4-X3/lib /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_usempif08.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_usempi_ignore_tkr.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi_mpifh.so /X1/gcc-12.2.0/openmpi-4.1.4-X3/lib/libmpi.so