Profiling Torch Distributed with Scalene (torchrun + scalene)

Question:

I am using torch distributed in my code, and I run it from my terminal with the torchrun command. I want to profile it with the Scalene profiler.

Example torchrun command:

torchrun --nnodes 1 --nproc_per_node 6 --standalone main.py --train

Example Scalene profiling command:

scalene --no-browser --reduced-profile --cpu --outfile profile_rnd00_pong_5│ fig_path=./configs/PongTuning/config_rnd00.conf --log_name=PongTuning_rnd0 hr_teslaT4_test00.html --profile-interval 120 main.py --train

I tried combining the two as follows, but it does not work:

scalene --no-browser --reduced-profile --cpu --outfile profile_rnd00_pong_5│ fig_path=./configs/PongTuning/config_rnd00.conf --log_name=PongTuning_rnd0 hr_teslaT4_test00.html --profile-interval 120 torchrun --nnodes 1 --nproc_per_node 6 --standalone main.py --train

Is there a way to use Scalene while still relying on torchrun to run my distributed PyTorch code?

I also tried the following:

python -m scalene --- -m torch.distributed.run --nnodes 1 --nproc_per_node 6 --standalone main.py --train

It raised the following error:

Scalene: Program did not run for long enough to profile.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 136598) of binary: /tmp/scalenelcqus7e6/python
Error in program being profiled:

Note: main.py uses argparse and accepts --train as one of its options.
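
For context on the "exitcode: 2" in the log above: 2 is the status argparse exits with when it rejects a command line. One possibility is therefore that some parser in the chain received arguments it did not expect. The log does not confirm this, and a missing script path or Scalene's own option parsing could produce the same code. Below is a minimal sketch of the argparse behaviour. The parser is hypothetical and only stands in for the real one in main.py, and `--unexpected-flag` is a made-up argument.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the parser in main.py; only --train is known
    # from the question, everything else here is an assumption.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--train", action="store_true")
    return parser


def exit_code_for(argv):
    """Return the exit status argparse would produce for this argument list."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse prints a usage message to stderr and calls sys.exit(2)
        # when it meets an argument it does not recognise.
        return exc.code
    return 0


if __name__ == "__main__":
    print(exit_code_for(["--train"]))
    print(exit_code_for(["--train", "--unexpected-flag"]))
```

Printing `sys.argv` at the top of main.py in each worker would show whether the arguments arriving there are the ones intended.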

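I have also looked at the PyTorch profiler, but it does not seem to help me: it does not profile the parts of my code that are unrelated to PyTorch. I need to profile those other parts for optimization, for example my use of Python arrays and conversions between object types.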

I would really appreciate your help. Thank you.
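
One direction that might be worth trying is to invert the nesting: torchrun launches a small wrapper script, and each worker then starts Scalene on main.py itself. This is an untested sketch, not a confirmed solution. It relies on torchrun setting the LOCAL_RANK environment variable for every worker and on child processes inheriting the distributed environment. The wrapper file name `scalene_worker.py` and the per-rank output name `profile_rank<N>.html` are hypothetical. The Scalene flags are the ones from the commands above. Whether Scalene behaves correctly when run this way under torch.distributed is an assumption.

```python
import os
import subprocess
import sys


def build_scalene_command(local_rank, script="main.py", script_args=("--train",)):
    """Build the per-worker Scalene command line (hypothetical layout)."""
    # One report per rank so the workers do not overwrite each other's output.
    outfile = f"profile_rank{local_rank}.html"  # hypothetical file name
    return [
        sys.executable, "-m", "scalene",
        "--no-browser", "--reduced-profile", "--cpu",
        "--outfile", outfile,
        "--profile-interval", "120",
        script, *script_args,
    ]


def main():
    # torchrun exports LOCAL_RANK for each worker it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    cmd = build_scalene_command(local_rank, script_args=tuple(sys.argv[1:]))
    # The child inherits RANK, WORLD_SIZE, MASTER_ADDR, etc. from this process.
    raise SystemExit(subprocess.call(cmd))


# Only launch when started by torchrun, i.e. when LOCAL_RANK is present.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Under these assumptions it would be started with the same torchrun invocation as before, with the wrapper in place of main.py:

torchrun --nnodes 1 --nproc_per_node 6 --standalone scalene_worker.py --train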

Tags: Python, PyTorch, profiling, distributed computing, Scalene
