torch.distributed.get_world_size() and torch.cuda.device_count() returning different numbers, getting invalid device ordinal error

Asked by: DLS   Asked: 11/4/2023   Last edited by: talonmies, DLS   Updated: 11/4/2023   Views: 47

Q:

I am trying to use tensor parallelism across multiple GPUs in PyTorch, specifically two Nvidia A100s, to spread a model that is too large for one GPU across several GPUs on a Slurm-based HPC system. I was having trouble with my own model, so I put together this small toy example (adapted from here) to illustrate the main problem. All of the work so far has been done in a Jupyter notebook.

I am new to parallelization in general. My understanding is that to distribute a model across multiple GPUs, I should place each shard of the model on a different GPU, and that, assuming no multithreading, the total number of GPUs is the number of processes. However, when I run torch.distributed.get_world_size() I get a different result than when I run torch.cuda.device_count().
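
For reference, here is a minimal sketch of the conventional one-process-per-GPU launch in which the two numbers are expected to agree (this is not part of the notebook above; the nccl backend, rendezvous address, and port are placeholder choices):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU: process `rank` drives device cuda:rank.
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"       # placeholder free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # With one process per visible GPU, both calls report the same number.
    print(rank, dist.get_world_size(), torch.cuda.device_count())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

The toy example below, however, behaves differently: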

from torch.distributed.tensor.parallel import (
    PairwiseParallel,
    parallelize_module,
)

import torch
from torch import nn
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh #, DTensor, Shard, Replicate, distribute_tensor

from torch.testing._internal.common_distributed import (
    spawn_threads_and_init_comms,
)

ITER_TIME = 10  # number of forward/backward iterations in the training loop

class ToyModel(nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super(ToyModel, self).__init__()
        self.dummy_param = nn.Parameter(torch.empty(0))
        self.net1 = nn.Linear(in_channels, hidden_channels)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(hidden_channels, in_channels)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
        
@spawn_threads_and_init_comms
def demo_world_rank_mismatch(world_size):
    rank = dist.get_rank()
    print(f'Rank: {rank}')

    print(f"World size: {dist.get_world_size()}")

    print("Create a sharding plan based on the given world_size", world_size)

    mesh = torch.arange(world_size)
    # create a sharding plan based on the given world_size.
    device_mesh = DeviceMesh(
        "cuda",
        mesh,
    )
    in_dim = 1024
    hidden_dim = 4 * in_dim

    model = ToyModel(in_dim, hidden_dim).to(rank)

    # Create an optimizer for the parallelized module.
    LR = 0.25
    optimizer = torch.optim.SGD(model.parameters(), lr=LR)

    lparallel = True
    if lparallel:
        print("Parallelize the module based on the given Parallel Style", rank)
        model = parallelize_module(model, device_mesh, PairwiseParallel())
    print(model)

    print(f"model of rank {rank} on {model.dummy_param.device}")

    # Perform a num of iterations of forward/backward
    # and optimizations for the sharded module.
    for i in range(ITER_TIME):
        inp = torch.rand(10000, in_dim).to(rank)
        output = model(inp)
        #print(f"FWD Step: iter {i}", rank)
        output.sum().backward()
        #print(f"BWD Step: iter {i}", rank)
        optimizer.step()
        #print(f"Optimization Step: iter {i}", rank)
    
    #print("Training finished", rank)
    print(f'{rank}, max memory alloc: {torch.cuda.max_memory_allocated(device=rank)}')


print(f"Device count: {torch.cuda.device_count()}")
demo_world_rank_mismatch(torch.cuda.device_count())


Running this code produces the following output:

Device count: 2
Rank: 1
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 2
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 3
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 0
World size: 4
Create a sharding plan based on the given world_size 2
Parallelize the module based on the given Parallel Style 0
Parallelize the module based on the given Parallel Style 1

It then prints a huge error message, most of which is repeated across the several threads that exited; the final part is shown here:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The PyTorch version is 2.1.0+cu121. I am not sure where to go from here.

python   pytorch   parallel-processing   slurm

Comments

0 · Prakhar Sharma · 11/4/2023
I see that you are new to parallel training. You should probably look up a tutorial on DataParallel and let it handle everything. You don't need a tensor parallelization framework for this job.
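
(For reference, a minimal sketch of what this suggestion would look like, reusing the ToyModel defined above; note that DataParallel replicates the whole model on every visible GPU, so it assumes the model fits on a single device.)

import torch
from torch import nn

model = ToyModel(1024, 4 * 1024)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model on every visible GPU
model = model.to("cuda:0")

inp = torch.rand(64, 1024, device="cuda:0")
out = model(inp)  # the batch is split across GPUs; outputs are gathered back on cuda:0
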
1 · DLS · 11/7/2023
Prakhar Sharma, that does not solve my problem. Calling nvidia-smi after the cell running the model crashes shows that the only GPU with anything stored on it is the first one (rank 0).
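
(A quick way to see the same thing from inside the notebook is a sketch like the following, which prints the memory PyTorch's caching allocator currently holds on each visible device:)

import torch

for d in range(torch.cuda.device_count()):
    # Current and peak memory held by tensors on each device, as tracked by PyTorch.
    print(d, torch.cuda.memory_allocated(d), torch.cuda.max_memory_allocated(d))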

A: No answers yet