Asked by: DLS · Asked: 11/4/2023 · Last edited by: talonmies · Updated: 11/4/2023 · Views: 47
torch.distributed.get_world_size() and torch.cuda.device_count() returning different numbers, getting invalid device ordinal error
Q:
I am trying to use tensor parallelism across multiple GPUs in PyTorch, specifically two NVIDIA A100s, to spread a model that is too large for a single GPU across several GPUs on a Slurm-based HPC system. My real model is having problems, so I put together this small toy example (adapted from here) to illustrate the main issue. All of the work so far has been done in a Jupyter notebook.
I am new to parallelization in general. My understanding is that, to distribute a model across multiple GPUs, I should place each shard of the model on a different GPU, and, assuming no multithreading, the total number of GPUs should equal the number of processes. However, torch.distributed.get_world_size() and torch.cuda.device_count() return different numbers when I run them.
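For context, my mental model of "one process per GPU" is the usual spawn pattern, roughly like this (just a sketch of what I think should happen, with a made-up worker function, not my actual code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # one process per GPU; the rank doubles as the device index
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # this is where I would expect world size and device count to agree
    print(rank, dist.get_world_size(), torch.cuda.device_count())
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()   # 2 on my node
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)

The adapted toy example I am actually running is below: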
from torch.distributed.tensor.parallel import (
    PairwiseParallel,
    parallelize_module,
)
import torch
from torch import nn
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh  #, DTensor, Shard, Replicate, distribute_tensor
from torch.testing._internal.common_distributed import (
    spawn_threads_and_init_comms,
)

ITER_TIME = 10  # not defined in the snippet as posted; assuming a small iteration count

class ToyModel(nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super(ToyModel, self).__init__()
        self.dummy_param = nn.Parameter(torch.empty(0))
        self.net1 = nn.Linear(in_channels, hidden_channels)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(hidden_channels, in_channels)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

@spawn_threads_and_init_comms
def demo_world_rank_mismatch(world_size):
    rank = dist.get_rank()
    print(f'Rank: {rank}')
    print(f"World size: {dist.get_world_size()}")
    print("Create a sharding plan based on the given world_size", world_size)
    mesh = torch.arange(world_size)

    # create a sharding plan based on the given world_size.
    device_mesh = DeviceMesh(
        "cuda",
        mesh,
    )

    in_dim = 1024
    hidden_dim = 4 * in_dim
    model = ToyModel(in_dim, hidden_dim).to(rank)

    # Create an optimizer for the parallelized module.
    LR = 0.25
    optimizer = torch.optim.SGD(model.parameters(), lr=LR)

    lparallel = True
    if lparallel:
        print("Parallelize the module based on the given Parallel Style", rank)
        model = parallelize_module(model, device_mesh, PairwiseParallel())
    print(model)
    print(f"model of rank {rank} on {model.dummy_param.device}")

    # Perform a number of iterations of forward/backward
    # and optimizations for the sharded module.
    for i in range(ITER_TIME):
        inp = torch.rand(10000, in_dim).to(rank)
        output = model(inp)
        #print(f"FWD Step: iter {i}", rank)
        output.sum().backward()
        #print(f"BWD Step: iter {i}", rank)
        optimizer.step()
        #print(f"Optimization Step: iter {i}", rank)

    #print("Training finished", rank)
    print(f'{rank}, max memory alloc: {torch.cuda.max_memory_allocated(device=rank)}')

print(f"Device count: {torch.cuda.device_count()}")
demo_world_rank_mismatch(torch.cuda.device_count())
Running this code produces the following output:
Device count: 2
Rank: 1
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 2
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 3
World size: 4
Create a sharding plan based on the given world_size 2
Rank: 0
World size: 4
Create a sharding plan based on the given world_size 2
Parallelize the module based on the given Parallel Style 0
Parallelize the module based on the given Parallel Style 1
It then printed a huge error message, most of which is repeated across the multiple threads that exited; the tail end of it is here:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
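As a sanity check on the error itself (separate from the distributed setup), my understanding is that "invalid device ordinal" simply means a CUDA device index greater than or equal to torch.cuda.device_count() was used, which would match the extra ranks 2 and 3; something like this reproduces the same message on its own:

import torch

print(torch.cuda.device_count())   # 2 on this node
x = torch.zeros(1).to(0)           # works: cuda:0 exists
y = torch.zeros(1).to(2)           # RuntimeError: CUDA error: invalid device ordinal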
The PyTorch version is 2.1.0+cu121. I am not sure where to go from here.
A: No answers yet
Comments
DataParallel