Asked by: Vivek  Asked: 10/28/2023  Updated: 10/28/2023  Views: 44
How to run multiple jobs, one per GPU, with SLURM?
Q:
Apologies if this has been asked and answered before, but even after reading everything I could find, I'm struggling to get SLURM to do what I want.
Say I have a machine with 4 GPUs. I want to train 4 models in parallel, with each job running on a single GPU. I'm using a job array and have a bash script like the one below. However, when I run this script,
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --array=0-3
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
srun python train.py --config=$SLURM_ARRAY_TASK_ID
only two of the jobs run at a time:
7824_1 gpu train vivekg R 0:00 1 node
7824_2 gpu train vivekg R 0:00 1 node
7824_[3-4] gpu train vivekg PD 0:00 1 (Resources)
When I run nvidia-smi, I see that only 2 of the 4 GPUs on my machine are in use:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3B:00.0 Off | Off |
| 30% 30C P2 195W / 300W | 23856MiB / 49140MiB | 92% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:5E:00.0 Off | Off |
| 30% 27C P2 179W / 300W | 42172MiB / 49140MiB | 98% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:86:00.0 Off | Off |
| 30% 18C P8 20W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:AF:00.0 Off | Off |
| 30% 17C P8 20W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 637518 C ...kg/miniforge3/envs/train/bin/python 23848MiB |
| 1 N/A N/A 637520 C ...kg/miniforge3/envs/train/bin/python 42164MiB |
+---------------------------------------------------------------------------------------+
Running scontrol show job <JOBID> -d doesn't show me anything unusual either:
JobId=7825 ArrayJobId=7824 ArrayTaskId=1 JobName=train.sh
UserId=vivekg(26289) GroupId=vivekg(26289) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:01:45 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-10-27T13:48:36 EligibleTime=2023-10-27T13:48:37
AccrueTime=2023-10-27T13:48:37
StartTime=2023-10-27T13:48:37 EndTime=Unknown Deadline=N/A
PreemptEligibleTime=2023-10-27T13:48:37 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-10-27T13:48:37 Scheduler=Main
Partition=A6000 AllocNode:Sid=basil:1508745
ReqNodeList=chili ExcNodeList=sumac
NodeList=chili
BatchHost=chili
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=90000M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=gpu:A6000:1
Nodes=chili CPU_IDs=0-1 Mem=90000 GRES=gpu:A6000:1(IDX:0)
MinCPUsNode=1 MinMemoryNode=90000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Power=
TresPerNode=gres:gpu:1
Any idea what I'm doing wrong? Any help is greatly appreciated.
A:
0 votes
Vivek
10/28/2023
#1
Figured out what I was doing wrong. By default, SLURM allocates 90GB of memory to each job. Each GPU only has 48GB, so one job was taking up two GPUs' worth of resources. Specifying #SBATCH --mem fixed the issue.
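For anyone hitting the same thing, here is a minimal sketch of the corrected submission script. The 40G figure is only an assumption for illustration; use whatever your training run actually needs, as long as four jobs' worth of memory fits on the node so that one array task can be scheduled per GPU.
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --array=0-3
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
# Explicit per-job memory request. Without it, each array task inherited the
# 90GB default, so only two tasks' worth of memory fit on the node at once.
#SBATCH --mem=40G
srun python train.py --config=$SLURM_ARRAY_TASK_ID
You can check the cluster's defaults with scontrol show config | grep -i mem, and --mem-per-gpu is an alternative if you'd rather tie the memory request to the number of GPUs than to the job.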