How to run multiple jobs, one per GPU, with SLURM?

Asked by: Vivek  Asked: 10/28/2023  Updated: 10/28/2023  Views: 44

Q:

I apologize if this has been asked or answered before, but even after reading everything I could find, I'm still struggling to get SLURM to do what I want.

Say I have a machine with 4 GPUs. I want to train 4 models in parallel, with each job running on a single GPU. I'm using a job array, and my bash script looks like this:

#!/bin/bash

#SBATCH --job-name=train
#SBATCH --array=0-3
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1

srun python train.py --config=$SLURM_ARRAY_TASK_ID

However, when I run this script, only two GPUs are used at a time:

            7824_1     gpu    train   vivekg  R       0:00      1 node
            7824_2     gpu    train   vivekg  R       0:00      1 node
        7824_[3-4]     gpu    train   vivekg PD       0:00      1 (Resources)

When I run nvidia-smi, I see that only 2 of the 4 GPUs on my machine are in use.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:3B:00.0 Off |                  Off |
| 30%   30C    P2             195W / 300W |  23856MiB / 49140MiB |     92%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:5E:00.0 Off |                  Off |
| 30%   27C    P2             179W / 300W |  42172MiB / 49140MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               Off | 00000000:86:00.0 Off |                  Off |
| 30%   18C    P8              20W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               Off | 00000000:AF:00.0 Off |                  Off |
| 30%   17C    P8              20W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    637518      C   ...kg/miniforge3/envs/train/bin/python    23848MiB |
|    1   N/A  N/A    637520      C   ...kg/miniforge3/envs/train/bin/python    42164MiB |
+---------------------------------------------------------------------------------------+

Running scontrol show job <JOBID> -d doesn't show me anything unusual.

JobId=7825 ArrayJobId=7824 ArrayTaskId=1 JobName=train.sh
   UserId=vivekg(26289) GroupId=vivekg(26289) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:45 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-10-27T13:48:36 EligibleTime=2023-10-27T13:48:37
   AccrueTime=2023-10-27T13:48:37
   StartTime=2023-10-27T13:48:37 EndTime=Unknown Deadline=N/A
   PreemptEligibleTime=2023-10-27T13:48:37 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-10-27T13:48:37 Scheduler=Main
   Partition=A6000 AllocNode:Sid=basil:1508745
   ReqNodeList=chili ExcNodeList=sumac
   NodeList=chili
   BatchHost=chili
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=90000M,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:A6000:1
     Nodes=chili CPU_IDs=0-1 Mem=90000 GRES=gpu:A6000:1(IDX:0)
   MinCPUsNode=1 MinMemoryNode=90000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Power=
   TresPerNode=gres:gpu:1

Any idea what I'm doing wrong? Thanks a lot for your help.

A:

0 votes  Vivek  10/28/2023  #1

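Figured out what I was doing wrong. By default, SLURM was allocating 90 GB of memory to each job. Each GPU only has 48 GB, so one job was taking up two GPUs. Specifying #SBATCH --mem fixed the problem.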
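For anyone landing here later, this is roughly what the fixed script could look like. It is a sketch, not the exact script from the answer: the 32G value is an assumption chosen only for illustration, so pick whatever your training run actually needs. Note that --mem requests host RAM on the node, not GPU memory. The scontrol output in the question shows mem=90000M and GRES=gpu:A6000:1 for each array task, so a plausible reading is that two tasks at 90 GB each exhausted the node's schedulable RAM, leaving the other two pending with reason (Resources). The default itself comes from the cluster configuration (for example DefMemPerCPU or DefMemPerNode in slurm.conf).

```shell
#!/bin/bash

#SBATCH --job-name=train
#SBATCH --array=0-3            # four array tasks, one per model
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1      # one GPU per array task
#SBATCH --mem=32G              # host RAM per task; 32G is an assumed value, adjust to your workload

# Each array task receives its own index and trains one configuration.
srun python train.py --config="$SLURM_ARRAY_TASK_ID"
```

To check how much memory the node offers and how much is already allocated, scontrol show node chili reports it in the RealMemory and AllocMem fields.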