Why does `torch.profiler` capture no CUDA operations when co-running with ncu?

Question:

I have moved my model and inputs to CUDA:

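    import torch
    import torchvision

    # Random image-like input with values in [0, 256), created on the CPU and then moved to the GPU.
    x = torch.randint(low=0, high=256, size=(1, 3, 224, 224), dtype=torch.float32).to(device="cuda:0")
    model = torchvision.models.googlenet().eval()
    inputs = (x,)
    model = model.to(device="cuda:0")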

I use `torch.profiler` to profile the model:

    with torch.profiler.profile(
        on_trace_ready=trace_handler,
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        with_stack=True,
    ) as p:
        with torch.no_grad():
            for _ in range(warm_ups):
                model(*inputs) # don't record time
                p.step()
            for i in range(iterations):
                y = model(*inputs)
                p.step()

Then I use the `ncu` command to profile the file above, `tmp.py`, which contains the `torch.profiler` code, by simply invoking it:

    ncu -o <output> python tmp.py

But when I inspect the trace exported by the profiler, I find that every traced operation is a `cpu_op`, for example:

  {
    "ph": "X", "cat": "cpu_op", "name": "aten::conv2d", "pid": 9832, "tid": 9832,
    "ts": 1700192834976494, "dur": 1091759,
    "args": {
      "External id": 1,"Ev Idx": 0
    }
  },
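To check the claim that every event is a `cpu_op` without reading the JSON by hand, the events can be counted by their `"cat"` field. This is only a minimal sketch: the function name `count_event_categories` and the file name `trace.json` are hypothetical, and it assumes the exported file is Chrome-trace JSON, either a bare list of events or an object with a `"traceEvents"` list.

```python
import json
from collections import Counter


def count_event_categories(trace_path):
    """Count trace events by their "cat" field in a Chrome-trace JSON file."""
    with open(trace_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # A Chrome trace is either a bare list of events or an object
    # that holds the list under the "traceEvents" key.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    # Events without a "cat" field (e.g. metadata) are grouped under "<none>".
    return Counter(e.get("cat", "<none>") for e in events if isinstance(e, dict))


if __name__ == "__main__":
    import sys

    # Hypothetical default file name; pass the real trace path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "trace.json"
    try:
        for category, count in count_event_categories(path).most_common():
            print(f"{category}: {count}")
    except FileNotFoundError:
        print(f"trace file not found: {path}")
```

A trace showing the problem described here would list `cpu_op` but no `kernel` category.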

Strangely, if I just run `tmp.py` on its own, I get the expected kernel events, such as:

  },
  {
    "ph": "X", "cat": "kernel", "name": "void cask_cudnn::computeOffsetsKernel<false, false>(cask_cudnn::ComputeOffsetsParams)", "pid": 0, "tid": 7,
    "ts": 1700193716776597, "dur": 3,
    "args": {
      "External id": 2040,
      "queued": 0, "device": 0, "context": 1,
      "stream": 7, "correlation": 2040,
      "registers per thread": 16,
      "shared memory": 0,
      "blocks per SM": 1.0416666,
      "warps per SM": 8.333333,
      "grid": [50, 1, 1],
      "block": [256, 1, 1],
      "est. achieved occupancy %": 26
    }
  },
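For comparing the standalone run with the `ncu` run, it can help to reduce each trace to the GPU kernels it contains. The sketch below sums the `"dur"` field (microseconds in the Chrome Trace Event Format) per kernel name; the function name `total_kernel_time` and the file name `trace.json` are hypothetical, and only the `"cat"`, `"name"` and `"dur"` fields shown in the excerpts above are assumed.

```python
import json
from collections import defaultdict


def total_kernel_time(trace_path):
    """Sum "dur" per kernel name over events whose "cat" is "kernel"."""
    with open(trace_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Accept both Chrome-trace layouts: a bare list or {"traceEvents": [...]}.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    totals = defaultdict(int)
    for event in events:
        if isinstance(event, dict) and event.get("cat") == "kernel":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Hypothetical default file name; pass the real trace path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "trace.json"
    try:
        totals = total_kernel_time(path)
    except FileNotFoundError:
        totals = None
        print(f"trace file not found: {path}")
    if totals is not None and not totals:
        print("no kernel events found in the trace")
    for name, dur in sorted((totals or {}).items(), key=lambda kv: -kv[1]):
        print(f"{dur:>10} us  {name}")
```

An empty result for the trace produced under `ncu`, next to a non-empty one for the standalone run, reproduces the discrepancy in a form that is easy to share.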

Why?

pytorch cuda profile nsight-compute
