Asked by: Samuel | Asked: 11/17/2023 | Updated: 11/17/2023 | Views: 63
Cuda multi-device asynchronicity with pageable memory
Q:
Does Cuda provide a way to achieve asynchronicity between different devices while using pageable memory on the host? (Note that this is unrelated to the never-ending question of cudaMemcpyAsync being blocking with pageable memory on a single device.)
Here is the context: having access to a node with several Cuda devices, I batch the workload to spread it evenly across all the devices. A batch consists of cudaMemcpyAsync calls from host to device (one per device), followed by kernel launches, and finally cudaMemcpyAsync calls from device to host. These are issued from a for loop over the num_devices devices. But the problem is the following: since the host memory I use is pageable, cudaMemcpyAsync is synchronous with respect to the host, so each iteration of the innermost for loop only starts once the previous one has fully completed, preventing the devices from working at the same time.
I attach a minimal example below:
/*
  Compilation: nvcc main.cu -o main.cuda
  (nvcc version 12)
*/
#include <cuda_runtime.h>
#include <vector>
#include <stdint.h>
#include <cassert>

// trivial kernel for illustration
__global__
void kernel(double* d_u, const uint64_t size)
{
    uint64_t j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < size) {
        d_u[j] *= 2.0;
    }
}

// providing home-made popcnt in case __builtin_popcount is not supported
unsigned int hm_popcnt(int word) {
    unsigned int n = 0;
    while (word) {
        if (word & 1) { ++n; }
        word >>= 1;
    }
    return n;
}

int main() {
    unsigned int n = 30;
    uint64_t dimension = (1ULL) << n;
    unsigned int n0 = 27;
    uint64_t batch_size = (1ULL) << n0;

    int blockSize = 256;
    int numBlocks = (batch_size + blockSize - 1) / blockSize;

    int num_devices;
    cudaGetDeviceCount(&num_devices);
    assert(num_devices != 1); // 1 device, no luck
    assert(__builtin_popcount(num_devices) == 1); // for sake of simplicity
    //assert(hm_popcnt(num_devices)==1);

    cudaStream_t streams[num_devices];
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaStreamCreateWithFlags(&streams[dev_id], cudaStreamNonBlocking);
    }

    std::vector<double> h_v(dimension, 1.0); // pageable memory

    // each device holds its own array of <batch_size> elements
    double* d_v[num_devices];
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaMalloc((void**)&d_v[dev_id], batch_size*sizeof(double));
    }

    uint32_t num_batches = ((1UL) << (n - n0)) / num_devices;
    for (uint32_t i = 0; i < num_batches; ++i) {
        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
            cudaSetDevice(dev_id);
            uint64_t start_index = (i*num_devices + dev_id) * batch_size;
            cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
            kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
            cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
            // h_v is in pageable memory, so the cudaMemcpyAsync is synchronous,
            // preventing the devices from working at the same time
        }
    }

    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaFree(d_v[dev_id]);
        cudaStreamDestroy(streams[dev_id]);
    }
    return 0;
}
Since dimension is very large in production, performing a host-to-host copy from pageable to pinned memory is not an option. Instead, I can work around the problem by swapping the outer and inner for loops and using an OpenMP loop over the num_devices devices (at the cost of involving CPU threads, which is fine):
// requires adding #include "omp.h" and compiling with flags -Xcompiler -fopenmp
#pragma omp parallel for schedule(static, 1) num_threads(num_devices)
for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
    cudaSetDevice(dev_id);
    for (uint32_t i = 0; i < num_batches; ++i) {
        uint64_t start_index = (i*num_devices + dev_id) * batch_size;
        cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
        kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
        cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
    }
}
IMO this is not very elegant, and somehow I feel Cuda should provide something cleaner to achieve multi-device asynchronicity in this simple scenario. Is that the case? If not, are there other solutions?
A:
When using pageable, non-pinned memory, cudaMemcpyAsync blocks the calling thread, just like cudaMemcpy. So yes, in practice you have to issue the cudaMemcpy or cudaMemcpyAsync calls from different threads to achieve this pseudo-asynchronicity between devices. The caveat is that this is not an asynchronous copy at all: the CPU thread and the GPU are blocked for the duration of the copy.
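For illustration, here is a minimal sketch of that thread-per-device idea using plain std::thread instead of OpenMP. It reuses the kernel and variable names from the question's example; run_batches is a made-up wrapper name, and the copies deliberately use plain cudaMemcpy on each device's default stream, since with pageable memory the async variant blocks anyway:

#include <cuda_runtime.h>
#include <stdint.h>
#include <thread>
#include <vector>

__global__ void kernel(double* d_u, const uint64_t size); // from the question

// hypothetical wrapper: one CPU thread per device, each issuing blocking
// copies for its own device, so the devices still overlap with one another
void run_batches(double* h_v, double** d_v, int num_devices,
                 uint32_t num_batches, uint64_t batch_size,
                 int numBlocks, int blockSize)
{
    std::vector<std::thread> workers;
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        workers.emplace_back([=]() {
            cudaSetDevice(dev_id); // the device binding is per host thread
            for (uint32_t i = 0; i < num_batches; ++i) {
                uint64_t start_index = (i*num_devices + dev_id) * batch_size;
                cudaMemcpy(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice);
                kernel<<<numBlocks, blockSize>>>(d_v[dev_id], batch_size);
                // the D2H copy waits for the kernel: same device, same default stream
                cudaMemcpy(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost);
            }
        });
    }
    for (auto& t : workers) { t.join(); }
}

Functionally this is equivalent to the OpenMP workaround in the question; it merely avoids the OpenMP dependency.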
If you want a truly asynchronous copy, you have to use pinned memory - there is no way around it.
You can simply replace the allocation of the pageable memory in your code, std::vector<double> h_v(dimension, 1.0);, with a call to cudaMallocHost()!
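A minimal sketch of that replacement, assuming the rest of the example stays as posted. With pinned memory the cudaMemcpyAsync calls return immediately, so a final synchronization per device is needed before reading the results on the host:

// pinned host allocation replacing: std::vector<double> h_v(dimension, 1.0);
double* h_v = nullptr;
cudaMallocHost((void**)&h_v, dimension*sizeof(double));
for (uint64_t j = 0; j < dimension; ++j) { h_v[j] = 1.0; } // the vector's fill

// ... batched cudaMemcpyAsync / kernel / cudaMemcpyAsync loops as before ...

for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
    cudaSetDevice(dev_id);
    cudaStreamSynchronize(streams[dev_id]); // wait for the in-flight batches
}
cudaFreeHost(h_v); // pinned memory is released with cudaFreeHost, not free/delete

If reallocating is inconvenient, cudaHostRegister can instead pin the vector's existing storage in place (undone with cudaHostUnregister), though registering such a large buffer is not free either.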
If you only want to achieve asynchronicity between the devices, rather than asynchronous copies between host and device, then your solution already works. In that case, cudaMemcpyAsync should be replaced by cudaMemcpy.
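Applied to the OpenMP workaround from the question, that advice comes with one caveat worth spelling out (a consequence of CUDA's stream semantics, not stated above): streams created with cudaStreamNonBlocking do not synchronize with the legacy default stream that cudaMemcpy uses, so the kernel launch should move to the default stream as well. The inner loop then becomes:

// blocking copies on the per-device default stream; the kernel also runs on
// the default stream so the copy -> kernel -> copy sequence stays ordered
cudaMemcpy(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice);
kernel<<<numBlocks, blockSize>>>(d_v[dev_id], batch_size);
cudaMemcpy(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost);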