Memory pool in Thrust execution policy

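Question:

I'm looking for a way to use a memory pool within Thrust, because I want to limit the number of calls to cudaMalloc. device_vector definitely accepts an allocator, but thrust::sort is not so easy to deal with: it apparently allocates a temporary buffer.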

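According to the answer to "How to use CUDA Thrust execution policy to override Thrust's low-level device memory allocator", it seems possible to hook Thrust into using a special allocator by tweaking the execution policy. That answer is quite old, though, and I can't find any documentation on execution policies that explains exactly how to proceed.

For completeness, there is thrust/examples/cuda/custom_temporary_allocation.cu, but it is not very satisfying because the memory pool it hooks in is a global variable.

I think it is quite likely that the Thrust developers have thought about this and included some mechanism for injecting a custom memory manager into the exec policy; I just can't find it.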

Tags: C++, CUDA, Thrust, allocator, memory-pool

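For more control, I recommend taking a look at the RAPIDS Memory Manager (RMM) (RAPIDS is part of Nvidia and, I think, now the main driver of innovation in Thrust and CUB), or using CUB directly (it is basically Thrust's CUDA backend), i.e. cub::DeviceRadixSort (for primitive types) or cub::DeviceMergeSort (generally applicable).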

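0 votes · brice rebsamen · 7/4/2023

See github.com/ingowald/cudaKDTree/pull/7 for my changes. @paleonix, it would be great if you could give it a quick review (it's a small PR).

0 votes · paleonix · 7/4/2023

Wrong comment thread? Either way, you might find thrust/examples/mr_basic.cu interesting with regard to allocators in Thrust.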

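Answer:

2 votes · Abator Abetor · 7/2/2023 #1

Below is an example allocator for stream-ordered memory allocation. It uses cudaMallocAsync to allocate from the default CUDA memory pool on a specific stream. Together with the par_nosync execution policy, this allows a fully asynchronous thrust::sort.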

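#include <cuda_runtime.h>
#include <thrust/device_malloc_allocator.h>
#include <thrust/device_ptr.h>

#include <new>

// Stream-ordered allocator: memory comes from the default CUDA memory pool
// via cudaMallocAsync/cudaFreeAsync, both enqueued on `stream`.
template <class T>
struct ThrustAllocatorAsync : public thrust::device_malloc_allocator<T> {
    using Base      = thrust::device_malloc_allocator<T>;
    using pointer   = typename Base::pointer;
    using size_type = typename Base::size_type;

    explicit ThrustAllocatorAsync(cudaStream_t stream_) : stream{stream_} {}

    pointer allocate(size_type num) {
        T* result = nullptr;
        // Report failure instead of silently handing back a null pointer.
        if (cudaMallocAsync(&result, sizeof(T) * num, stream) != cudaSuccess) {
            throw std::bad_alloc();
        }
        return thrust::device_pointer_cast(result);
    }

    void deallocate(pointer ptr, size_type /*num*/) {
        // The free is ordered after all work previously enqueued on `stream`.
        cudaFreeAsync(thrust::raw_pointer_cast(ptr), stream);
    }

private:
    cudaStream_t stream;
};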

...

thrust::sort(
   thrust::cuda::par_nosync(ThrustAllocatorAsync<char>(stream)).on(stream),
   data.begin(),
   data.end()
);

As suggested in the comments, the same can be achieved with RMM.

#include <rmm/mr/device/cuda_async_memory_resource.hpp> 
#include <rmm/exec_policy.hpp>

...
// could use any other class derived from rmm::mr::device_memory_resource
rmm::mr::cuda_async_memory_resource mr; 

thrust::sort(
   rmm::exec_policy_nosync(stream, &mr),
   data.begin(),
   data.end()
);
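
On the question's objection that thrust/examples/cuda/custom_temporary_allocation.cu keeps its pool in a global variable: the caching idea itself does not seem to need a global. The following is a minimal, host-only sketch of a size-keyed caching pool as an ordinary object. CachingPool, AllocFn, FreeFn and upstream_calls are hypothetical names introduced here for illustration, and the upstream allocate/free functions are injected so the logic can be compiled and tried without a GPU. It is a sketch under these assumptions, not the mechanism Thrust ships.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>

// Hypothetical caching pool: a freed block is kept and handed out again for
// the next request of exactly the same size, so the upstream allocator is
// only called on a cache miss.
class CachingPool {
public:
    using value_type = char;  // byte-based, like the allocator in the Thrust example
    using AllocFn = std::function<char*(std::size_t)>;
    using FreeFn  = std::function<void(char*)>;

    // In CUDA code the two callbacks would wrap cudaMalloc/cudaFree (assumption).
    CachingPool(AllocFn upstream_alloc, FreeFn upstream_free)
        : alloc_(std::move(upstream_alloc)), free_(std::move(upstream_free)) {}

    // The pool owns its blocks, so copying is disabled.
    CachingPool(const CachingPool&) = delete;
    CachingPool& operator=(const CachingPool&) = delete;

    // Return every block, cached or still live, to the upstream allocator.
    ~CachingPool() {
        for (auto& block : free_blocks_) free_(block.second);
        for (auto& block : live_blocks_) free_(block.first);
    }

    char* allocate(std::ptrdiff_t num_bytes) {
        const auto bytes = static_cast<std::size_t>(num_bytes);
        char* ptr = nullptr;
        auto it = free_blocks_.find(bytes);
        if (it != free_blocks_.end()) {
            // Cache hit: reuse a previously freed block of the same size.
            ptr = it->second;
            free_blocks_.erase(it);
        } else {
            // Cache miss: fall back to the upstream allocator.
            ptr = alloc_(bytes);
            if (ptr == nullptr) throw std::bad_alloc();
            ++upstream_calls_;
        }
        live_blocks_.emplace(ptr, bytes);
        return ptr;
    }

    void deallocate(char* ptr, std::size_t /*num_bytes*/) {
        auto it = live_blocks_.find(ptr);
        assert(it != live_blocks_.end() && "pointer was not allocated by this pool");
        // Keep the block for reuse instead of releasing it upstream.
        free_blocks_.emplace(it->second, ptr);
        live_blocks_.erase(it);
    }

    // Number of times the upstream allocator was actually called.
    std::size_t upstream_calls() const { return upstream_calls_; }

private:
    AllocFn alloc_;
    FreeFn free_;
    std::multimap<std::size_t, char*> free_blocks_;  // size -> cached block
    std::map<char*, std::size_t> live_blocks_;       // block -> size
    std::size_t upstream_calls_ = 0;
};
```

With CUDA-backed callbacks, such an object could presumably be created next to the stream it serves and passed as an lvalue, e.g. thrust::cuda::par(pool).on(stream), in the same way the allocator above is passed to par_nosync. The pool must outlive every algorithm call that uses it, and sharing one pool between streams would need extra synchronization that this sketch does not provide.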
