Multithreaded fft convolution in arrayfire

Question:

I am trying to parallelize an fft convolution in arrayfire across multiple CPU threads:

#include <arrayfire.h>
#include <algorithm>  // std::min
#include <cstdio>     // fprintf
#include <iostream>
#include <vector>
#include <omp.h>

using namespace af;

void printarray(const std::vector<float>& f, size_t N=10)
{
    const size_t bound=std::min(N,f.size());
    for (size_t i=0; i< bound; ++i){
        using namespace std;
        cout<<f[i];
        if (i+1<bound) cout<<", ";
        else cout<<endl;
    }
}
using namespace std;
int main() {
    std::vector<float> vec{2.0,1.0};
    cout<<"vec: "<<endl;
    printarray(vec);
    std::vector<float> kernel(10000000,5.0);
    cout<<"kernel: "<<endl;
    printarray(kernel);
    try {
#pragma omp parallel
        {
#pragma omp master
            {
                cout<<"Threads: "<<omp_get_num_threads()<<endl;
            }
            af::array af_in(vec.size(), vec.data());
            af::array af_kernel(kernel.size(), kernel.data());
            af::array tmp = af::fftConvolve(af_in, af_kernel, AF_CONV_EXPAND);
            std::vector<float> out;
            float *h = tmp.host<float>();
            size_t entries = tmp.bytes() / sizeof(float);
            for (size_t i = 0; i < entries; ++i) {
                out.push_back(h[i]);
            }
            af::freeHost(h);
            int thr_num=omp_get_thread_num();
            cout<<"Thread "<<thr_num<<" finished"<<endl; 
        }
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
        throw;
    }
    return 0;
}

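This minimal example program, example.cpp, can be compiled with g++ example.cpp -lafcpu -fopenmp. Somehow, though, it runs the convolutions only sequentially rather than in parallel. This can be checked, for example, by watching the CPU load while the program runs or by measuring the runtime: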

$ time OMP_NUM_THREADS=1 ./a.out
vec: 
2, 1
kernel: 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5
Threads: 1
Thread 0 finished

real    0m1,745s
user    0m1,654s
sys 0m0,069s
$ time OMP_NUM_THREADS=8 ./a.out
vec: 
2, 1
kernel: 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5
Threads: 8
Thread 2 finished
Thread 5 finished
Thread 6 finished
Thread 1 finished
Thread 0 finished
Thread 3 finished
Thread 7 finished
Thread 4 finished

real    0m11,944s
user    0m14,552s
sys 0m0,544s

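I suppose there must be some locking mechanism inside the function af::fftConvolve that prevents parallel execution, even though I construct separate af::array variables in each thread.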
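One way to probe the locking hypothesis independently of ArrayFire is to compare the wall-clock time of a single call against that of N concurrent calls. The sketch below is only an illustration: it uses std::thread instead of OpenMP, and the names wall_seconds, slowdown, sleep_free and sleep_locked are hypothetical helpers introduced here, not part of any library. A ratio near 1 suggests the calls really overlap; a ratio near the thread count suggests they are serialized. For comparison, the timings above give roughly 11.944 / 1.745 ≈ 6.8 with 8 threads.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: start `n_threads` threads that each call `work` once,
// wait for all of them, and return the elapsed wall-clock time in seconds.
double wall_seconds(const std::function<void()>& work, unsigned n_threads)
{
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    pool.reserve(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back(work);
    }
    for (std::thread& th : pool) {
        th.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: wall time of `n_threads` concurrent calls divided by
// the wall time of a single call.
double slowdown(const std::function<void()>& work, unsigned n_threads)
{
    const double one = wall_seconds(work, 1);
    const double many = wall_seconds(work, n_threads);
    return many / one;
}

// Stand-in workload (assumption): independent work with no shared lock.
void sleep_free()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

// Stand-in workload (assumption): the same work, but guarded by one global
// mutex, which forces the threads to run one after another.
void sleep_locked()
{
    static std::mutex m;
    std::lock_guard<std::mutex> guard(m);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```

With 4 threads, slowdown(sleep_free, 4) should come out close to 1 and slowdown(sleep_locked, 4) close to 4. To apply the same check to the program above, the body of the parallel region could be passed as a lambda in place of the stand-in workloads; a warm-up call beforehand may be needed so that one-time initialization does not skew the single-thread measurement.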

How can I parallelize these af::fftConvolve convolutions on the CPU?

c++ multithreading arrayfire
