C++ pthreads causing longer wall times than serial run

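Question:

I am currently writing/running a program on Linux that essentially bounces objects around a domain. This has to be done for many objects (~10^10 or more), and I am trying to figure out how to speed it up (it currently takes over 3 hours in serial). For each object, I initialize it, start it moving, and check whether it has moved into a new grid cell of the domain. Computationally it is not very complex: I just do a lot of array lookups, and then I write the distance the object traveled in the previous cell to an array. I am using pthreads to run on more than 50 tasks, and the code runs slower when threaded (over 4 hours).

The code itself is a bit too long to paste here, but these are the important steps: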

#include <pthread.h>
#include <vector>

// global variables
unsigned int Nthreads, Ncells; // these are input to the code
std::vector<std::vector<double>> distVec; // one row per thread, one column per cell

// cellargs is defined elsewhere in the full code; it carries st, ed, thID and a context pointer.

// Trampoline with the signature pthread_create expects; forwards to the member function.
static void * start_moveObject(void *args)
{
        return ((cellargs *)args)->context->moveObject(args);
}

void * moveObject(void * args)
{
        unsigned int startv = ((cellargs *)args)->st;
        unsigned int endv = ((cellargs *)args)->ed;
        unsigned int thID = ((cellargs *)args)->thID;
        unsigned int dim = ((cellargs *)args)->context->dim;
        bool ncell = false;
        double dist = 0.;

        unsigned int currCell, prevCell;

        // process the half-open index range [startv, endv)
        for(unsigned int index = startv; index < endv; index++)
        {
                // do the things here that move the object and calculate the distance (dist) traveled in currCell
                // check if it is in new cell
                if(currCell != prevCell)
                {
                       distVec[thID][prevCell] += dist; // each thread writes only to its own row
                       dist = 0;
                       prevCell = currCell;
                }
        }
        return NULL;
}

int main()
{
            pthread_t * threads = new pthread_t[Nthreads];
            cellargs * args = new cellargs[Nthreads];
            unsigned int st, end, prev;
            distVec.resize(Nthreads);

            prev = 0;
            for(unsigned int ith = 0; ith < Nthreads; ith++)
            {
                        st = prev;
                        end = st + Ncells/Nthreads; // integer division already rounds down
                        if(ith == Nthreads -1)
                                    end = Ncells; // last thread takes the remainder
                        prev = end; // ranges are half-open, so the next one starts at end

                        distVec[ith].resize(Ncells, 0.);

                        args[ith].st = st;
                        args[ith].ed = end;
                        args[ith].thID = ith;
                        // (setting args[ith].context is not shown in this excerpt)

                        if(pthread_create(&threads[ith], NULL, start_moveObject, (void*)&args[ith]) != 0)
                                    return 1; // thread creation failed
            }

            for(unsigned int ith = 0; ith < Nthreads; ith++)
                        pthread_join(threads[ith], NULL);

            delete[] threads;
            delete[] args;
            return 0;
}

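Is there another way to write this code for parallel execution, or, if it cannot be optimized, is there another library that would be better suited for this?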
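As a side note on the work split in main(), here is a minimal sketch of how an index range could be divided into half-open chunks so that every index is covered exactly once. The helper name `split_range` is hypothetical and not part of the original code; it only illustrates the partitioning step and does not address the slowdown itself.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into nthreads contiguous half-open ranges [begin, end).
// The first (n % nthreads) ranges get one extra element, so the ranges
// neither overlap nor leave gaps.
std::vector<std::pair<std::size_t, std::size_t>>
split_range(std::size_t n, std::size_t nthreads)
{
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (nthreads == 0) return chunks;          // nothing to split across

    const std::size_t base  = n / nthreads;    // minimum chunk length
    const std::size_t extra = n % nthreads;    // leftover elements to spread
    std::size_t begin = 0;
    for (std::size_t t = 0; t < nthreads; ++t) {
        const std::size_t len = base + (t < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

Each pair could then be copied into the `st` and `ed` fields of the per-thread argument struct, with the worker loop running while `index < ed`.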

c++ multithreading pthreads

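It might be worthwhile to run performance tests with various values of Nthreads (starting low with Nthreads=1 and moving up to Nthreads=N, where N is the number of cores in your computer, or even a bit higher), and use that to build up some data on how different levels of parallelism affect the program's run time. That data might tell you something about the performance problem.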
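A sweep like the one suggested here could be sketched roughly as follows. This uses std::thread and std::chrono rather than raw pthreads purely to keep the example short; `time_run`, `sweep`, and the `work(tid, nthreads)` callback are hypothetical names standing in for the real per-thread job, not anything from the original program.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

struct SweepResult {
    unsigned nthreads;   // thread count used for this run
    double   seconds;    // measured wall time
};

// Time one run of `work` executed on `nthreads` threads.
// `work(tid, nthreads)` is a stand-in for the real per-thread job.
double time_run(unsigned nthreads,
                const std::function<void(unsigned, unsigned)>& work)
{
    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    pool.reserve(nthreads);
    for (unsigned tid = 0; tid < nthreads; ++tid)
        pool.emplace_back(work, tid, nthreads);
    for (auto& th : pool) th.join();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Measure wall time for 1, 2, 4, ... threads, stopping once the
// next power of two would exceed max_threads.
std::vector<SweepResult> sweep(unsigned max_threads,
                               const std::function<void(unsigned, unsigned)>& work)
{
    std::vector<SweepResult> results;
    for (unsigned n = 1; n <= max_threads; n *= 2)
        results.push_back({n, time_run(n, work)});
    return results;
}
```

If the measured time stops improving (or gets worse) well below the core count, that points toward a shared resource such as memory bandwidth rather than a lack of CPU.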

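0 upvotes Jeremy Friesner 11/17/2023

By the way, if(ith = Nthreads -1) looks like a potential bug... did you intend to write if(ith == Nthreads -1)?

0 upvotes sara 11/18/2023

I am currently running with 96 cores/threads. The only values I both write and read are in the distVec vector. I have other arrays that every thread accesses, but since the threads only read from them, I do not synchronize access to those. In the distVec vector, each thread has its own row. Mutex-locking segments of the array seems like it would cause an even larger slowdown, because I access them so often. Since the access depends on where the object is in space, each thread can write to any element of the vector. @JeremyFriesner thanks for the catch, that was just a typo from writing it up here!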
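With one private row per thread, as described in this comment, no locking is needed while the threads run; the rows only have to be combined once after every thread has been joined. A minimal sketch of that merge step, assuming all rows have the same length (the function name `reduce_rows` is hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sum the per-thread rows column by column into a single vector.
// Intended to be called after all worker threads have been joined,
// so no synchronization is required here.
std::vector<double> reduce_rows(const std::vector<std::vector<double>>& rows)
{
    if (rows.empty()) return {};

    std::vector<double> total(rows.front().size(), 0.0);
    for (const auto& row : rows) {
        // Guard against a short row instead of reading out of bounds.
        const std::size_t n = std::min(row.size(), total.size());
        for (std::size_t c = 0; c < n; ++c)
            total[c] += row[c];
    }
    return total;
}
```

Called as `reduce_rows(distVec)` after the join loop, this would give the total distance per cell across all threads.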

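1 upvote Jeremy Friesner 11/18/2023

You may simply be saturating the bandwidth of your memory subsystem. If bandwidth to RAM is your bottleneck, then adding more threads will not make anything faster (the extra threads will just make everyone spend more time waiting for memory-controller operations to complete).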
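One rough way to judge whether this is plausible is to estimate how much memory the per-thread accumulation rows occupy and compare that with the machine's last-level cache. The sketch below only does that arithmetic; the helper name `distvec_bytes` is hypothetical, and the value of Ncells used in the example is an assumption, since the question does not state it.

```cpp
#include <cstdint>

// Bytes occupied by a distVec-style table: one row of `ncells`
// doubles for each of `nthreads` threads.
std::uint64_t distvec_bytes(std::uint64_t nthreads, std::uint64_t ncells)
{
    return nthreads * ncells * sizeof(double);
}
```

For example, with 96 threads and an assumed 1,000,000 cells, `distvec_bytes(96, 1000000)` is 768,000,000 bytes (about 768 MB). A table of that size is far larger than typical CPU caches, so writes scattered across it would mostly go out to RAM, which is consistent with the bandwidth explanation.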

Answers: no answers yet
