关于使用 SSE2 内部函数进一步优化该卡方函数的建议-解网

问：

我正在尝试将 c 代码中的以下卡方函数转换为 SSE2 内部函数

我得到了这两个函数的正确输出。我使用我生成的一些随机 4KB 数据测量了两个函数运行所需的时间，平均而言，我看到大约 70-90 毫秒的性能改进

我只是想知道我是否缺少任何进一步的优化，可以进一步提高性能。任何关于这方面的线索都会有所帮助

普通 C 代码：

int observed[256] = {0};
        double chiSquare = 0.0;
        double expected = (double)size / 256; // Constant expected value
        // Calculate frequency of each byte value
        for (int i = 0; i < size; i++) {
            observed[data[i]]++;
        }
        // Calculate the chi-square statistic
        for (int i = 0; i < 256; i++) {
                double diff = observed[i] - expected;
                chiSquare += (diff * diff) / expected;
        }
        return chiSquare;

SSE2 内部函数：

int observed[256] = {0};
        const double expected = (double)size / 256;  // Make 'expected' a constant
        double chiSquare = 0.0;
        // Process data in 16-byte (128-bit) chunks
        for (int i = 0; i < size; i += 16) {
            __m128i dataChunk = _mm_loadu_si128((__m128i*)(data + i));
            // Unpack 8-bit values into 16-bit values for counting
            __m128i dataUnpacked = _mm_unpacklo_epi8(dataChunk, _mm_setzero_si128());
            // Extract and process 8 values in parallel
            for (int j = 0; j <= 1; j++) {
                uint16_t values[8];
                _mm_storeu_si128((__m128i*)values, dataUnpacked);
                for (int k = 0; k < 8; k++) {
                    observed[values[k]]++;
                }
                dataUnpacked = _mm_unpackhi_epi8(dataChunk, _mm_setzero_si128());
            }
        }
        // Calculate the chi-square statistic using SSE2 intrinsics
        __m128d sum = _mm_setzero_pd();
        for (int i = 0; i < 256; i += 2) {
            __m128d observedVec = _mm_set_pd(observed[i + 1], observed[i]);
            __m128d diff = _mm_sub_pd(observedVec, _mm_set1_pd(expected));
            __m128d squaredDiff = _mm_mul_pd(diff, diff);
            __m128d result = _mm_div_pd(squaredDiff, _mm_set1_pd(expected));
            sum = _mm_add_pd(sum, result);
        }
        // Sum up the results in the sum
        double sumArray[2];
        _mm_storeu_pd(sumArray, sum);
        for (int i = 0; i < 2; i++) {
            chiSquare += sumArray[i];
        }
        return chiSquare;
        }**

C 优化 SSE 内部函数 SSE2

0赞 Peter Cordes 9/4/2023

最明显的是，这是一个常数，所以你可以在循环之后除以它。使用浮点数学，这甚至不会导致任何精度损失，因为更大的幅度意味着更高的指数。另外，还不够精确吗？这样一来，您就可以在每条指令中执行两倍的元素，并使打包的转换更加高效。（你现在正在做的事情，用，可能编译效率低下;至少使用和）。expectedfloat__m128i__m128_mm_set_pd_mm_loadu_si64_mm_cvtepi32_pd

0赞 Peter Cordes 9/4/2023

此外，您可以在转换前用整数进行减法，从而减轻 FP 数学执行端口的压力，这应该是瓶颈。或者，如果您展开以隐藏 FP 添加延迟（为什么 mulss 在 Haswell 上只需要 3 个周期，这与 Agner 的指令表不同？（使用多个累加器展开 FP 循环））。（当然，在将分界线从循环中吸收出来之后，吞吐量瓶颈可能与 FP 添加延迟瓶颈一样大或更大。_mm_sub_epi32

0赞 tstanisl 9/4/2023

除法在现代处理器上很慢，请考虑用乘以代替。_mm_div_pd1.0 / expected

1赞 chtz 9/4/2023

除了避免在第二个循环中除法外，最大的问题是计算直方图，这对于 SIMD 来说并非易事。一些相关问题：stackoverflow.com/questions/30970060/... stackoverflow.com/questions/12985949/......

0赞 Sanku 9/4/2023

@PeterCordes 首先感谢您的建议。在我尝试了您的所有建议后，我会更新我的帖子。

答：

1赞 Simon Goater 9/4/2023 #1

在我的 Westmere i5 笔记本电脑上，您的 SSE2 版本的函数基准测试速度比标量函数慢（~25%）。我在我的机器上对标量函数的性能进行了轻微的改进（~20% 的 4KB 数据）。此外，SSE2 函数不适用于所有“大小”值，我相信您已经知道了。无论如何，我的函数如下。

double getchisquared(int size, uint8_t *data) {
        double diff, chiSquare = 0.0;
        double expected = (double)size / 256; // Constant expected value
        int i, iterations = (size >> 2) << 2;
        // Calculate frequency of each byte value
        for (i = 0; i < iterations;) {
            observed[data[i++]]++;
            observed[data[i++]]++;
            observed[data[i++]]++;
            observed[data[i++]]++;
        }
        for (i = iterations; i < size; i++) {
            observed[data[i]]++;
        }
        // Calculate the chi-square statistic        
        for (i = 0; i < 256; i++) {
                diff = observed[i] - expected;
                chiSquare += (diff * diff) ;
        }
        return chiSquare / expected;        
}

我认为 SSE2 并没有像 chtz 指出的那样为直方图阶段的优化提供太大的希望。你可能对 AVX2 有更多的运气，但我还没有调查过。

0赞 Peter Cordes 9/5/2023

您确实需要 AVX-512 散射器来实现高效的直方图;AVX2 只有 gathers。即便如此，您仍然需要冲突检测，或者需要 8 或 16 个不同的计数数组，以便每个 SIMD 元素可以分散到不同的数组中。由于我们可以使用 SIMD 有效地对计数数组求和，并且每个计数数组只有 256 * 4 字节 = 1KiB，因此对于中型到大型输入来说，这是值得的。

0赞 Peter Cordes 9/5/2023

即使使用标量直方图，如果在较短的值范围内大量运行相同值，则展开 2 或 4 个计数数组也应该是胜利，从而创建存储转发延迟链太长，无法隐藏无序 exec。（展开 4 时，每个增量可能为 3 到 4 个 uops，Ice Lake 上的 ROB 是一个窗口，可能是 352/4 = 88 个增量，调度程序大小为 160 / 4 = 40 个尚未执行的冲突增量。Skyake 的 ROB 大小“仅”为 97 uops。此外，它们可能不是完全统一的调度器，但我们有一个加载/存储/ALU 组合。E 核更小data[i]i

0赞 Peter Cordes 9/5/2023

如果我们知道总数只有 4096，则任何计数都不能溢出，因此我们可以使用更紧凑的直方图来减少展开的缓存占用，并在 SIMD 对直方图求和时，每个向量获得两倍的元素，然后扩展到 32 位以转换为浮点或双精度。（只要我们不想使用 AVX-512 聚集/分散，它仅适用于 32 位或 64 位块。sizeuint16_t

0赞 Simon Goater 9/5/2023

我忘记了 AVX2 没有散射。感谢您的澄清。

0赞 Peter Cordes 9/5/2023

恕我直言，您更改为仅在函数顶部使用 C89 样式的变量声明会使其可读性降低。只需要存在于第二个循环中，并且应该是每个循环的私有。这些变量甚至不应该在使用它们的循环之前声明，这使得它们有些令人惊讶的类型在现场更加明显。double diffidouble

1赞 Peter Cordes 9/5/2023 #2

对于直方图部分：

使用 2 或 4 个计数数组展开直方图，以避免在 40 到 80 个值（基于 ROB（重新排序缓冲区）和调度程序容量）内重复运行一个值时出现存储转发延迟瓶颈。如果小于，则使用元素，因此没有计数箱可以溢出，以便更密集地使用 L1d 缓存。（在 x86 上，16 位内存增量不慢于 32 位）。SIMD 非常适合之后对 2 或 4 个数组进行垂直求和，而 4 x 1024 字节的数组足够少，混叠和冲突未命中不会成为问题，因此您可以一次性完成。（它最初将更多的内存归零并求和，并且会弄脏更多的缓存，因此只有当它通过对典型数据的真正加速来收回成本时，才值得这样做。data[i]iuint16_tsizeUINT16_MAX/unroll
SIMD 对实际的直方图部分没有帮助，除非您有用于散射和聚集的 AVX-512。（最好是 8 或 16 个计数数组，以避免冲突检测的需要，如果您有足够大的输入来摊销额外的归零和求和）。最近为修复 Intel CPU 中的 Downfall 漏洞而进行的微码更新极大地损害了 gather 性能，并且 Zen 4 上的 gather/scatter 性能一开始就不是很好。（散射速度比 AMD 和 Intel 上的聚集慢，https://uops.info/ 在 AVX-512 中搜索。AMD在每一个方面都比英特尔慢，至少使用Downfall微码更新之前的数字，并且IDK收集和分散可以相互重叠。vpscatterdd

你可能会用让它变慢，除非它可能展开以使用 SSE2 ，但标量移位会更好。_mm_unpacklo_epi8pextrw

对于处理计数的循环：

将 div 从循环中下沉; .而且由于 ur 始终是 2 的幂，这甚至不应该改变 FP 舍入误差，因为无论哪种方式，尾数都是一样的！区别在于你离溢流和下溢有多近。（很远，因为你要对通常很小的整数的非大量平方求和。x/e + y/e + z/e == (x+y+z)/eexpected

如果您确实需要多次除以已知为 2 的幂的数字，请执行。这与二进制浮点数的 2 幂完全相同。（如果内联后编译时常数确实为 2 的幂，编译器将执行此操作：https://godbolt.org/z/9TGbfGne7mult = 1.0f/expectedexpected)
优于每个向量累积 2 倍元素数，以及更高效的打包转换（更少的洗牌）。（使用 or ，不要使用两个单独的标量双精度！但是，只有 24 位尾数精度，对 16 位整数进行平方最多可以产生 32 位有效位。但在大多数情况下，我认为您的整数会更小，最多 12 位，因为 size=4K。floatdoublechiSquare_mm_cvtepi32_pd_ps_mm_set_pdfloat

或者用浮点向量展开几次迭代，然后在向量 （/）的底部求和为 2 个浮点数，转换为双精度（），然后添加到累加器中。某种形式的展开可以很好地隐藏 FP 延迟，否则这将是第 2 个循环的瓶颈，一旦您解决了循环内部的明显问题。请参阅处理器的延迟边界和吞吐量边界，了解必须按顺序发生的操作和为什么 mulss 在 Haswell 上只需要 3 个周期，这与 Agner 的指令表不同？（使用多个累加器展开 FP 循环）movhlpsaddps_mm_cvtps_pd__m128ddiv

第二个循环的计数为 256，非常适合展开需要清理代码的操作。
用整数 SIMD 做第一个子，如果是 2 的幂>= 256，那么是一个整数。这减轻了 FP 数学执行端口的一些压力，如果不是延迟，这应该是瓶颈。sizeexpected

您甚至应该能够使用 _mm_madd_epi16 （pmaddwd）对 16 位有符号整数的平方和求和，从而非常有效地生成 32 位整数的向量。

您实际上可以将整个平方和作为整数，最大尺寸为 65664（略高于 64K 个元素），而不会溢出uint32_t。更糟糕的情况（最大总和）是当所有计数都在一个桶中时，因为这会使一个大数平方，而其他数字也是最负的。例如，我们有一个桶加上 255 个其他桶，每个桶的乘以 255 是0xff00。大小的总0xff0000=4096。这实际上只有 24 位，所以我们有足够的空间。((4096-4096/256) ^ 2) = 0xfe0100(0-4096/256)^2 = 0x100((n-n/256) ^ 2) + 255*(0-n/256)^2 < 2^32

pmaddwd是一个有符号的乘法，所以我们需要它适合最坏的情况。这确实允许.(size - size/256) < 32768int16_tsize == 32K

SSE2 没有高效的 32 位整数乘法，因此实际上最好转换为浮点数，然后将大小> 32K 加倍。，，是一个加宽的 32x32 => 64 位乘法，它只从输入中获取偶数元素，即 64 位块的下半部分。实际上，这可能对更广泛的计数很有用，与 or （）结合使用，以喂养、.pmuldq_mm_mul_epi32pshufdpsrldq_mm_bsrli_si128(v, 4)paddq_mm_add_epi64

即使你有 SSE4.1 （），它在较新的 Intel CPU 上以 2 uops 的形式运行，而 FP mul 的运行速度为 1 uop，而 FP mul 每 32 位块只有 24 位尾数。它在 Zen 3 和 4 上，在 Sandy/Ivy bridge 上，以及更早的英特尔上都很快。如果您有适用于较新 CPU 的其他版本，则可以考虑仅在没有 AVX2+FMA 的 CPU 上运行的版本中使用 SSE4.1。pmulld_mm_mullo_epi32pmulld

将求和为整数可避免精度有限的问题。float

对于非微小的输入，直方图将是主要成本。如果没有用于分散+聚集的 AVX-512（AVX2 只有聚集），则与 SIMD 没有任何用处。您需要整数寄存器中的每个元素本身，以便在寻址模式下使用。该循环应该只使用标量运算;你正在通过手动解压来搬起石头砸自己的脚，用和 .（除非编译器可能优化您的存储/重新加载并使用 8x ，这不是最有效的方法，但在某些 CPU 上可以通过降低 AGU 端口压力来提供帮助。uint16_t_mm_unpacklo_epi8_mm_unpackhi_epi8pextrw

如果您仍然需要 ALU 指令将它们切成单独的元素以用于寻址模式，那么执行较少的较宽负载并不一定是好的。L1d 缓存速度很快，非常擅长处理来自同一行的多个小负载。

将 8 位零扩展到 64 位是免费发生的，作为加载到整数寄存器的一部分，其成本与相同，现代 x86 CPU 可以从 L1d 缓存（和 1 或 2 个存储）中每个时钟执行 2 或 3 次加载（https://agner.org/optimize/ 和类似的东西 https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/），因此有足够的吞吐量来对每个字节执行整数加载，而不会从内存增量工作中窃取后端周期。像 or 这样的东西，每个是 2 个 uops。movzx eax, byte [rdi]mov eax, [rdi]pextrbpextrw

但是 AGU（地址生成单元）吞吐量可能是一个问题，尤其是在 Haswell / Skylake 上内存增量的索引寻址模式下，因为端口 7 上的简单 store-AGU 无法处理索引寻址模式。Sandy/Ivy Bridge 根本没有端口 7。（Ice Lake 解决了这个问题，将 2 个专用存储 AGU 与 2 个加载端口分开。为了解决这个瓶颈，最好的办法是标量代码，比如加载 4x 并将其拆分为 / / 等。这会将一些解包工作转移到 ALU 指令上，但会降低前端吞吐量。尝试让编译器从 C 中制作这样的 asm 是很不方便的，尤其是来自 AH 的部分可能不会发生;相反，您可能会获得更多的班次，这些班次占用了如此多的前端带宽，以至于您比负载端口上的瓶颈更糟糕。mov eax, [rdi]uint8_tmovzx edx, almovzx ecx, ahshr eax, 16movzx

我比较了使用 https://uica.uops.info/。编译器生成的直方图的内部循环由 4 展开（复制自 Simon Goater 的答案）如下所示：

.L3:
   movzx   ecx, BYTE PTR [rax]
   add     rax, 4
   add     WORD PTR [rsp-120+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-3]
   add     WORD PTR [rsp+392+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-2]
   add     WORD PTR [rsp-120+rcx*2], 1
   movzx   ecx, BYTE PTR [rax-1]
   add     WORD PTR [rsp+392+rcx*2], 1
   cmp     rdi, rax
   jne     .L3

而不是手动将其编辑为 ASM，后者执行一个 4 字节加载并使用 shift 和 movzx 解包：

.L3:
   mov     ecx, [rax]       ##### 4-byte load
   add     rax, 4
   movzx   edx, cl      # mov-elimination avoids a back-end uop
   add     WORD PTR [rsp-120+rdx*2], 1
   movzx   edx, ch      # has extra latency for reading CH on Skylake and later, but throughput is ok
   add     WORD PTR [rsp+392+rdx*2], 1
   shr     ecx, 16
   movzx   edx, cl
   add     WORD PTR [rsp-120+rdx*2], 1
   movzx   edx, ch
   #shr     ecx, 8    # If reading high-8 registers has any weird effects, minimize it.
   add     WORD PTR [rsp+392+rdx*2], 1
   cmp     rdi, rax
   jne     .L3

假设没有缓存未命中或因存储转发延迟而停滞，则 https://uica.uops.info/ 的分析如下所示：

	多个 MOVZX 负载	使用 movzx reg、reg 和 shifts 进行 ALU 开箱
常春藤桥	7.06 次循环 / iter	7.03 次循环 / iter
天湖	5.97 c / i（端口 6 各 2,3 个 UOPS）	5.25 C/ITER（几乎是前端瓶颈）
火箭湖	4.00 C / I（端口 4 各 2,3 个 UOPS）	4.50°C/ITER

如果您正在调整非常旧的 CPU，例如 Nehalem 和更早的 CPU（1/时钟负载），那么从中可以获得更多收益。优化 SIMD 直方图计算报告称，与使用 SSE4.1 相比，使用 C 环路的速度提高了 1.5 倍。他们没有说他们在 2015 年使用什么 CPU，但这在 Ross 的完全展开版本中是合理的。是 2 个 UOPS，所以标量仍然更好。（对于 Intel CPU，理想的 asm 应该使用，而不是 inc 作为内存操作数，这样它就可以为前端微融合成更少的 uops。编译器应该已经知道这一点，或者至少 GCC 知道，clang 不知道;我之所以提到它，只是因为链接的问答中的代码。pextrbpextrbadd dword [mem], 1

手动矢量化版本（未经测试）

我利用 GNU C 写了几次，而不是在它变得杂乱无章的时候。如果希望可移植到 MSVC，请更改它。vec + vec_mm_add_ps

#include <immintrin.h>
#include <stdalign.h>

double getchisquared_manual_sse2(int size, uint8_t *data)
{
    // assert(size > 0);

    // TODO: if size >= 65536, use uint32_t counts instead, unless we know they're well enough distributed
    // also, unrolling over 2 arrays means the max count in any one of them is half the size, but then summing needs to widen first.
    _Alignas(16) uint16_t observed[2][256] = {{0}};

    // Calculate frequency of each byte value
    //const int iterations = size & -4;   // size is apparently always a power of 2?
    const int iterations = size;
    for (int i = 0; i < iterations;) {
        observed[0][data[i++]]++;    // unroll over two arrays to better hide store-forwarding latency
        observed[1][data[i++]]++;    // when there's a region with a lot of the same data[i]
        observed[0][data[i++]]++;
        observed[1][data[i++]]++;
    }
    // check (size & (size-1)) == 0  if power of 2 size isn't guaranteed,
    // Which would guarantee size%256 == 0 for size>=256
#if 0
    // not needed for power-of-2 or multiple-of-256 sizes
    for (int i = iterations; i < size; i++) {
        observed[0][data[i]]++;
    }
#endif

    // Calculate the chi-square statistic        
    if (size >= 256 && size <= 32768) {    // 16-bit integer sum special case
        // size>=256 (or non-zero and size%256==0) gives a whole integer expected value
        //   one range-check is slightly more efficient than size%256==0 && size <= 32768
        //   also this lets the compiler know it's signed-positive, allowing /256 to just be a shift.
        // sum of squares fit in a uint32_t result even in the worst case of all counts in one bucket:
        //  ((n-n/256) ^ 2) + 255*(0-n/256)^2 fits in u32 even up to n=64K + 225
        //  and max diff = size - size/256 fits in *signed* int16_t size <= 32768
        // size = 32768+256 would give a max diff of 0x807f, so 32768 is the top

        int expected = size / 256;       // a whole number that can be an integer
        __m128i vchisquare = _mm_setzero_si128();
        //double expected = (double)size / 256.0f; // Constant expected value
        for (int i = 0; i < 256; i += 8) {
            __m128i obs0 = _mm_load_si128((const __m128i*)&observed[0][i]);
            __m128i obs1 = _mm_load_si128((const __m128i*)&observed[1][i]);
            __m128i diff = _mm_add_epi16(obs0, obs1);
            diff = _mm_sub_epi16(diff, _mm_set1_epi16(expected));  // with AVX, prefer -expected + obs0 + obs1 to encourage a memory-source vpaddw instead of a separate vmovdqa.  Although it will un-laminate on Intel unless the compiler uses an indexed addressing mode
            vchisquare = _mm_add_epi32(vchisquare, _mm_madd_epi16(diff,diff));
        }
        __m128i high = _mm_shuffle_epi32(vchisquare, _MM_SHUFFLE(1,0, 3,2));  // high half within vector
        vchisquare = _mm_add_epi32(vchisquare, high);
                high = _mm_shuffle_epi32(vchisquare, _MM_SHUFFLE(2,3, 0,1));  // high half within pairs
        vchisquare = _mm_add_epi32(vchisquare, high);
        double chiSquare = _mm_cvtsd_f64( _mm_cvtepi32_pd(vchisquare) );  // low scalar of packed i32 to double conversion
        // signed conversion is safe with size <= 32768
        return chiSquare / (double)expected;        
    } else {
        // 16-bit integer across count arrays, convert to i32 then f32
        // sum 8 vectors of f32 (from 4 vecs of u16) down to one vector; convert to f64 and accumulate
        __m128d vchisquare = _mm_setzero_pd();
        const __m128 vexpected = _mm_set1_ps(size / 256.0f);
        for (int i = 0; i < 256; ) {
            //const int unroll = 4;  // C doesn't treat this as fully constant, unlike C++
            #define unroll 1
            __m128 diffs[unroll+1] = {};
            for (int j = 0 ; j<unroll; j++){
                __m128i obs0 = _mm_load_si128((const __m128i*)&observed[0][i + 8*j]);
                __m128i obs1 = _mm_load_si128((const __m128i*)&observed[1][i + 8*j]);
                __m128i obs = _mm_add_epi16(obs0, obs1);
                // if expected is still an integer, but we can't use pmaddwd because size is too large, like 64K
                // then diff = _mm_sub_epi16(obs, _mm_set1_epi16(expected));  before unpacking.
                //  Or epi32 for even larger size that requires 32-bit count buckets.
                __m128 obs_lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(obs, _mm_setzero_si128()));
                __m128 obs_hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(obs, _mm_setzero_si128()));
                __m128 diff_lo = obs_lo - vexpected;
                __m128 diff_hi = obs_hi - vexpected;
                diffs[j] = _mm_add_ps(_mm_mul_ps(diff_lo,diff_lo), _mm_mul_ps(diff_hi,diff_hi));
            }
            i += 8*unroll;
            //diffs[0] += diffs[2];
            //diffs[1] += diffs[3];
            //diffs[0] += diffs[1];
            //diffs[1] = _mm_movehl_ps(diffs[1], diffs[0]);  // extract high 2 elements, avoiding an extra movaps
            diffs[1] = _mm_movehl_ps(diffs[0], diffs[0]);
            diffs[0] += diffs[1];
            vchisquare = _mm_add_pd(vchisquare, _mm_cvtps_pd(diffs[0]) );  // double accumulator
        }
        double chiSquare = _mm_cvtsd_f64(vchisquare) + _mm_cvtsd_f64(_mm_unpackhi_pd(vchisquare,vchisquare)); // ideally movhlps into another vector
        return chiSquare * (256.0/size);  // compilers optimize x / (size/256.0) to this anyway.
    }
}

在 Godbolt 上使用 GCC 和 clang 以及尝试自动矢量化为相同代码的版本来查看它。
函数的纯整数求和部分使用 GCC 根据需要进行编译：

.L16:
  movdqa  xmm0, XMMWORD PTR [rdi+512]   # second array of counts
  paddw   xmm0, XMMWORD PTR [rdi]       # first array of counts
  add     rdi, 16
  psubw   xmm0, xmm2       # subtract expected
  pmaddwd xmm0, xmm0       # dot-product with itself
  paddd   xmm0, xmm1       # accumulate 32-bit pairs
  movdqa  xmm1, xmm0       # stupid compiler, this is useless
  cmp     rax, rdi
  jne     .L16
# then hsum xmm1 and convert that scalar to double

uiCA 预测 Ivy Bridge、Skylake 和 Rocket Lake（与 Ice Lake 相同）都将以每 2 个周期大约 1 次迭代（8 个元素）的速度运行，令人惊讶的是，在 RKL 上速度要慢一些。因此，我们应该期望只有 64 个循环来处理所有 256 个桶。在 4GHz CPU 上，这是 16 纳秒。

更大的 FP 版本仍然只需要处理 256 个计数桶，所以我们不应该太疯狂地展开它。从 8 的向量开始，解包到 2 个向量，这提供了大量的指令级并行性，以避免的 4 周期延迟出现瓶颈。（我本来打算在展开方面更加激进，这会将循环开销摊销更多一些，但对每个元素的周期没有太大影响。我们只需要这个版本来处理直方图主导的第二个循环的大尺寸。除非我们经常有非常小的尺寸，比如 .）sizeuint16_tfloataddpd128

展开因子	一（高/低对，如上）	by 2 （4 浮动 vecs）	通过 4 （8 浮动车辆）
常春藤桥	7.16 个周期/ITER 预测值	12.8 摄氏度/一分	25 个 C/I
天湖	5.14 c/i	9.0 C/I的	17.0 c/i（英文）
火箭湖	5.19 c/i	9.04 C/I	17.0 c/i（英文）

即使展开系数最低，我们仍然比每个周期的 1 个元素做得更好，更多的展开只会稍微改善循环/元素。

我还没有真正看过直接从 int 到 double 的版本，避免浮点数。这将节省指令，但至少需要一个和每个，如果我们展开更多，则更多。cvtps2pdmulpdaddpd

我尝试了一个可以自动矢量化的纯 C 版本。我能够让 GCC 使用对整数版本进行矢量化，但 clang 遇到了一些麻烦。（使用 AVX2 时，它会在 .如果没有 AVX，它将一半的元素归零，浪费一半的吞吐量。对于大尺寸，浮点循环根本无法很好地编译;在 Godbolt 链接中查看我的尝试。pmaddwdvpmaddwd xmm

整数部分如下所示：

    if (size >= 256 && size <= 32768) {
        int16_t expected = size / 256;
        int chiSquare = 0;      // with size <= 32768, the worst case won't overflow a signed 32-bit int, so we can let compilers do packed 32-bit int to float conversion instead of copying to scalar integer and back for scalar int64 to float (without AVX-512 for unsigned conversion)
        //double expected = (double)size / 256.0f; // Constant expected value
        for (int i = 0; i < 256; ) {
            // 16-bit types allow the compiler to use psubw and pmaddwd when summing to int32_t
        // GCC does quite well, clang fails to use pmaddwd unless SSE4.1 is enabled, making bad code that shuffles a ton.
        //  I think clang's internals have something wrong for pmaddwd, since with AVX2 enabled it's using a YMM sum after a pmaddwd XMM, like it thinks PMADDWD widens to a wider vector?
            int16_t diff = (int16_t)(observed[0][i] + observed[1][i]) - expected;
            i += 1;
            chiSquare += (diff * (int)diff);  // pmaddwd is a 32-bit sum of two 16-bit products
        }
        return chiSquare / (double)expected;

请注意，使用强制转换和赋值来缩小变量范围，以告诉编译器我希望将结果截断为 16 位，但乘法除外。（C 没有加宽乘法;你可以通过强制转换窄输入来表示它。对于小于的类型，这已经隐式发生，但在这里显式可能很好。int

GCC13 自动矢量化为与内部函数版本具有相同 asm 的循环，但没有浪费的 .（是的，即使有，其中“成本模型”阈值“非常便宜”;早期的 GCC 根本没有启用自动矢量化。-O2pmaddwdmovdqa-O2-O2

使用（AVX2+FMA），我们得到 256 位 .-march=x86-64-v3vpmaddwd

我看到大约 70-90 毫秒的性能改进

在什么 CPU 型号上，使用什么编译器？在总时间中，即它是否快了 2 倍？速度快 1.2 倍吗？

对于 size=4K 输入，整个计算应该只需要几到几微秒。（或者您是否在同一数组上重复循环以获得更长的时间间隔？

上一个：如何有效地链接 avx2 内部函数以执行算术运算链？

下一个：如何将 8 个打包的 32 位整数（在一个 __m256i）的 +-1 个符号打包成 64 位整数的字节？

关于使用 SSE2 内部函数进一步优化该卡方函数的建议

Suggestions on further optimising this chi-square function using SSE2 intrinsics

评论

评论

手动矢量化版本（未经测试）