Fastest way to take the average of two signed integers in x86 assembly?

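Question:

Suppose we have two register-length² signed¹ integers, say a and b. We want to compute the value (a + b) / 2, rounded up, down, towards zero, or away from zero, whichever is easiest (i.e. we don't care about the rounding direction).

The result is another register-length signed integer (clearly, the average must lie within the range of a register-length signed integer).

What is the fastest way to perform this computation?

You may choose which registers the two integers start in and which register the average ends up in.

Footnote 1: For unsigned integers, we can do it in two instructions. This is perhaps the fastest way, although rotate-through-carry is more than 1 uop on Intel CPUs (but only a couple when the count is just 1). Answers on a Q&A about unsigned mean discuss efficiency.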

add rdi, rsi
rcr rdi, 1

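The two numbers start in rdi and rsi, and the average ends up in rdi. But for signed numbers, -1 + 3 would set CF and rotate a 1 into the sign bit, which does not give the correct answer of +1.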
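To make the carry handling concrete, here is a minimal C sketch that models the add / rcr pair on 64-bit values. The function name is made up for illustration, and the flag is reconstructed arithmetically rather than read from the CPU.

```c
#include <stdint.h>

/* Model of: add rdi, rsi ; rcr rdi, 1
   CF holds bit 64 of the sum; rcr by 1 shifts it back in at the top,
   so the full 65-bit unsigned sum is halved. */
uint64_t avg_unsigned_add_rcr(uint64_t a, uint64_t b)
{
    uint64_t sum = a + b;            /* add: wraps modulo 2^64 */
    uint64_t cf  = sum < a;          /* CF = carry out of bit 63 */
    return (sum >> 1) | (cf << 63);  /* rcr by 1: CF becomes the new top bit */
}
```

Passing the bit patterns of -1 and 3 gives 0x8000000000000001, which is a large negative number when reinterpreted as signed, rather than +1. That is the failure described above.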

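Footnote 2: I specified register-length signed integers so that we can't simply sign-extend the integers with a movsxd or cdqe instruction.

The closest I have come to a solution uses four instructions, one of which is an rcr that is 3 uops on Intel and 1 on AMD Zen (https://uops.info/):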

add rdi, rsi
setge al
sub al, 1          # CF = !(ge) = !(SF==OF)
rcr rdi, 1         # shift CF into the top of (a+b)>>1

I think a shorter solution might lie in combining the middle two instructions somehow, i.e. performing CF ← SF ≠ OF.
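The four-instruction sequence can be modelled in C as a rough sketch. The function name is hypothetical, SF and OF are recomputed from the operands, and the final cast assumes a two's-complement target (which any x86 compiler provides).

```c
#include <stdint.h>

/* Model of: add rdi, rsi ; setge al ; sub al, 1 ; rcr rdi, 1 */
int64_t avg_signed_setge_rcr(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;                         /* add: wrapped 64-bit sum */
    uint64_t sf  = sum >> 63;                       /* SF = sign bit of the wrapped sum */
    uint64_t of  = ((ua ^ sum) & (ub ^ sum)) >> 63; /* OF = signed overflow occurred */
    uint64_t cf  = sf ^ of;                         /* setge + sub: CF = (SF != OF) */
    uint64_t r   = (sum >> 1) | (cf << 63);         /* rcr by 1: CF becomes the sign bit */
    return (int64_t)r;                              /* assumes two's-complement conversion */
}
```

SF ≠ OF is the sign of the exact 65-bit sum, so rotating it in at the top amounts to an arithmetic right shift of that exact sum. In this model the result therefore rounds towards -Inf, e.g. the inputs -3 and -4 give -4.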

I have seen this question, but it is not specific to x86, and none of the answers seem to be as good as my solution.

assembly optimization x86 average micro-optimization

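"Up" here is always towards +Inf, not symmetric around 0. But yes, for avg(-3,-3) we get -2 + -2 + CF(1), so we get the correct -3. For avg(3,3) we get 1 + 1 + CF(1), which is also 3. So always rounding up by adding CF cancels out the towards -Inf rounding of the arithmetic right shift.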
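The answer this comment replies to is not part of this excerpt. The arithmetic it describes (halve each input with an arithmetic right shift, then add the halves plus CF) can be sketched in C as follows. The instruction sequence in the code comment and both function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Floor division by 2, i.e. what an arithmetic right shift by 1 computes. */
static int64_t floor_half(int64_t x)
{
    return x / 2 - (x % 2 < 0);   /* C division truncates, so fix up negative odd x */
}

/* Hypothetical model of: sar rdi, 1 ; sar rsi, 1 ; adc rdi, rsi (assumed sequence) */
int64_t avg_signed_halves_plus_carry(int64_t a, int64_t b)
{
    int64_t cf = (b % 2 != 0);    /* CF = low bit of b, shifted out by the last sar */
    return floor_half(a) + floor_half(b) + cf;
}
```

With these definitions, (-3, -3) gives -3 and (3, 3) gives 3, matching the comment. The result depends on operand order when exactly one input is odd: (1, 0) gives 0 while (0, 1) gives 1.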

0 upvotes vengy 7/26/2022

@PeterCordes Would something like this work? lea eax, [edi+esi]; shr eax, 1

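0 upvotes Peter Cordes 7/26/2022

@vengy: For unsigned inputs that are known not to overflow, yes, i.e. inputs that are already zero-extended from 31 bits or narrower. (Except you would never use 32-bit address-size in LEA; you would use lea eax, [rdi+rsi] to avoid an address-size prefix.) But this Q&A is about full-range signed inputs that might be negative, so even sar eax, 1 isn't sufficient.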
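A small C sketch of why the 32-bit lea variants only work for restricted inputs. Both function names are made up, and the sum is modelled as a wrapped 32-bit addition, which is what the low 32 bits of the lea result contain.

```c
#include <stdint.h>

/* Model of: lea eax, [rdi+rsi] ; shr eax, 1  (wrapped 32-bit sum, logical shift) */
uint32_t avg_lea_shr(uint32_t a, uint32_t b)
{
    return (a + b) >> 1;   /* the carry out of bit 31 is lost */
}

/* Model of: lea eax, [rdi+rsi] ; sar eax, 1  (wrapped 32-bit sum, arithmetic shift) */
int32_t avg_lea_sar(int32_t a, int32_t b)
{
    uint32_t sum  = (uint32_t)a + (uint32_t)b;   /* wraps modulo 2^32 */
    uint32_t sign = sum & 0x80000000u;           /* sar replicates the wrapped sign bit */
    return (int32_t)((sum >> 1) | sign);         /* assumes two's-complement conversion */
}
```

For inputs that fit in 31 bits, avg_lea_shr is exact up to rounding. With full-range inputs the wrapped sign bit is wrong whenever the sum overflows: avg_lea_sar(INT32_MAX, INT32_MAX) returns -1 instead of INT32_MAX.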

0 upvotes Bernard 8/3/2022

This seems to be the only faster solution so far, but the downside is that the operation is not commutative.

8 upvotes geometrian 7/26/2022 #2

As an outside answer, consider the pavg family of instructions.

I say "outside", since this is likely not acceptable to you. It assumes the value is unsigned 8-bit or 16-bit and in an SSE register, which of course also requires SSE. I mention it mainly since it is x86's anointed equivalent to averaging instructions in other ISAs.

In its defense, SSE is ubiquitous by now, even guaranteed on x86-64. Also, this instruction is 1 cycle, and actually can do 4 at once if you like. Best of all, unlike your original solutions, it also correctly handles overflow issues.

Note that it's possible to use an unsigned routine to implement a signed routine, though in general correctly accounting for overflow issues is a nightmare. Your current solution appears to already be broken in that regard, though.

Can you maybe range-shift signed to unsigned by adding 128 (i.e. flipping the high bit)? So pxor both inputs with set1_epi8(0x80), pavgb, then pxor back to the signed range? I'd expect that to work even near overflow boundaries, since pavgb / pavgw does. And you can use other unsigned-rounding bithacks that don't rely on carry-out, if you want a vectorized version of this trick for 32-bit operand-size. (But yeah, not generally worth transferring data from GP integer regs to XMM and back for a single scalar average, especially of signed numbers.)
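The range-shift idea from this comment can be checked with a scalar C model of a single byte lane. This is only a sketch: both function names are invented, and the lane function imitates the rounding of pavgb, (a + b + 1) >> 1, rather than calling the instruction.

```c
#include <stdint.h>

/* Scalar model of one pavgb lane: unsigned (a + b + 1) >> 1, computed
   in a wider type so the carry out of bit 7 is not lost. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Signed 8-bit average: flip the top bit, average as unsigned, shift back. */
int8_t avg_signed_via_pavgb(int8_t a, int8_t b)
{
    uint8_t ua  = (uint8_t)((uint8_t)a ^ 0x80u);   /* pxor with 0x80: maps -128..127 to 0..255 */
    uint8_t ub  = (uint8_t)((uint8_t)b ^ 0x80u);
    uint8_t avg = pavgb_lane(ua, ub);
    /* Flipping the top bit again equals subtracting 128; the subtraction
       form stays within well-defined C conversions. */
    return (int8_t)((int)avg - 128);
}
```

In this model the extremes behave as the comment expects: (127, 127) gives 127 and (-128, -128) gives -128. Halves round towards +Inf, so (-1, 0) gives 0.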

0 upvotes geometrian 7/26/2022

@PeterCordes I can't comment on algorithms for using this for signed; that sort of thing is obnoxiously difficult to get right and it's 2am right now. And yeah, the assumption is you're already in an XMM register. Actually, what inspired this answer was a recent image processing paper where this was used for a win; you get back a lot in parallelism doing this over a whole image, and images are often 8-bit unsigned so it's essentially a perfect use-case.

0 upvotes Bernard 7/26/2022

"Your current solution appears to already be broken in that regard, though." Do you mean the fact that setge can only assign to an 8-bit register? Or is it something else I'm not aware of?

0 upvotes Bernard 7/26/2022

I suppose this solution works, but if my integer was only 16 bits long then I could just perform addition without overflow in regular registers. Unless I'm using some ancient 16-bit x86 hardware.

1 upvote phuclv 7/26/2022

@Bernard but if you have lots of 8/16-bit integers then this will be much faster because it can do multiple additions at the same time
