Fastest way to take the average of two signed integers in x86 assembly?

Asked by: Bernard · Asked: 7/26/2022 · Last edited by: Peter Cordes, Bernard · Updated: 7/27/2022 · Views: 2977

Q:

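Suppose we have two register-length[2] signed[1] integers, say a and b. We want to compute (a + b) / 2, rounded up, down, toward zero, or away from zero, whichever is easier (i.e. we don't care about the rounding direction).

The result is another register-length signed integer (clearly, the average must lie within the range of a register-length signed integer).

What is the fastest way to perform this computation?

You may choose which registers the two integers start in and which register the average ends up in.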


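Footnote 1: For unsigned integers, we can do it in two instructions. This may be the fastest way, even though rotate-through-carry is more than 1 uop on Intel CPUs (but only a couple when the count is just 1). The answers to a Q&A about unsigned mean discuss efficiency.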

add rdi, rsi
rcr rdi, 1

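The two numbers start in rdi and rsi, and the average ends up in rdi. But for signed numbers, -1 + 3 sets CF, and rcr rotates a 1 into the sign bit, which does not give the correct answer of +1.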
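As a rough illustration of the footnote, here is a minimal Python model of the add / rcr pair on 64-bit registers. The helper names add_rcr and as_signed are hypothetical and only exist for this sketch; it assumes rcr with a count of 1 moves CF into bit 63 and shifts the rest right by one.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit register


def add_rcr(a, b):
    """Model `add rdi, rsi` followed by `rcr rdi, 1` (hypothetical helper)."""
    total = (a & MASK64) + (b & MASK64)  # operands as raw 64-bit patterns
    cf = total >> 64                     # carry-out of the add
    low = total & MASK64                 # wrapped sum left in rdi
    return (cf << 63) | (low >> 1)       # rcr by 1: CF becomes the top bit


def as_signed(x):
    """Reinterpret a 64-bit pattern as a two's-complement value."""
    return x - (1 << 64) if x >> 63 else x
```

Read as unsigned, add_rcr(2**64 - 1, 2**64 - 1) stays at 2**64 - 1, so the carry-out is handled correctly. Read as signed, as_signed(add_rcr(-1, 3)) comes out as a large negative number instead of 1, which is the failure described above.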

Footnote 2: I specified register-length signed integers so that we can't simply sign-extend the integers with a movsxd or cdqe instruction.


The closest I've gotten to a solution uses four instructions, one of which, rcr, is 3 uops on Intel and 1 on AMD Zen (https://uops.info/):

add rdi, rsi
setge al
sub al, 1          # CF = !(ge) = !(SF==OF)
rcr rdi, 1         # shift CF into the top of (a+b)>>1

I think a shorter solution might lie in combining the middle two instructions somehow, i.e. performing CF ← SF ≠ OF.
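To see why CF ← SF ≠ OF is the right bit to rotate in, the four-instruction sequence can be modeled in Python as follows. This is only a sketch: signed_avg_rcr and to_signed are hypothetical names, and the inputs are assumed to already lie in the signed 64-bit range. After the add, SF xor OF is the sign of the true (unwrapped) sum, so rotating it into bit 63 gives the sum shifted right arithmetically by one.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit register


def to_signed(x):
    """Reinterpret a 64-bit pattern as a two's-complement value."""
    return x - (1 << 64) if x >> 63 else x


def signed_avg_rcr(a, b):
    """Model add / setge / sub / rcr for signed 64-bit a and b (hypothetical helper)."""
    low = (a + b) & MASK64             # add rdi, rsi (wrapped result)
    sf = low >> 63                     # SF: top bit of the wrapped sum
    of = int(to_signed(low) != a + b)  # OF: wrapped sum differs from the true sum
    cf = sf ^ of                       # setge al / sub al, 1  ->  CF = (SF != OF)
    return to_signed((cf << 63) | (low >> 1))  # rcr rdi, 1
```

Under these assumptions the result equals the true sum floored after halving, so signed_avg_rcr(-1, 3) gives 1, and the extremes (both inputs at the maximum or minimum 64-bit value) come back unchanged.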

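I've seen this question, but it isn't x86-specific, and none of its answers seems to be as good as my solution.

Tags: assembly, optimization, x86, average, micro-optimization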

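Comments

4 upvotes · Bernard · 7/26/2022
Try starting with rdi = -1 and rsi = 3. add rdi, rsi will set CF, which the rcr rdi, 1 instruction rotates into the sign bit of rdi, producing some negative number. But the correct answer is 1.
1 upvote · Bernard · 7/26/2022
@Brendan No, try starting with two integers greater than 2^30. Adding the two integers will set the sign bit, so your sar instruction will keep the sign bit set, producing a negative integer. But the correct answer is positive.
3 upvotes · Brendan · 7/26/2022
Hrm (using RAX instead of RDI): cqo; add rax,rsi; adc rdx,0; shrd rax,rdx,1
1 upvote · fuz · 7/26/2022
@Brendan This also works with any pair of registers if you replace cqo with mov hi, reg; sar hi, 63. The mov is a rename, so it is basically free.
1 upvote · Nate Eldredge · 7/26/2022
If my math is right, sar rdi, 1 ; sar rsi, 1 ; add rdi, rsi works unless both operands are odd, in which case we need to add 1 to the result. Maybe there is some way to use that?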

Answers:

33 upvotes · Nate Eldredge · 7/26/2022 · #1

Depending on how we interpret your relaxed rounding requirements, the following may be acceptable:

sar rdi, 1
sar rsi, 1
adc rdi, rsi

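Try it on godbolt

This effectively divides both inputs by 2, adds the results, and adds 1 more if rsi was odd. (Remember that sar sets the carry flag according to the last bit shifted out.)

Since sar rounds toward negative infinity, the result of this algorithm is:

  • exactly correct if RDI and RSI are both even or both odd

  • rounded down (toward negative infinity) if RDI is odd and RSI is even

  • rounded up (toward positive infinity) if RDI is even and RSI is odd

As a bonus, for random inputs the average rounding error is zero.

On a typical CPU this should be 3 uops with a latency of 2 cycles, since the two sar instructions are independent.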
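The rounding cases listed in this answer can be checked with a small Python model of the three instructions. This is a sketch only: avg_sar_adc is a hypothetical name, a stands for RDI and b for RSI, and it relies on Python's >> flooring negative integers the same way sar does.

```python
def avg_sar_adc(a, b):
    """Model `sar rdi, 1` / `sar rsi, 1` / `adc rdi, rsi` (a in rdi, b in rsi)."""
    cf = b & 1                       # the second sar leaves rsi's low bit in CF
    return (a >> 1) + (b >> 1) + cf  # adc adds both halves plus CF


# Both odd or both even: exact.
print(avg_sar_adc(-3, -3), avg_sar_adc(3, 3), avg_sar_adc(2, 4))
# RDI odd, RSI even: rounds down. RDI even, RSI odd: rounds up.
print(avg_sar_adc(3, 4), avg_sar_adc(4, 3))
```

The last line also shows that swapping the operands can change the result when exactly one of them is odd, so the operation is not commutative.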

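Comments

3 upvotes · Peter Cordes · 7/26/2022
The "up" here is always toward +Inf, not symmetric around 0. But yes, for avg(-3,-3) we get -2 + -2 + CF(1), so we get the correct -3. For avg(3,3) we get 1 + 1 + CF(1), which is also the correct 3. So always adding CF upward cancels out the toward -Inf rounding of the arithmetic right shift.
0 upvotes · vengy · 7/26/2022
@PeterCordes Would something like this work? lea eax, [edi+esi]; shr eax, 1
0 upvotes · Peter Cordes · 7/26/2022
@vengy: For unsigned inputs that are known not to overflow, yes, i.e. already zero-extended from 31 bits or narrower. (Except you would never use 32-bit address-size in an LEA; you would write lea eax, [rdi+rsi] to avoid the address-size prefix.) But this Q&A is about full-range signed inputs that may be negative, so even sar eax, 1 would not be enough.
0 upvotes · Bernard · 8/3/2022
This seems to be the only faster solution so far, though the drawback is that the operation is not commutative.

8 upvotes · geometrian · 7/26/2022 · #2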

As an outside answer, consider the pavg family of instructions.

I say "outside", since this is likely not acceptable to you. It assumes the value is unsigned 8-bit or 16-bit and in an SSE register, which of course also requires SSE. I mention it mainly since it is x86's anointed equivalent to averaging instructions in other ISAs.

In its defense, SSE is ubiquitous by now, even guaranteed on x86-64. Also, this instruction is 1 cycle, and actually can do 4 at once if you like. Best of all, unlike your original solutions, it also correctly handles overflow issues.

Note that it's possible to use an unsigned routine to implement a signed routine, though in general correctly accounting for overflow issues is a nightmare. Your current solution appears to already be broken in that regard, though.
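For reference, a minimal Python model of what one unsigned 8-bit lane of pavgb computes, together with the high-bit-flip idea for signed inputs that comes up in the comments on this answer. Both function names are hypothetical, and the signed wrapper is an illustration of that idea under these assumptions, not something this answer itself proposes.

```python
def pavgb_lane(x, y):
    """One unsigned 8-bit lane of pavgb: (x + y + 1) >> 1, no overflow in the sum."""
    return (x + y + 1) >> 1


def signed_avg_via_pavgb(a, b):
    """Signed 8-bit average via a range shift (hypothetical helper)."""
    ua = (a & 0xFF) ^ 0x80         # map [-128, 127] onto [0, 255]
    ub = (b & 0xFF) ^ 0x80
    r = pavgb_lane(ua, ub) ^ 0x80  # average, then flip the high bit back
    return r - 256 if r & 0x80 else r
```

In this model the signed wrapper always rounds halves toward positive infinity, for example signed_avg_via_pavgb(-3, -4) gives -3, and it stays in range at the extremes.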

Comments

0 upvotes · Peter Cordes · 7/26/2022
Can you maybe range-shift signed to unsigned by adding 128 (i.e. flipping the high bit)? So pxor both inputs with set1_epi8(0x80), pavgb, then pxor back to the signed range? I'd expect that to work even near overflow boundaries, since pavgb / pavgw does. And you can use other unsigned-rounding bithacks that don't rely on carry-out, if you want a vectorized version of this trick for 32-bit operand-size. (But yeah, not generally worth transferring data from GP integer regs to XMM and back for a single scalar average, especially of signed numbers.)
0 upvotes · geometrian · 7/26/2022
@PeterCordes I can't comment on algorithms for using this for signed; that sort of thing is obnoxiously difficult to get right and it's 2am right now. And yeah, the assumption is you're already in an XMM register. Actually, what inspired this answer was a recent image processing paper where this was used for a win; you get back a lot in parallelism doing this over a whole image, and images are often 8-bit unsigned so it's essentially a perfect use-case.
0 upvotes · Bernard · 7/26/2022
"Your current solution appears to already be broken in that regard, though." Do you mean the fact that setge can only assign to an 8-bit register? Or is it something else I'm not aware of?
0 upvotes · Bernard · 7/26/2022
I suppose this solution works, but if my integer was only 16 bits long then I could just perform addition without overflow in regular registers. Unless I'm using some ancient 16-bit x86 hardware.
1 upvote · phuclv · 7/26/2022
@Bernard but if you have lots of 8/16-bit integers then this will be much faster because it can do multiple additions at the same time