int_fast8_t size vs int_fast16_t size on x86-64 platform

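Question:

I've already learned that on the x86-64 platform, using any 64-bit register needs a REX prefix, and any address narrower than 64 bits requires an address-size prefix.

On x86-64:

E3 rel8 is jrcxz

67 E3 rel8 is jecxz

67 is the address-size override prefix byte.

sizeof(int_fast8_t) is 8 bits, while sizeof(int_fast16_t) and sizeof(int_fast32_t) are 64 bits (on Linux only).

Why is only int_fast8_t 8 bits, while the other fast typedefs are 64 bits?

Does this have something to do with alignment?

c assembly x86-64 64-bit low-level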

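"Fast" is too broad to define properly. Fast in what context? Larger types may let you pack more information into one value (useful when building bitsets or extended-precision integers). They may also prevent vectorization. Using one type for everything avoids sign extension; on the other hand it may waste cache space and prevent packing multiple function arguments into one register. Other questions also discuss this, e.g. stackoverflow.com/questions/46446453/...... In the end, it comes down to what the compiler maker considers important.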

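0 upvotes greg spears 10/1/2023

"int_fast32_t is the fastest signed integer type with at least 32 bits." Credit to @Maroun, which I found here. I hope it is useful to you. I think it's a good answer, and the discussion around it is good too.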

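1 upvote BoP 10/1/2023

@Ex-Kyuto - How do you define "fast"? On x86, 8-bit data is small and uses short instructions. Everything else will use more data cache and more instruction cache. So what is the fastest type?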

Answers:

13 upvotes Peter Cordes 10/1/2023 #1

Why is only int_fast8_t 8 bits, while the other fast typedefs are 64 bits?

Because glibc made a simplistic and arguably bad choice back when these C99 types were new, and then made the bad decision not to specialize them for x86-64.

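All of int_fast16/32/64_t are defined as long across all platforms. This was done in May 1999, before AMD64 was announced with a paper spec (October 1999), which developers presumably took some time to grok. (Thanks to @Homer512 for finding the commit and the history.)

long is the full (integer) register width in both 32-bit and 64-bit GNU systems. It is also the pointer width.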
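To see what a particular C library actually chose, a minimal sketch like the following prints each width next to long and a pointer. The helper name print_fast_widths is made up for illustration. Based on the definitions described above, glibc on x86-64 should report 8 bits for int_fast8_t and 64 bits for the rest, but other libraries are free to differ.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>
#include <stdio.h>

/* The C standard only promises "at least N bits"; anything beyond that
   is the C library's choice. */
static_assert(sizeof(int_fast16_t) * CHAR_BIT >= 16, "guaranteed by C99");
static_assert(sizeof(int_fast32_t) * CHAR_BIT >= 32, "guaranteed by C99");

/* Hypothetical helper: print the width of each fast type, with long and
   a pointer alongside for comparison. */
void print_fast_widths(void)
{
    printf("int_fast8_t  : %zu bits\n", sizeof(int_fast8_t)  * CHAR_BIT);
    printf("int_fast16_t : %zu bits\n", sizeof(int_fast16_t) * CHAR_BIT);
    printf("int_fast32_t : %zu bits\n", sizeof(int_fast32_t) * CHAR_BIT);
    printf("int_fast64_t : %zu bits\n", sizeof(int_fast64_t) * CHAR_BIT);
    printf("long         : %zu bits\n", sizeof(long)         * CHAR_BIT);
    printf("void *       : %zu bits\n", sizeof(void *)       * CHAR_BIT);
}
```

Calling print_fast_widths() from main on each target of interest is enough to compare libraries.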

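For most 64-bit RISCs, full width is fairly natural, although IDK about multiply and divide speeds. It's glaringly bad for x86-64, where 64-bit operand-size takes extra code size, but on MIPS, for example, daddu and addu are the same code size and presumably the same performance. (Before x86-64, RISC ABIs commonly kept narrow types sign-extended to 64-bit at all times, because MIPS at least actually required that for non-shift instructions. See "MOVZX missing 32 bit register to 64 bit register" for more history.)

Glibc's choice makes these types mostly OK for local variables, at least if you don't multiply, divide, use __builtin_popcount, or do any other operation that might take more work with more bits (especially without hardware popcnt support). But they are not good anywhere that storage space in memory matters.

If you were hoping for a type that means "pick a size larger than specified only if that avoids performance potholes", that is not what glibc gives you.

I seem to recall MUSL making better choices on x86-64, like maybe every fast size being the minimum size except for fast16 being 32-bit, avoiding operand-size prefixes and partial-register stuff.

"Fast" raises the question "fast for what?", and the answer differs for every use case. For example, in something that can auto-vectorize with SIMD, the narrowest possible integers are usually best, getting twice as much work done per 16-byte vector instruction. In that case, 16-bit integers can be justified. Or just for cache footprint in arrays. But don't expect the fastxx_t types to weigh the tradeoff between "not too slow" and saving size in arrays.

Usually, narrow load/store instructions are fine on most ISAs, so if cache footprint is a relevant consideration you should have int or int_fastxx_t local variables and narrow array elements. But glibc's choices are often bad even for local vars.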
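The "narrow array elements, wider locals" pattern can be sketched as follows. The function name sum_samples and the choice of int16_t elements with an int32_t accumulator are illustrative assumptions, not something taken from the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum an array stored with narrow (16-bit) elements, using a wider local
   accumulator. The array keeps its small cache footprint; only the value
   held in a register is widened.
   Assumption: n is small enough that the 32-bit total cannot overflow. */
int32_t sum_samples(const int16_t *samples, size_t n)
{
    assert(samples != NULL || n == 0);

    int32_t total = 0;              /* local variable: register-friendly width */
    for (size_t i = 0; i < n; i++)
        total += samples[i];        /* narrow load, widened as part of the load */
    return total;
}
```

Storing the same data as int_fast16_t elements would keep the code identical but, with glibc on x86-64, would be expected to make the array four times larger.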

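Perhaps the glibc people were only counting instructions, not code size (REX prefixes) or the cost of multiply and divide (64-bit was definitely slower than 32-bit or narrower, especially on those early AMD64 CPUs; on Intel, 64-bit integer division was still much slower until Ice Lake; see https://uops.info/ and https://agner.org/optimize/).

And not looking at struct-size bloat, both directly and due to alignof(T) == 8. (Although the sizes of the fast types aren't set in the x86-64 System V ABI, so it's best not to use them at ABI boundaries, such as structs involved in a library API.)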
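The struct-size effect is easy to measure. The sketch below uses two made-up struct types, record_exact and record_fast, that differ only in using exact-width versus fast types. With glibc on x86-64 (int_fast16_t as long, int_fast8_t as a char type) one would expect 6 bytes with 2-byte alignment for the first and 24 bytes with 8-byte alignment for the second; other platforms may give different numbers.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof (C11) */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record types with the same fields at different widths. */
struct record_exact { int16_t      id; int8_t      flags; int16_t      count; };
struct record_fast  { int_fast16_t id; int_fast8_t flags; int_fast16_t count; };

/* A fast type is never smaller than the matching exact-width type. */
static_assert(sizeof(int_fast16_t) >= sizeof(int16_t), "fast type is at least as wide");

/* Print size and alignment of both layouts on the current platform. */
void print_record_layouts(void)
{
    printf("record_exact: size %zu, align %zu\n",
           sizeof(struct record_exact), alignof(struct record_exact));
    printf("record_fast : size %zu, align %zu\n",
           sizeof(struct record_fast), alignof(struct record_fast));
}
```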

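I don't really know why they got it so badly wrong, but it makes the int_fastxx_t types useless for anything except local variables (not most structs or arrays), because x86-64 GNU/Linux is an important platform for most portable code, and you don't want your code to suck there.

It's a bit like MinGW's brain-dead decision to have std::random_device return low-quality random numbers (instead of failing until they got around to implementing something usable): like dumping radioactive waste on it, as far as portable code being able to use the language feature for its intended purpose.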

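One of the few advantages of using 64-bit integers is potentially avoiding having to deal with garbage in the high part at ABI boundaries (function args and return values). But usually that doesn't matter unless you need to extend the value to pointer width as part of an addressing mode. (In x86-64, all registers in an addressing mode must be the same width, like [rdi + rdx*4]. AArch64 has modes like [x0, w1, sxtw] that sign-extend a 32-bit register as an index for a 64-bit register. But AArch64's machine-code format was designed from scratch, after seeing other 64-bit ISAs in action.)

For example, arr[ foo(i) ] can avoid an instruction to zero-extend the return value if the return type fills a register. Otherwise the value needs to be sign- or zero-extended to pointer width before it can be used in an addressing mode, with mov or movsxd (32-bit to 64-bit) or movzx or movsx (8-bit or 16-bit to 64-bit).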
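As a rough sketch of the arr[ foo(i) ] case, the two hypothetical helpers below differ only in return width. On x86-64 System V, a caller of the 32-bit version would be expected to need a sign-extension (movsxd or similar) before indexing, while the 64-bit version can be used in the addressing mode directly. In a single file like this the compiler will probably inline both and hide the difference, so it only really shows up when the callee lives in another translation unit.

```c
#include <stdint.h>

/* Hypothetical index helpers: same logic, different return width.
   Assumption: i + 1 does not overflow. */
int32_t next_index32(int32_t i) { return i + 1; }
int64_t next_index64(int64_t i) { return i + 1; }

/* The 32-bit result must be widened to pointer width before it can be
   used to address memory. */
int load_via_32(const int *arr, int32_t i)
{
    return arr[next_index32(i)];
}

/* The 64-bit result already fills a register and can be used as-is. */
int load_via_64(const int *arr, int64_t i)
{
    return arr[next_index64(i)];
}
```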

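Or, with the way x86-64 System V passes and returns structs by value in up to 2 registers, 64-bit integers don't need any unpacking since they're already in registers by themselves. For example, struct { int32_t a,b; } has two ints packed into RAX in a return value, which requires packing in the callee and unpacking in the caller if the result is actually used, rather than just storing the object representation to a struct in memory. (e.g. mov ecx, eax to get the low half / shr rax, 32. Or just add ebx, eax to use the low half, then discard it when shifting; you don't need to zero-extend it to 64-bit to use it as a 32-bit integer.)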
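The packing that the compiler does behind the scenes for such a struct return can be written out by hand. This is only an illustration: the names pair32, make_pair32, pack_pair, unpack_lo and unpack_hi are invented, and the low-half-first layout assumes a little-endian target such as x86-64.

```c
#include <stdint.h>

/* 8-byte struct: on x86-64 System V it is returned packed into RAX. */
struct pair32 { int32_t a, b; };

struct pair32 make_pair32(int32_t a, int32_t b)
{
    struct pair32 p = { a, b };
    return p;
}

/* Manual equivalent of what the callee does: a in the low 32 bits,
   b in the high 32 bits of one 64-bit value. */
uint64_t pack_pair(int32_t a, int32_t b)
{
    return (uint64_t)(uint32_t)a | ((uint64_t)(uint32_t)b << 32);
}

/* Manual equivalent of what the caller does to use each member.
   The unsigned-to-signed conversions wrap on mainstream compilers. */
int32_t unpack_lo(uint64_t packed) { return (int32_t)(uint32_t)packed; }
int32_t unpack_hi(uint64_t packed) { return (int32_t)(uint32_t)(packed >> 32); }
```

With two 64-bit members instead, each value would travel in its own register and neither step would be needed.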

Within a function, compilers will know that a value is already zero-extended to 64-bit after writing a 32-bit register. And loading from memory, even with sign-extension to 64-bit, is free (movsxd rax, [rdi] instead of mov eax, [rdi]). (Or almost free on older CPUs where memory-source sign-extension still needed an ALU uop, not done as part of a load uop.)

Because signed integer overflow is UB, compilers are able to widen int (int32_t) to 64-bit in loops like for (int i = 0 ; i < n ; i++ ) arr[i] += 1;, or convert it to a 64-bit pointer increment. (I wonder if GCC maybe couldn't do this back in the early 2000s when these software design decisions were being made? In that case, yes, wasted movsxd instructions to keep re-extending a loop counter to 64-bit would be an interesting consideration.)
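The two forms of that loop can be written side by side. The first function is the loop as quoted; the second is a hand-written version of the pointer-increment form a compiler is allowed to produce from it. Both function names are made up, and whether a given compiler actually performs the transformation depends on its version and optimization level.

```c
#include <stddef.h>

/* The loop as written in source: a 32-bit signed counter indexing the array. */
void increment_indexed(int *arr, int n)
{
    for (int i = 0; i < n; i++)
        arr[i] += 1;
}

/* Roughly what the optimizer may turn it into: no counter to re-extend,
   just a pointer stepping through the array. */
void increment_pointer(int *arr, int n)
{
    if (n <= 0)
        return;                      /* same behavior as the loop above */
    for (int *p = arr, *end = arr + n; p != end; p++)
        *p += 1;
}
```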

But to be fair, you can still have sign-extension instructions from using signed 32-bit integer types in computations which might produce negative results if you then use those to index arrays. So 64-bit int_fast32_t avoids those movsxd instructions, at the cost of being worse in other cases. Maybe I'm discounting this because I know to avoid it, e.g. using unsigned when appropriate because I know it zero-extends for free on x86-64 and AArch64.
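A small sketch of that tradeoff, with invented function names: both functions compute an index with a subtraction, but only the signed one can yield a negative value that has to be sign-extended before addressing. On x86-64, the unsigned result is expected to be usable directly because writing a 32-bit register zero-extends it.

```c
#include <stdint.h>

/* Signed 32-bit index: the difference may be negative, so the compiler
   has to sign-extend it to 64-bit before using it in an address. */
int pick_signed(const int *table, int32_t base, int32_t offset)
{
    return table[base - offset];
}

/* Unsigned 32-bit index: the difference wraps modulo 2^32 and is
   zero-extended implicitly.
   Assumption: the caller guarantees base >= offset. */
int pick_unsigned(const int *table, uint32_t base, uint32_t offset)
{
    return table[base - offset];
}
```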

For actual computation, 32-bit operand-size is generally at least as fast as anything else including for imul/div and popcnt, and avoids partial-register penalties or extra movzx instructions you get with 8-bit or 16-bit.

The advantages of using 32bit registers/instructions in x86-64
Why is default operand size 32 bits in 64 mode? - 32-bit needs no REX or operand-size prefix.

But 8-bit is not bad, and if your numbers are that small, it's even worse to balloon them to 32 or 64-bit; there's probably more of an expectation from programmers that int_fast8_t will be small unless it's a lot more expensive to make it larger. It isn't on x86-64; Are there any modern CPUs where a cached byte store is actually slower than a word store? - yes, most non-x86 apparently, but x86 does make bytes and 16-bit words fast for load/store as well as computation.

Avoiding 16-bit is probably good, worth the cost of an extra 2 bytes in some cases. add ax, 12345 (and other imm16 instructions) have LCP decode stalls on Intel CPUs. Plus partial-register false dependencies (or on older CPUs, merging stalls).

jrcxz vs. jecxz is a weird example because it uses the 67h address-size prefix, rather than the 66h operand-size prefix. And because compilers never(?) use it. It's not as slow as the loop instruction, but it's surprisingly not single-uop even on Intel CPUs that can macro-fuse a test/jz into a single uop.

For reference, the relevant header has been in place since 1999 without platform-specific ifdefs unless I'm mistaken, see this commit. If I had to guess, I'd say the thinking was, everything will be RISC in the near 64bit future, so stick with word size, except bytes since every platform will have fast string and pixel handling. Except Alpha AXP but I vaguely remember Ulrich Drepper hating that architecture anyway; unless I'm misremembering

0 upvotes Homer512 10/2/2023

Err, sorry, it was apparently ARM9 that is carp (sic), not Alpha

0 upvotes Peter Cordes 10/2/2023

@Homer512: Sounds like it's not really the ARM architecture he disliked, but the questionable ABI design decision of aligning a small char pad[3] within a struct. Which apparently only existed in oABI (ARM's "old" ABI). Anyway, thanks for digging up the source of that vague memory to confirm it wasn't related to this after all. And for the sordid history of the fast types.

2 upvotes Peter Cordes 10/2/2023

That seems hilariously lazy and bad to just define int_fast16/32/64_t as long across all platforms. That was done before AMD64 was proposed, and long is a full register width in 32 and 64-bit GNU systems. For most RISCs, that width is fairly natural, although IDK about multiply and divide speeds. It's glaringly bad for x86-64 where 64-bit operand-size takes extra code-size, but MIPS daddu and addu are basically equivalent.
