What's the maximum number of code points for a UTF-8 encoded displayed character?

Q:

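We were trying to insert emojis into our database, but ran into strange behavior. It turned out to be related to the UTF-8 encoding. 👍 worked fine, but 🌶️ did not. This is when we learned about code points. 👍 is one code point long, but 🌶️ is 2: it is composed of Hot Pepper (U+1F336) and Variation Selector-16 (U+FE0F).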

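Upon learning this, we increased our database storage width to 2, which fixed the problem for 🌶️, but then we found a new problem. The keycap emojis (1️⃣2️⃣3️⃣) are composed of 3 code points: Digit One (U+0031), Variation Selector-16 (U+FE0F), and Combining Enclosing Keycap (U+20E3).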

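"Alright," we said, "bump it to 4." Then came 🤦🏼‍♂️ at 5 code points: Face Palm (U+1F926), Emoji Modifier Fitzpatrick Type-3 (U+1F3FC), Zero Width Joiner (U+200D), Male Sign (U+2642), and Variation Selector-16 (U+FE0F). We tried a few more and found that the flag of England 🏴󠁧󠁢󠁥󠁮󠁧󠁿 is composed of 7 code points: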

U+1F3F4: Waving Black Flag
U+E0067: Tag Latin Small Letter G
U+E0062: Tag Latin Small Letter B
U+E0065: Tag Latin Small Letter E
U+E006E: Tag Latin Small Letter N
U+E0067: Tag Latin Small Letter G
U+E007F: Cancel Tag

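So the question is: what is the maximum number of code points a single displayed Unicode character can use? Are there any examples of emoji (or other UTF-8 characters) longer than 7 code points?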

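This question is similar to "What is the maximum number of bytes for a UTF-8 encoded character?", which asks how many bytes a single code point can take in UTF-8 (spoiler: 4). The title is also similar to "Does Unicode have a defined maximum number of code points?", but that question asks how many distinct code points exist, not how many can be used in a row to make up a single character displayed on screen.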
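The difference between code points and bytes drawn above can be checked with a short Python sketch. It assumes only the standard library; the helper name `describe` is hypothetical, and the flag is written with escapes so the invisible tag characters are explicit.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Flag of England from the question, spelled out code point by code point.
england = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"

for code_point, name in describe(england):
    print(code_point, name)

print(len(england))                  # code points in the sequence: 7
print(len(england.encode("utf-8")))  # UTF-8 bytes for the sequence: 28

# A single code point never needs more than 4 bytes in UTF-8,
# even at the top of the Unicode range.
print(len("\U0010FFFF".encode("utf-8")))  # 4
```

A column sized in "characters" may therefore be counting code points, bytes, or something else entirely, depending on the database; that is worth confirming before picking a width.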

Tags: utf-8

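The longest sequence I've seen discussed, though it isn't part of the recommendation, is 11 code points for skin-tone-modified family sequences (for example, woman-medium-zwj-woman-dark-zwj-girl-light-zwj-girl-medium). Such sequences would almost certainly overwhelm fonts (supporting them would add 4000+ new glyphs), so they are unlikely to ever appear in the recommendation.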

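But just because it isn't recommended doesn't mean it isn't legal. I can combine any number of people into a family sequence, attaching gender, hair color, and skin tone to each one, and that is a legal sequence that is meant to be rendered as a single "character." For example, here is the family emoji with skin-tone modifiers discussed above: 👩🏽‍👩🏿‍👧🏻‍👧🏽. It renders as four people, but if your text engine works correctly, you'll find that it selects as a single "character," because it is one. There just isn't a dedicated glyph for it in any common font.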
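As a rough illustration of how such sequences grow, here is a Python sketch that assembles the family sequence from its parts. The helper `zwj_sequence` and the constant names are hypothetical; whether the result displays as one glyph depends entirely on the font and text engine.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

WOMAN, GIRL = "\U0001F469", "\U0001F467"
# Skin tone modifiers used in the example above (light, medium, dark).
LIGHT, MEDIUM, DARK = "\U0001F3FB", "\U0001F3FD", "\U0001F3FF"

def zwj_sequence(*members):
    """Join already-modified emoji into a single ZWJ sequence."""
    return ZWJ.join(members)

# woman-medium-zwj-woman-dark-zwj-girl-light-zwj-girl-medium
family = zwj_sequence(WOMAN + MEDIUM, WOMAN + DARK, GIRL + LIGHT, GIRL + MEDIUM)
print(family, len(family))  # 4 people x 2 code points + 3 joiners = 11

# Nothing in the encoding stops the sequence from growing further.
bigger = zwj_sequence(*[WOMAN + MEDIUM] * 10)
print(len(bigger))  # 10 people x 2 code points + 9 joiners = 29
```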

I can also add other modifiers without limit, and you can attach these to arbitrary characters (not just emoji). For all the gory details, see UTS #51: Unicode Emoji.

Then you raise the further question of "or other utf-8 characters." That is really about combining characters, which brings us to:

ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Unicode puts no limit on the number of combining characters that can be attached to a single "starter" character.

However, Unicode does define a Stream-Safe Text Format, which puts a limit of 30 combining characters per "chunk." (It does actually allow additional combining characters by interposing a COMBINING GRAPHEME JOINER, but these are normalized separately.) A complete chunk will not contain more than 32 characters total, and will not require more than 128 bytes to encode in UTF-8.
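The limit can be approximated in Python with `unicodedata.combining`, which returns a character's canonical combining class (non-zero for non-starters). This is a simplified sketch, not the full algorithm from the specification: the function names are hypothetical, and it simply counts runs of non-starters after compatibility decomposition.

```python
import unicodedata

MAX_NON_STARTERS = 30  # limit quoted above for the Stream-Safe Text Format

def longest_combining_run(text):
    """Length of the longest run of characters with a non-zero combining class."""
    longest = current = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def looks_stream_safe(text):
    """Approximate check: decompose first, then look for over-long runs."""
    decomposed = unicodedata.normalize("NFKD", text)
    return longest_combining_run(decomposed) <= MAX_NON_STARTERS

print(looks_stream_safe("e" + "\u0301" * 30))  # True: exactly at the limit
print(looks_stream_safe("e" + "\u0301" * 31))  # False: one mark too many

# Byte bound quoted above: 32 code points at up to 4 UTF-8 bytes each.
print(32 * 4)  # 128
```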

As the spec notes, "the value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage." The largest I've heard of is 8 combining characters in the Tibetan HAKṢHMALAWARAYAṀ (ཧྐྵྨླྺྼྻྂ). That's one character. (That said, while I've seen this particular example thrown around a lot, I've never seen anyone describe how it is used by Tibetans, so I cannot confirm that this is actually a real character.)

Just to add: the answer may assume a canonical form. As the Unicode standard tells us, some inputs (especially those coming directly from a keyboard) may contain extra code points that serve no purpose and can make a sequence very long. In any case, I assume you will transform any string to a valid canonical form, so the answer is correct, just with a caveat. One more warning concerns invalid UTF-8 sequences: invalid bytes, overlong sequences, surrogate code points, etc.
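Both caveats can be sketched in Python, assuming NFC is the canonical form you settle on (that choice is an assumption here, and the helper name `decodes_as_utf8` is hypothetical). Normalization can change the code point count of a string, and a strict decoder rejects the malformed byte sequences mentioned above.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one displayed character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

def decodes_as_utf8(data):
    """Return True if data is well-formed UTF-8 (Python decodes strictly by default)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as_utf8("\U0001F336\uFE0F".encode("utf-8")))  # True: valid hot pepper
print(decodes_as_utf8(b"\xc0\xaf"))      # False: overlong encoding of '/'
print(decodes_as_utf8(b"\xed\xa0\x80"))  # False: encoded surrogate U+D800
print(decodes_as_utf8(b"\xff"))          # False: byte that never occurs in UTF-8
```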

0 votes, jgawrych, 10/5/2023

Amazing. Encoding is so much more complex than I could imagine. Thank you for the very detailed response. The zalgo reference is very nostalgic!
