What determines how strings are encoded in memory?

Asked by: gowerc Asked: 10/27/2022 Last edited by: AmigoJack, gowerc Updated: 11/3/2022 Views: 365

Q:

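Suppose we have a file encoded in Latin-1 and we read it into memory using a text editor. My questions are:

  • How will these strings be represented in memory? Latin-1, UTF-8, UTF-16, or something else?
  • What determines how these strings are represented in memory? Is it the application, the programming language the application was written in, the operating system, or the hardware?

As a follow-up question:

  • How does the application then save the file in an encoding scheme that uses a different character set? E.g. converting UTF-8 to UTF-16 seems fairly intuitive to me, since I assume you just decode to Unicode code points and then encode to the target encoding. But what about going from UTF-8 to Shift-JIS, which has a different character set?
Unicode UTF-8 character-encoding ISO-8859-1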

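Comments

2 upvotes Mark Ransom 10/27/2022
Windows, for example, can use UTF-16, so it is easier if the application uses it too.
2 upvotes Mark Ransom 10/27/2022
Converting from one encoding to another is a complex enough process that you will want to find a library or OS call to do it for you.
0 upvotes skomisa 11/1/2022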
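Not sure what your exact use case is, but this GitHub comment on Shift-JIS notes that "...the Unicode → JISX 0201/0208 → Unicode conversion is not a lossless conversion" (which you may already know). That comment also describes possible workarounds.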

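A:

1 upvote AmigoJack 11/3/2022 #1

Operating system

Programming language

Depends on their age or their compiler: while the language itself is not necessarily bound to an OS, the compiler that produces the binary may handle things differently depending on the OS.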

Application/program

Depends on the platform/OS. The in-memory representation of text is strongly influenced by the programming language's compiler and the data types it offers, but libraries (which may have been produced by entirely different compilers and programming languages) can introduce other representations into the same program.

Strictly speaking, the binary file format also has its own fixed encodings: on Windows, the PE format (used in EXE, DLL, etc.) stores resource strings in 16-bit characters. So while, for example, the Free Pascal Compiler can (as per the language) make heavy use of UTF-8, it will still build an EXE file with UTF-16 metadata in it.

Programs that deal with text (such as editors) will most likely keep text in memory "as is", in whatever encoding it came in, for the sake of performance. This comes with compromises, such as temporarily duplicating parts into strings of 32 bits per character just to search through them quickly, let alone to support Unicode normalization.
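To make the memory trade-off concrete, here is a minimal Python sketch (not part of the original answer) that encodes one sample string in several encodings and compares the byte counts. The sample text and the helper name encoded_sizes are assumptions chosen for illustration. Python's own internal str layout is an implementation detail and is not what is being measured here.

```python
# Hypothetical sample text; every character exists in Latin-1.
SAMPLE = "na\u00efve caf\u00e9"  # "naïve café", 10 code points

def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    encodings = ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}

sizes = encoded_sizes(SAMPLE)
for name, size in sizes.items():
    print(f"{name:10} {size} bytes")

# In UTF-8 a character index is not a byte offset, which is one reason a
# program might widen text to fixed-size units before searching it.
utf8 = SAMPLE.encode("utf-8")
char_index = SAMPLE.index("v")   # position counted in code points
byte_index = utf8.index(b"v")    # position counted in bytes
print(char_index, byte_index)
```

The 32-bit form is the largest of the four but lets a position in the text map directly to a fixed offset, which is the compromise described above.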

Conversion

The most common approach is to use a common denominator:

  • Every input is decoded into 32-bit characters (Unicode code points), which are then encoded into the target. This costs the most memory but is the easiest to work with.
  • In the WinAPI you convert either to UTF-16 via MultiByteToWideChar() or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you would take a detour: from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Which encodings are supported varies by Windows version and localized installation, so there is no real guarantee that all of them are available.
  • External libraries that specialize in encodings, such as iconv, can do this. They support many encodings independently of what the OS supports.
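The "common denominator" idea can be sketched in a few lines of Python, where the built-in str type plays the role of the decoded Unicode text. This is only an illustration under that assumption, not how any particular editor or the WinAPI does it. The function name transcode is hypothetical. The example also shows what happens when the target character set lacks a character: the euro sign has no mapping in Python's shift_jis codec.

```python
def transcode(data, source, target, errors="strict"):
    """Decode bytes to Unicode text, then encode that text to the target."""
    text = data.decode(source)          # bytes -> Unicode code points
    return text.encode(target, errors)  # Unicode code points -> bytes

# Characters that exist in both character sets convert cleanly.
utf8_bytes = "日本語".encode("utf-8")
sjis_bytes = transcode(utf8_bytes, "utf-8", "shift_jis")
print(len(utf8_bytes), len(sjis_bytes))

# A character missing from the target character set makes strict mode fail.
euro_utf8 = "€".encode("utf-8")
try:
    transcode(euro_utf8, "utf-8", "shift_jis")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace" the conversion succeeds but loses information.
lossy = transcode(euro_utf8, "utf-8", "shift_jis", errors="replace")
print(strict_failed, lossy)
```

Going between character sets is therefore the same two-step process as UTF-8 to UTF-16. The difference is that the encode step can fail or substitute characters, so the caller has to choose an error policy.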

Comments

0 upvotes Remy Lebeau 11/11/2022
Your information about Delphi is wrong. Its WideString was originally introduced to support Windows ActiveX/COM APIs, and by extension Win32 UCS-2/UTF-16 APIs as well. But its String type was never WideString. It was AnsiString (which was originally like the original Pascal String, holding arbitrary bytes) until 2009, when String became UnicodeString instead (not WideString), and AnsiString gained support for codepages (and thus UTF8String became a true UTF-8 type).
0 upvotes AmigoJack 11/11/2022
I disagree: I neither wrote that WideString and String have something in common, nor that UTF8String emerged out of WideString. I also didn't exclude String from existence. Or do you think my wording can be misunderstood?
0 upvotes Remy Lebeau 11/11/2022
Your description of Delphi didn't mention AnsiString at all, even though it was Delphi's default string type for a long time. But you described the byte strings of Pascal and FreePascal, so why not for Delphi? Just saying, there's some inconsistency in your answer regarding Unicode in various string types. And I wasn't implying that you said UTF8String emerged from WideString. It actually emerged from AnsiString.
0 upvotes AmigoJack 11/11/2022
I didn't mention everything, since the main difference between Pascal and Delphi is Unicode availability (not counting Delphi 1, but who does so anyway), which is the topic here. And my link to the PDF would explain all the other types. It's okay if you edit that part to what you think would be a better description. My intention was also to keep it from getting too long or detailed. Maybe you can even link to a better overview of how all the String types evolved over time.