What determines how strings are encoded in memory?

Question:

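Say we have a file that is Latin-1 encoded and we use a text editor to read that file into memory. My questions are:

How will those strings be represented in memory? Latin-1, UTF-8, UTF-16, or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS, or the hardware?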

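As a follow-up question:

How do applications then save files to encoding schemes that use a different character set? F.e. UTF-8 to UTF-16 seems fairly intuitive to me, as I assume you just decode to the Unicode code point and then encode to the target encoding. But what about going from UTF-8 to Shift-JIS, which has a different character set?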

Unicode UTF-8 character-encoding ISO-8859-1

OS

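Windows
- 1993: Windows adopted Unicode 1.0 with NT 3.1 - Unicode back then is what UCS-2 is today. That Windows version also introduced NTFS (New Technology File System), which likewise stores each filename in a UCS-2-like manner (16-bit code units).
- 2000: with NT 5.0 (a.k.a. Windows 2000), the shift/improvement from UCS-2 to UTF-16 happened. The OS was released that year, as was RFC 2781, which describes UTF-16 (the surrogate mechanism itself dates back to Unicode 2.0 in 1996).
Nothing has changed since then. Internally, Windows has used 16-bit code units for almost 30 years, and UTF-16 can still represent the latest code points, such as emojis. Its API works the same way: the byte-encoded compatibility functions are just stubs that convert their input to UTF-16.
Unix: most distributions use UTF-8 by default, since it is the most backwards compatible while still being future-proof enough.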
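
To make the UCS-2 vs. UTF-16 difference a bit more tangible, here is a minimal sketch in Python (my choice for illustration only; the helper name utf16_code_units is made up). It lists the 16-bit code units a string occupies in UTF-16: a character from the Basic Multilingual Plane needs one unit, which is all UCS-2 could express, while an emoji needs a surrogate pair.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16 (little endian)."""
    raw = text.encode("utf-16-le")  # the explicit "-le" variant writes no BOM
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

# U+00E9 (e with acute) lies in the BMP: one code unit, also valid UCS-2.
bmp_units = utf16_code_units("\u00e9")

# U+1F600 (an emoji) lies outside the BMP: UTF-16 needs a surrogate pair.
emoji_units = utf16_code_units("\U0001F600")

print([hex(unit) for unit in bmp_units])
print([hex(unit) for unit in emoji_units])
```

Software that still assumes "one 16-bit unit equals one character" would see the emoji as two separate, individually meaningless units.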

Programming language

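Depends on their age or their compilers: while a language itself is not necessarily bound to an OS, the compiler that produces the binaries may treat things differently per OS.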

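Pascal: in 1970 a String was just an array of bytes and did not even necessarily mean text. For text, ASCII or one of the other single-byte encodings could be handled easily.
Delphi: following Windows' adoption of Unicode, WideString handles 16 bits per character to make perfect use of the WinAPI and its Unicode support. Later additions brought UTF8String, which uses bytes again, but not necessarily only one byte per character. Since 2009, creations like UCS4String are available too, taking 4 bytes per character.
Free Pascal: keeps the old String, but always defaults to UTF-8 encoding. While this always requires conversion when using the WinAPI, it is also more platform independent. Several other String (compatibility) types exist as well, each with different memory usage.
ECMAScript (JavaScript): as per the standard, engines should use UTF-16 for text. See also "JavaScript strings - UTF-16 vs UCS-2?"
Java: engines must support a minimum set of encodings, including UTF-16, so internal String handling and memory usage may differ between them. See also "What is the Java's internal represention for String? Modified UTF-8? UTF-16?"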
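
How much the chosen representation matters for memory can be shown with a small sketch (Python again, purely for illustration; encoded_sizes is a hypothetical helper, not part of any of the languages above). It encodes the same ten characters, all of which exist in Latin-1, in four encodings and counts the bytes.

```python
def encoded_sizes(text, codecs=("latin-1", "utf-8", "utf-16-le", "utf-32-le")):
    """Map each codec name to the number of bytes text occupies in it."""
    return {codec: len(text.encode(codec)) for codec in codecs}

# "naive cafe" with two accented letters: 10 characters, all within Latin-1.
sample = "na\u00efve caf\u00e9"

for codec, size in encoded_sizes(sample).items():
    print(f"{codec:10} {size:3} bytes")
```

For this sample the sizes come out as 10, 12, 20 and 40 bytes: one byte per character in Latin-1, two extra bytes in UTF-8 for the two accented letters, and a fixed 2 or 4 bytes per character in UTF-16 and UTF-32.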

Application/program

Depends on the platform/OS. While the in-memory consumption of text is strongly influenced by the programming language's compiler and the data types used there, using libraries (which could have been produced by entirely different compilers and programming languages) can mix things up.

Strictly speaking, the binary file format also dictates its own encodings: on Windows the PE format (used in EXE, DLL, etc.) stores resource strings in 16-bit characters again. So while, for example, the Free Pascal Compiler can (as per the language) make heavy use of UTF-8, it will still build an EXE file with UTF-16 metadata in it.

Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, though with compromises such as temporarily duplicating parts into strings of 32 bits per character just to search through them quickly, let alone to support Unicode normalization.
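
Two of those compromises can be sketched as follows (a rough Python illustration under my own assumptions, not how any particular editor works; code_point_at and normalized_find are hypothetical names). With 32 bits per character the n-th character sits at a fixed byte offset, and without normalization a search can miss text that looks identical on screen.

```python
import unicodedata

def code_point_at(utf32_le_bytes, index):
    """Fixed-width lookup: the character at `index` starts at byte 4 * index."""
    start = 4 * index
    return int.from_bytes(utf32_le_bytes[start:start + 4], "little")

def normalized_find(haystack, needle):
    """Search after bringing both strings into the same normalization form (NFC)."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle)
    )

# Temporary 32-bit copy of a text that contains a non-BMP character.
buffer = "a\U0001F600b".encode("utf-32-le")
print(hex(code_point_at(buffer, 1)))

# "cafe" + combining acute accent vs. the precomposed letter U+00E9.
decomposed = "cafe\u0301"
precomposed_e = "\u00e9"
print(decomposed.find(precomposed_e))             # plain search
print(normalized_find(decomposed, precomposed_e)) # search after normalization
```

The plain search fails because the two spellings consist of different code points, which is the kind of case normalization is meant to handle.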

Conversion

The most common approach is to use a common denominator:

- Every input is decoded into 32-bit characters, which are then encoded into the target. This costs the most memory but is the easiest to deal with.
- In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd take a detour from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Which encodings are supported varies per Windows version and localized installation, so there is no real guarantee that all of them are available.
- External libraries specialized in encodings alone can do this, like iconv. These support many encodings independently of what the OS supports.
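
The "common denominator" idea can be sketched in a few lines of Python (an illustration only, using Python's built-in codecs rather than the WinAPI or iconv; transcode is a hypothetical helper). The bytes are decoded into Unicode code points and then encoded into the target. A character the target character set lacks either raises an error or has to be replaced.

```python
def transcode(data, source, target, errors="strict"):
    """Decode bytes into Unicode code points, then encode them for the target."""
    text = data.decode(source)          # source bytes -> code points
    return text.encode(target, errors)  # code points -> target bytes

# Three Japanese characters exist in both character sets.
utf8_bytes = "\u65e5\u672c\u8a9e".encode("utf-8")
sjis_bytes = transcode(utf8_bytes, "utf-8", "shift_jis")
print(len(utf8_bytes), "bytes in UTF-8,", len(sjis_bytes), "bytes in Shift-JIS")

# An emoji has no counterpart in Shift-JIS.
emoji_utf8 = "\U0001F600".encode("utf-8")
try:
    transcode(emoji_utf8, "utf-8", "shift_jis")
except UnicodeEncodeError as exc:
    print("not representable:", exc.reason)

# With a lossy error handler the character is substituted instead.
print(transcode(emoji_utf8, "utf-8", "shift_jis", errors="replace"))
```

So converting between different character sets works the same way as UTF-8 to UTF-16, except that the application must decide what to do with characters the target cannot express.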

Your information about Delphi is wrong. Its WideString was originally introduced to support Windows ActiveX/COM APIs, and by extension the Win32 UCS-2/UTF-16 APIs as well. But its String type was never WideString. It was AnsiString (which was originally like the original Pascal String, holding arbitrary bytes) until 2009, when String became UnicodeString instead (not WideString), and AnsiString gained support for codepages (and UTF8String thus became a true UTF-8 type).

0 votes AmigoJack 11/11/2022

I disagree: neither did I write that WideString and String have something in common, nor did I write that UTF8String emerged out of WideString. I also didn't exclude String from existence. Or do you think my formulation can be misunderstood?

0 votes Remy Lebeau 11/11/2022

Your description of Delphi didn't mention AnsiString at all, even though it was Delphi's default string type for a long time. But you described the byte strings of Pascal and FreePascal, so why not for Delphi? Just saying, there's some inconsistency in your answer regarding Unicode in various string types. And I wasn't implying that you said UTF8String emerged from WideString. It actually emerged from AnsiString.

0 votes AmigoJack 11/11/2022

I didn't mention everything, since the main difference of Pascal vs. Delphi is Unicode availability (not counting Delphi 1, but who does so anyway), which is the topic here. And my link to the PDF would explain all the other different types. It's okay if you edit that part to what you think would be a better description - my intention was also to make it not too long/detailed - maybe you can even link to a better overview of how all the String types evolved over time.
