Is there any way to convert a non-ASCII character to an integer and then convert that integer back?

Question:

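I am asking whether there is any way to convert a UTF-8 character to an integer and then convert that integer back to the UTF-8 character. I am sure there is a way to do this with built-in functions, but I will use any external function that gives me the result I am looking for.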

This is easy to do with ASCII symbols

code:

std::cout << (int) 'a' << "\n" << (char) 97;

output:

97
a

However, it does not work with non-ASCII symbols

code:

std::cout << R"((int) 'Ü' = )" << (int) 'Ü' << "\n";
std::cout << "(char) 50076 = " << (char) 50076 << "\n";

output:

(int) '├£' = 50076
(char) 50076 = £

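For some reason, 'Ü' is interpreted as a multi-character constant (while compiling I also got the warning <warning: multi-character character constant [-Wmultichar]>). It also seems to give me only the number of the last symbol of the multi-character sequence ├£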

I tried to get the numbers of ├ and £ using the same method as before, but they were also interpreted as multi-character constants.

I thought the program might be giving these results because of limitations of std::cout, so I tried writing the output to file.txt

code:

std::ofstream file ("file.txt");
file << 'Ü' << "\n" << "Ü" << (char) 50076;
file.close();

That did not help me

output:

50076
Ãœœ

I have been stuck on this problem for a while. I will accept any suggestions.

C++ UTF-8

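Multi-character constants are always handled in an implementation-defined way. Now, if you were asking how to convert a UTF-8 encoded string to an integer or integers, there would be an answer. Also note that there are no built-in C++ functions that do this. C++ predates the widespread adoption of Unicode, and the C++ standards committee has been catching up for a long time.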

0 votes john 6/3/2023

The conversion process is carefully explained here. Implementing it yourself is not too big a task.

0 votes Pepijn Kramer 6/3/2023

Also try to avoid "C"-style casts like (char)value; prefer static_cast<char>(value).

2 votes Sam Varshavchik 6/4/2023

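Have you already read the Wikipedia article on UTF-8? If not, why not? The information in that article provides everything needed to carry out the task described.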

Answers:

1 vote RandomBits 6/4/2023 #1

I think what you want is to transcode between UTF-8 and UTF-32. Wikipedia is a good resource for a general explanation.

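Several existing libraries support this. Among the full-featured options are ICU and Boost. If you only need transcoding, the header-only UTF8-CPP library is a lightweight option.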

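If these do not fit your requirements, the following code demonstrates how to transcode code points in the range 0x0000'0000 to 0x0010'FFFF with a small amount of code. Since the code does not map all uint32_t values to utf-8 sequences, error checking would need to be added.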

Sample Code

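#include <cstdint>   // uint8_t, uint32_t
#include <iostream>
#include <iterator>  // std::back_inserter
#include <string>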

using std::cout, std::endl;

template<class Iterator>
void codepoint_to_utf8(uint32_t code, Iterator iter) {
    if (code <= 0x007F) {
        *iter++ = static_cast<uint8_t>(code bitand 0x7F);
    }
    else if (code <= 0x07FF) {
        *iter++ = 0xC0 bitor static_cast<uint8_t>((code >> 6) bitand 0x1F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
    else if (code <= 0xFFFF) {
        *iter++ = 0xE0 bitor static_cast<uint8_t>((code >> 12) bitand 0x0F);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 6) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
    else if (code <= 0x10'FFFF) {
        *iter++ = 0xF0 bitor static_cast<uint8_t>((code >> 18) bitand 0x07);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 12) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 6) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
}

template<class Iterator>
uint32_t utf8_sequence_to_codepoint(Iterator iter) {
    auto byte = static_cast<uint32_t>(*iter++);
    if (byte <= 0x7F) {
        return byte;
    }
    else if ((byte bitand 0xE0) == 0xC0) {
        uint32_t r = (byte bitand 0x1F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else if ((byte bitand 0xF0) == 0xE0) {
        uint32_t r = (byte bitand 0x0F) << 12;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else if ((byte bitand 0xF8) == 0xF0) {
        uint32_t r = (byte bitand 0x07) << 18;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 12;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else
        return 0xFFFF'FFFF;
}

void examine(uint32_t code) {
    std::string str;
    codepoint_to_utf8(code, std::back_inserter(str));
    cout << std::hex << code << " = " << str << " = ";
    cout << std::hex << utf8_sequence_to_codepoint(str.begin()) << endl;
}

int main(int argc, const char *argv[]) {
    examine(0x0024);
    examine(0x00A3);
    examine(0x0418);
    examine(0x0939);
    examine(0x20AC);
    examine(0xD55C);
    examine(0x1'0348);
    return 0;
}

Output

24 = $ = 24
a3 = £ = a3
418 = И = 418
939 = ह = 939
20ac = € = 20ac
d55c = 한 = d55c
10348 = 𐍈 = 10348
