Asked by: Lavam Asked: 6/3/2023 Last edited by: EvgLavam Updated: 6/4/2023 Views: 115
Is there any way to convert a non-ASCII character to an integer and then convert that integer back?
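Q:
I am asking whether there is any way to convert a UTF-8 character to an integer and then convert that integer back to a UTF-8 character. I am sure there is a way to do this with built-in functions, but I would use any external function that gives me the result I am looking for.
It is easy to do with ASCII symbols: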
code:
std::cout << (int) 'a' << "\n" << (char) 97;
output:
97
a
However, it does not work with non-ASCII symbols:
code:
std::cout << R"((int) 'Ü' = )" << (int) 'Ü' << "\n";
std::cout << "(char) 50076 = " << (char) 50076 << "\n";
output:
(int) 'Ü' = 50076
(char) 50076 = £
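For some reason, 'Ü' is interpreted as a multi-character constant (when compiling I also get the warning <warning: multi-character character constant [-Wmultichar]>). It seems to give me only the number of the last symbol of the multichar ├£.
I tried getting the numbers of ├ and £ with the same method as before, but they are also interpreted as multi-character constants.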
I thought the program might give this result because of a limitation of std::cout, so I tried writing the output to file.txt
code:
std::ofstream file ("file.txt");
file << 'Ü' << "\n" << "Ü" << (char) 50076;
file.close();
This did not help either
output:
50076
ܜ
I have had this problem for a while. I would welcome any suggestions.
A:
1 upvote
RandomBits
6/4/2023
#1
I think what you want is to transcode between UTF-8 and UTF-32. Wikipedia is a good resource for a general explanation.
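Several existing libraries support this. Among the full-featured options there are ICU and Boost. For transcoding only, the header-only UTF8-CPP library is a lightweight option.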
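If those do not meet your requirements, the following code demonstrates how to transcode code points in the range 0x0000'0000 to 0x0010'FFFF with a small amount of code. Since the code does not map every uint32_t value to a utf-8 sequence, error checking would need to be added.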
Sample code
#include <cstdint>
#include <iostream>
#include <iterator>
#include <string>
using std::cout, std::endl;
// Encode one code point as UTF-8 and write the bytes through an output iterator.
// Values above 0x10'FFFF produce no output; surrogates are not rejected.
template<class Iterator>
void codepoint_to_utf8(uint32_t code, Iterator iter) {
    if (code <= 0x007F) {
        // 1 byte: 0xxxxxxx
        *iter++ = static_cast<uint8_t>(code bitand 0x7F);
    }
    else if (code <= 0x07FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        *iter++ = 0xC0 bitor static_cast<uint8_t>((code >> 6) bitand 0x1F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
    else if (code <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *iter++ = 0xE0 bitor static_cast<uint8_t>((code >> 12) bitand 0x0F);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 6) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
    else if (code <= 0x10'FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *iter++ = 0xF0 bitor static_cast<uint8_t>((code >> 18) bitand 0x07);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 12) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>((code >> 6) bitand 0x3F);
        *iter++ = 0x80 bitor static_cast<uint8_t>(code bitand 0x3F);
    }
}

// Decode the UTF-8 sequence that starts at iter.
// Continuation bytes are not validated, and the caller must make sure
// enough bytes are available. Returns 0xFFFF'FFFF for an invalid lead byte.
template<class Iterator>
uint32_t utf8_sequence_to_codepoint(Iterator iter) {
    auto byte = static_cast<uint32_t>(*iter++);
    if (byte <= 0x7F) {
        return byte;
    }
    else if ((byte bitand 0xE0) == 0xC0) {
        uint32_t r = (byte bitand 0x1F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else if ((byte bitand 0xF0) == 0xE0) {
        uint32_t r = (byte bitand 0x0F) << 12;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else if ((byte bitand 0xF8) == 0xF0) {
        uint32_t r = (byte bitand 0x07) << 18;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 12;
        r |= (static_cast<uint32_t>(*iter++) bitand 0x3F) << 6;
        r |= static_cast<uint32_t>(*iter++) bitand 0x3F;
        return r;
    }
    else
        return 0xFFFF'FFFF;
}

// Round trip: code point -> UTF-8 string -> code point, printed in hex.
void examine(uint32_t code) {
    std::string str;
    codepoint_to_utf8(code, std::back_inserter(str));
    cout << std::hex << code << " = " << str << " = ";
    cout << std::hex << utf8_sequence_to_codepoint(str.begin()) << endl;
}

int main(int argc, const char *argv[]) {
    examine(0x0024);
    examine(0x00A3);
    examine(0x0418);
    examine(0x0939);
    examine(0x20AC);
    examine(0xD55C);
    examine(0x1'0348);
    return 0;
}
Output
24 = $ = 24
a3 = £ = a3
418 = И = 418
939 = ह = 939
20ac = € = 20ac
d55c = 한 = d55c
10348 = 𐍈 = 10348