Trapping strange ASCII characters in C

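Question:

I am writing a program that builds a word list from all the words in a document, book, or song. The book I am starting with is A Tale of Two Cities by Charles Dickens. I downloaded it from the Project Gutenberg website, where it is formatted as a UTF-8 file.

I am running Ubuntu 22.04.

I have run into a number of seemingly innocuous characters in the text that are apparently not standard ASCII.

In the code below, the offending characters are the left and right quotation marks around "Wo-ho" and the apostrophe in "you’re".

If I step through the line character by character, I can see that these are non-ASCII and produce strange ASCII codes: -30, -128, -100 for the left quote; -30, -128, -99 for the right quote; and -30, -128, -103 for the apostrophe.

If anyone can help me understand why this is happening, I would greatly appreciate it.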

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    char            *lineIn     = "“Wo-ho!” said the coachman. “So, then! One more pull and you’re at the";
    const   char    delim       = ' ';
    int             strLen      = strlen(lineIn);
    int             i           = 0;

    printf("Start - %s\n", lineIn);
    printf("\n");

    for (i = 0; i < strLen + 1; i++)
    {   
         
        if (isalpha(lineIn[i])) {
            printf("Alpha - (%c) ", lineIn[i]);
        } else if (iscntrl(lineIn[i])) {
            printf("Cntrl - %c\n", lineIn[i]);
        } else if (isxdigit(lineIn[i])) {
            printf("Hex - %c\n", lineIn[i]);
        } else if (isascii(lineIn[i])) {
            printf("Asc - %c\n", lineIn[i]);
        } else if (lineIn[i] == delim) {
            printf("\n");
        } else { 
            printf("Unk - %d\n", lineIn[i]);
        }
    }   

    return 0;
}

Output of the above:

Start - “Wo-ho!” said the coachman. “So, then! One more pull and you’re at the

Unk - -30
Unk - -128
Unk - -100
Alpha - (W) Alpha - (o) Asc - -
Alpha - (h) Alpha - (o) Asc - !
Unk - -30
Unk - -128
Unk - -99
Asc -  
Alpha - (s) Alpha - (a) Alpha - (i) Alpha - (d) Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Asc -  
Alpha - (c) Alpha - (o) Alpha - (a) Alpha - (c) Alpha - (h) Alpha - (m) Alpha - (a) Alpha - (n) Asc - .
Asc -  
Unk - -30
Unk - -128
Unk - -100
Alpha - (S) Alpha - (o) Asc - ,
Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Alpha - (n) Asc - !
Asc -  
Alpha - (O) Alpha - (n) Alpha - (e) Asc -  
Alpha - (m) Alpha - (o) Alpha - (r) Alpha - (e) Asc -  
Alpha - (p) Alpha - (u) Alpha - (l) Alpha - (l) Asc -  
Alpha - (a) Alpha - (n) Alpha - (d) Asc -  
Alpha - (y) Alpha - (o) Alpha - (u) Unk - -30
Unk - -128
Unk - -103
Alpha - (r) Alpha - (e) Asc -  
Alpha - (a) Alpha - (t) Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Cntrl -

Tags: c, utf-8, ascii

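Answer:

As you mentioned, you have a UTF-8 encoded text file. This means a single character can take up to 4 bytes, although the first 128 Unicode characters map one-to-one onto ASCII in UTF-8 and need only 1 byte each. So if the text is in English, most of it will look exactly as if it were ASCII, and only characters such as the curly quotes will show up as multi-byte sequences.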

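You can read Wikipedia's description of how UTF-8 encodes characters.

From the encoding table there, you can check whether the byte you have just read falls into one of the following ranges: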

UTF-8        character
first byte   size

0xxxxxxx      1 byte
110xxxxx      2 bytes
1110xxxx      3 bytes
11110xxx      4 bytes

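The character is then 1, 2, 3, or 4 bytes long, respectively. This comes in handy if you decide to "translate" such characters into some ASCII equivalent, or if you want to filter them out (remove them) entirely.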
