Reading lines with different data types from txt file in c

Asked by: Szabolcs  Asked: 11/5/2023  Last edited by: John Gordon, Szabolcs  Updated: 11/7/2023  Views: 75

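Q:

How can I read multiple lines from a txt file if I want to store the data in different variables? Every line contains the same sequence of data types: int string string char string, separated by tabs.

For example, one line in the txt file looks like this:

11 \t I would like an apple \t What is your favourite car brand? \t b \t elephant 

Thank you in advance for your help.

I tried fscanf("%d\t%s\t%s\t%c\t%s\n", ..); but I could not read the strings, because %s cuts my sentence at the first space. It also reads only the first line, and I cannot move on to the next line.

Tags: C, string, FileReader, file reading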

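Comments

0 upvotes Shawn 11/5/2023
Read a line at a time with fgets(). Get a field at a time with strtok() or whatever, and copy the strings, use strtol(), ...
0 upvotes chqrlie 11/5/2023
@Shawn: strtok() is not appropriate if any of the fields can be empty, same as fscanf(). On POSIX systems you can use strsep().
1 upvote Allan Wind 11/5/2023
Instead of handing your task to us, write some code and ask for help where you get stuck. My first suggestion would be to use an existing tsv library. If this is homework, then I suggest you separate the reading of the data from the parsing. It simplifies error handling. You can use fgets(), but it requires a known fixed-size input string; getline() can allocate the line for you. You can parse the data with sscanf(), but you always want to use a maximum field width when reading strings to avoid buffer overflow. This implies fixed-size string fields.
0 upvotes Allan Wind 11/5/2023
If you cannot assume a fixed string size, then you need to search the input string for the delimiter. You could use strtok(), but it modifies the input, so consider using strchr() or strpbrk() instead. You can use sscanf() to convert the first field from a string to an int, or use strtol(). Consider using a struct to hold these 5 related variables. Naming them will make your code easier to read. It also gets easier if you want to keep them all in an array (or linked list).
0 upvotes chux - Reinstate Monica 11/5/2023
@Szabolcs, what is the longest line that may be read?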

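A:

0 upvotes arfneto 11/7/2023 #1

This is a csv file, and scanf was written to consume this type of data: large files of tabular data. And scanf is really good at it. But scanf (and also sscanf and fscanf) skips white space. And white space includes tabs, spaces and newlines.

So, using a tab as the delimiter, your file is problematic. Another problem is that many editors convert tabs to spaces up to the next tab stop column, so the tab may not be recorded at all. TAB as a symbol is more common in document editors (like Microsoft Word), which can print a symbol for every tab, newline, paragraph mark and so on. On Unix/Linux/Mac you can use set list in vi to see each TAB as ^I.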

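I will show an example using 2 alternatives: [1] use sscanf and [2] parse the line in code.

As usual, it is easier to use encapsulation and pointers, and to treat each record as some sort of object, so that is the approach I will use here.

Here is the definition used for each record in the example: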

typedef struct
{
    int  f_int;
    char f_string_1[80];
    char f_string_2[80];
    char f_char;
    char f_string_3[80];
} Record;

In practice, this is better:

typedef struct
{
    int   f_int;
    char* f_string_1;
    char* f_string_2;
    char  f_char;
    char* f_string_3;
} P_Record;

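Converting one into the other is trivial, and the code includes some functions that do it. The main reason to use pointers in the record is to use only the amount of RAM that is needed, instead of 240 bytes for each set of strings.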

File used in the example

11\tI would like an apple\tWhat is your favourite car brand?\tb\telephant
-11\tI would like an apple\tWhat is your favourite car brand?\tb\telephant
0\t \t \t \tStack Overflow

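This is almost the same as in the original example, but I removed the extra spaces. The blank fields in the last record have at least one space, in order to test how scanf consumes them. For the local parser this makes no difference.

The \t is, of course, replaced by the delimiter in use.

Functions used in the code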

Record* so_free(Record*);
char    so_get_delim(const char*, const char);
Record* so_parse(const char*, const size_t, const char);
Record* so_parse_sc(const char*, const size_t, const char);
int     so_show(Record*, const char*);
int     so_show_parms(const char* f_name, const char delim);

// conversion helpers
P_Record* so_free_pack(P_Record*);
P_Record* so_pack(Record*);
int       so_show_pack(P_Record*, const char*);
Record*   so_unpack(P_Record*);

These are self-explanatory, but:

  • so_parse takes a line and, by parsing it, returns a Record with the extracted fields.
  • so_parse_sc does the same thing, but using sscanf.

main for the test

int main(int argc, char** argv)
{
    const char* df_file    = "input.txt";
    const char  df_delim   = ',';
    char        line[1024] = {0};
    if (argc > 1)
        strcpy(line, argv[1]);
    else
        strcpy(line, df_file);

    char delim = df_delim;
    if (argc > 2) delim = so_get_delim(argv[2], df_delim);
    so_show_parms(line, delim);

    FILE* in = fopen(line, "r");
    if (in == NULL) return -1;
    char*  p      = NULL;
    size_t n_line = 0;
    char   r_msg[40];
    while (NULL != (p = fgets(line, sizeof(line), in)))
    {
        // fgets keeps the '\n' when the line fits in the buffer
        line[strcspn(line, "\n")] = 0;
        n_line += 1;
        // local parser
        sprintf(r_msg, "\nRecord %zu\n", n_line);
        Record* one = so_parse(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);

        one = so_free(one);
        // using sscanf
        sprintf(
            r_msg, "\n[using sscanf]\nRecord %zu\n",
            n_line);
        one = so_parse_sc(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);
        one = so_free(one);
    };  // while
    fclose(in);
    return 0;
}

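It takes two arguments: the file name and the delimiter. The defaults are "input.txt" and a comma for the delimiter. The delimiter can be entered as the character itself, like , or ;, as \t for a tab, or as a decimal value \nnn, like \064 for @.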

As expected, when TAB is the delimiter so_parse works fine, but so_parse_sc cannot parse some lines.

Output using a comma as the delimiter


C: SO> p input.txt ","

 file is "input.txt", delimiter is ',' = 0x2C

Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 3
             int: 0
        string 1: " "
        string 2: " "
            char: ' '
        string 3: "Stack Overflow"

[using sscanf]
Record 3
             int: 0
        string 1: " "
        string 2: " "
            char: ' '
        string 3: "Stack Overflow"

C: SO>

Output using TAB as the delimiter


C: SO> ..\x64\debug\soc23-1104-fread.exe input-tab.txt "\t"

 file is "input-tab.txt", delimiter is 0x9

Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 3
             int: 0
        string 1: "."
        string 2: "."
            char: '.'
        string 3: "Stack Overflow"

[using sscanf]
Record 3
             int: 0
        string 1: "."
        string 2: "."
            char: '.'
        string 3: "Stack Overflow"

C: SO>

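And scanf cannot read the last record when its fields are blank, because the delimiter is white space too, just like the blank fields of the record.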

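So why use scanf

This function can, in a single call, parse the delimiters and convert types such as strings, char, float, double and int. scanf is generally able to consume and convert any valid csv file. As for what counts as valid, we can try an online csv validator or read RFC 4180. There is no really formal definition, since the format predates the internet and the W3C by quite some time.

The mask used here is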

    char mask[] =
        "%dx%79[^x]x%79[^x]x%cx%79[^x\n]";

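where x is the delimiter in use. It can parse the strings, the char and the int values. In production code:

  • the mask could be built more precisely, instead of fixing 79 for the 80-byte fields. :)
  • we need to know whether the first line has the field names, and whether we need them (see the RFC). Here the first line has normal data.
  • we need to know whether fields are escaped and, if so, by which character. For example, it is very common for fields to be enclosed in " (see the RFC). Here fields are not escaped, so they cannot contain the delimiter.
  • we have 5 specifiers, for 5 fields, so scanf can return anything from -1 to 5.
  • a csv is a giant MxN table of M records with N fields, so every one of the M lines must have N-1 delimiters: here N=5 fields and 4 delimiters.
  • x%79[^x]x means up to 79 characters that are not x, between two x delimiters.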

Complete C code


#define _CRT_SECURE_NO_WARNINGS

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    int  f_int;
    char f_string_1[80];
    char f_string_2[80];
    char f_char;
    char f_string_3[80];
} Record;

typedef struct
{
    int   f_int;
    char* f_string_1;
    char* f_string_2;
    char  f_char;
    char* f_string_3;
} P_Record;

Record* so_free(Record*);
char    so_get_delim(const char*, const char);
Record* so_parse(const char*, const size_t, const char);
Record* so_parse_sc(const char*, const size_t, const char);
int     so_show(Record*, const char*);
int     so_show_parms(const char* f_name, const char delim);

// conversion helpers
P_Record* so_free_pack(P_Record*);
P_Record* so_pack(Record*);
int       so_show_pack(P_Record*, const char*);
Record*   so_unpack(P_Record*);

/// <summary>
/// defaults are "input.txt" for file name and
/// ',' comma for the delimiter
/// </summary>
/// <param name="argc"></param>
/// <param name="argv">
/// argv[1] is the file name
/// argv[2] is the delimiter. can be \nnn decimal or \t or
/// the delimiter itself
/// </param>
/// <returns></returns>
int main(int argc, char** argv)
{
    const char* df_file    = "input.txt";
    const char  df_delim   = ',';
    char        line[1024] = {0};
    if (argc > 1)
        strcpy(line, argv[1]);
    else
        strcpy(line, df_file);

    char delim = df_delim;
    if (argc > 2) delim = so_get_delim(argv[2], df_delim);
    so_show_parms(line, delim);

    FILE* in = fopen(line, "r");
    if (in == NULL) return -1;
    char*  p      = NULL;
    size_t n_line = 0;
    char   r_msg[40];
    while (NULL != (p = fgets(line, sizeof(line), in)))
    {
        // fgets keeps the '\n' when the line fits in the buffer
        line[strcspn(line, "\n")] = 0;
        n_line += 1;
        // local parser
        sprintf(r_msg, "\nRecord %zu\n", n_line);
        Record* one = so_parse(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);

        one = so_free(one);
        // using sscanf
        sprintf(
            r_msg, "\n[using sscanf]\nRecord %zu\n",
            n_line);
        one = so_parse_sc(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);
        one = so_free(one);
    };  // while
    fclose(in);
    return 0;
}

/// <summary>
/// free...
/// </summary>
/// <param name="one"></param>
/// <returns>returns NULL</returns>
Record* so_free(Record* one)
{
    if (one == NULL) return NULL;
    free(one);
    return NULL;
}

/// <summary>
/// free a packed record
/// </summary>
/// <param name="one"></param>
/// <returns>returns NULL</returns>
P_Record* so_free_pack(P_Record* one)
{
    if (one == NULL) return NULL;
    free(one->f_string_1);
    free(one->f_string_2);
    free(one->f_string_3);
    free(one);
    return NULL;
}

/// <summary>
/// get argument from arg.
/// </summary>
/// <param name="arg">can be a char or \t for a tab
/// or \nnn for a decimal value</param>
/// <returns>delimiter</returns>
char so_get_delim(const char* arg, const char df_delim)
{  // argument should be \t or \nnn decimal
    char delim = df_delim;
    if (arg[0] == '\\')
    {
        if (arg[1] == 't')
            delim = '\t';
        else
        {
            if (strlen(arg) > 3)
                delim = (arg[1] - '0') * 100 +
                        (arg[2] - '0') * 10 +
                        (arg[3] - '0');
            else
                delim = df_delim;
        }
    }
    else
        delim = arg[0];
    return delim;
}

/// <summary>
/// returns a new packed record from a record
/// </summary>
/// <param name="src"></param>
/// <returns></returns>
P_Record* so_pack(Record* src)
{
    size_t len = 0;
    if (src == NULL) return NULL;
    // allocate the struct itself, not the size of a pointer
    P_Record* one = malloc(sizeof(P_Record));
    if (one == NULL) return NULL;
    one->f_int      = src->f_int;  // field 1
    len             = strlen(src->f_string_1);
    one->f_string_1 = malloc(1 + len);
    if (one->f_string_1 == NULL)
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_1, src->f_string_1);  // field 2
    // now for the 2nd string
    len             = strlen(src->f_string_2);
    one->f_string_2 = malloc(1 + len);
    if (one->f_string_2 == NULL)
    {
        free(one->f_string_1);
        free(one);
        return NULL;
    }
    strcpy(one->f_string_2, src->f_string_2);  // field 3
    // now for the single char
    one->f_char = src->f_char;  // field 4;
    // now for the last string
    len             = strlen(src->f_string_3);
    one->f_string_3 = malloc(1 + len);
    if (one->f_string_3 == NULL)
    {
        free(one->f_string_1);
        free(one->f_string_2);
        free(one);
        return NULL;
    }
    strcpy(one->f_string_3, src->f_string_3);  // field 5
    return one;
}

/// <summary>
/// parse a line to get a Record
/// </summary>
/// <param name="line"></param>
/// <param name="limit"></param>
/// <param name="delim"></param>
/// <returns>pointer to a new Record</returns>
Record* so_parse(
    const char* line, size_t limit, const char delim)
{
    if (line == NULL) return NULL;
    size_t len = strlen(line);
    if (len > limit) return NULL;
    const size_t n_tabs  = 4;  // 5 fields
    size_t       tabs[5] = {0};  // tabs[0] counts delimiters
    const char*  p       = line;
    // check line format
    for (size_t i = 0; i < len; i += 1)
    {
        if (*p == delim)
        {
            // a 5th delimiter would write past tabs[4]
            if (tabs[0] == n_tabs) return NULL;
            tabs[0] += 1;
            tabs[tabs[0]] = i;
        }
        p++;
    }
    if (tabs[0] != n_tabs) return NULL;
    // line has 5 fields:
    //   create record
    //   extract fields
    Record* nr = malloc(sizeof(Record));
    if (nr == NULL) return NULL;
    // first field is int
    nr->f_int    = atoi(line);
    const char* begin = NULL;
    const char* end   = NULL;
    size_t      fl    = 0;

    // now for the 1st string
    // fl counts the leading delimiter, so the field has
    // fl - 1 chars and needs fl bytes with the terminator
    begin = line + tabs[1];
    end   = line + tabs[2];
    fl    = end - begin;
    if (fl > sizeof(nr->f_string_1)) return so_free(nr);
    memcpy(nr->f_string_1, begin + 1, fl - 1);
    nr->f_string_1[fl - 1] = 0;  // terminate string
    // now for the 2nd string
    begin = line + tabs[2];
    end   = line + tabs[3];
    fl    = end - begin;
    if (fl > sizeof(nr->f_string_2)) return so_free(nr);
    memcpy(nr->f_string_2, begin + 1, fl - 1);
    nr->f_string_2[fl - 1] = 0;  // terminate string
    // now for the single char
    // format: <tab3><field><tab4>
    nr->f_char = *(line + tabs[3] + 1);
    // now for the last string
    begin = line + tabs[4];
    end   = line + len;
    fl    = end - begin;
    if (fl > sizeof(nr->f_string_3)) return so_free(nr);
    memcpy(nr->f_string_3, begin + 1, fl - 1);
    nr->f_string_3[fl - 1] = 0;  // terminate string
    return nr;
}

/// <summary>
/// build a record from a line, using sscanf
/// </summary>
/// <param name="line"></param>
/// <param name="limit"></param>
/// <returns>pointer to Record</returns>
Record* so_parse_sc(
    const char* line, size_t limit, const char delim)
{
    if (line == NULL) return NULL;
    // should use the size of the strings and not fix 79
    //  ("%d\t%s\t%s\t%c\t%s\n",..)
    char mask[] =
        "%dx%79[^x]x%79[^x]x%cx%79[^x\n]";
    // change mask for delimiter in use: every 'x' comes
    // before the '\n', so the loop can stop there
    for (int i = 0; mask[i] != '\n'; i += 1)
        if (mask[i] == 'x') mask[i] = delim;
    size_t len = strlen(line);
    if (len > limit) return NULL;
    Record lcl;
    int    res = sscanf(
        line, mask, &lcl.f_int, lcl.f_string_1, lcl.f_string_2,
        &lcl.f_char, lcl.f_string_3);
    if (res != 5) return NULL;
    Record* nr = malloc(sizeof(Record));
    if (nr == NULL) return NULL;
    *nr = lcl;
    return nr;
}

/// <summary>
/// display Record contents
/// </summary>
/// <param name="one"></param>
/// <param name="msg"></param>
/// <returns>0 for success or -1</returns>
int so_show(Record* one, const char* msg)
{
    if (one == NULL) return -1;
    if (msg != NULL) printf("%s", msg);
    printf("\t     int: %d\n", one->f_int);
    printf("\tstring 1: \"%s\"\n", one->f_string_1);
    printf("\tstring 2: \"%s\"\n", one->f_string_2);
    printf("\t    char: '%c' \n", one->f_char);
    printf("\tstring 3: \"%s\"\n", one->f_string_3);
    return 0;
}

/// <summary>
/// display P_Record contents
/// </summary>
/// <param name="one"></param>
/// <param name="msg"></param>
/// <returns></returns>
int so_show_pack(P_Record* one, const char* msg)
{
    if (one == NULL) return -1;
    if (msg != NULL) printf("%s", msg);
    printf("\t     int: %d\n", one->f_int);
    printf("\tstring 1: \"%s\"\n", one->f_string_1);
    printf("\tstring 2: \"%s\"\n", one->f_string_2);
    printf("\t    char: '%c' \n", one->f_char);
    printf("\tstring 3: \"%s\"\n", one->f_string_3);
    return 0;
}

/// <summary>
/// show file name and delimiter in use
/// </summary>
/// <param name="f_name"></param>
/// <param name="delim"></param>
/// <returns>0</returns>
int so_show_parms(const char* f_name, const char delim)
{
    if (f_name == NULL) return -1;
    if (isprint(delim))
        printf(
            "\f file is \"%s\", delimiter is '%c' = "
            "0x%X\n",
            f_name, delim, delim);
    else
        printf(
            "\f file is \"%s\", delimiter is 0x%x\n",
            f_name, delim);
    return 0;
}

/// <summary>
/// convert from P_Record to Record
/// </summary>
/// <param name="src"></param>
/// <returns>pointer</returns>
Record* so_unpack(P_Record* src)
{
    size_t len = 0;
    if (src == NULL) return NULL;
    Record* one = malloc(sizeof(Record));
    if (one == NULL) return NULL;
    one->f_int = src->f_int;
    // size_t subtraction wraps around, so compare directly
    if (strlen(src->f_string_1) >= sizeof(one->f_string_1))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_1, src->f_string_1);  // field 2
    // now for the 2nd string
    if (strlen(src->f_string_2) >= sizeof(one->f_string_2))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_2, src->f_string_2);  // field 3
    // now for the single char
    one->f_char = src->f_char;  // field 4;
    // now for the last string
    if (strlen(src->f_string_3) >= sizeof(one->f_string_3))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_3, src->f_string_3);  // field 5
    return one;
}

// https://stackoverflow.com/questions/77423959/
// reading-lines-with-different-data-types-from-
// txt-file-in-c