Increasing the speed of reading a csv file C++

Question:

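I wrote this code to read and filter my csv files. It works the way I want it to for small files. But I just tried a file with 200k lines and it takes about 4 minutes, which is too long for my use case.

After some testing and fixing some quite stupid things, I got the time down to 3 minutes. I found that about half of the time is spent reading the file and half is spent generating the result vector.

Is there any way to improve the speed of my program? Especially the part that reads from the csv? I'm out of ideas at the moment. I would appreciate any help.

EDIT: The filter selects the data either by a time range alone or by a time range plus a word filter on specific columns, and outputs the data into a resulting vector of strings.

My CSV file looks like this: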

The header is:

ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum

The data is:

523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4

std::ifstream input_file(strComplPath, std::ios::in);

int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }

    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
 

for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;

    // Check for faulty Data and replace it with an empty string 
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }

    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}

C++ CSV io

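I'm not sure exactly what the bottleneck in your program is (I copied your code from an earlier version of the question), but you have a lot of regexes, and you mix reading records with post-processing. I suggest that you create a `class` to hold one of these records, called `record`, overload `operator>>` for `record`, and then use `std::copy_if` on the file with a filter that you can design separately from the reading. Do the post-processing after you have read the records that pass the filter.

I made a small test, and reading 200k records from an old spinning disk takes 2 seconds while doing the filtering. I only used `time_lower_bound` and `time_upper_bound` to filter, and extra checks will of course make it a little slower, but it should not take minutes.

Example: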

#include <algorithm>
#include <chrono>
#include <ctime>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// the suggested class to hold a record
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};

// A free function to read a time_point from an `istream`:
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};
    // C++20:
    // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);

    // C++11 to C++17 version:
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if(is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}

// The operator>> overload to read one `record` from an `istream`:
std::istream& operator>>(std::istream& is, record& r) {
    is >> r.ID;
    r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}

// An operator<< overload to print one `record`:
std::ostream& operator<<(std::ostream& os, const record& r) {
    std::ostringstream oss;
    oss << r.ID;
    { // I only made a C++11 to C++17 version for this one:
        std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
        std::tm ts = *std::localtime(&time);
        oss << ';' << ts.tm_mday << '.' << ts.tm_mon + 1 << '.'
            << ts.tm_year + 1900 << ' ' << ts.tm_hour << ':' << ts.tm_min << ';';
    }
    oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
        << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
    return os << oss.str();
}

// The reading and filtering part of `main` would then look like this:
int main() { // not "void main()"
    std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
    std::istringstream time_upper_bound_s("20.05.2021 09:40:00");

    // Your time boundaries as `std::chrono::system_clock::time_point`s - 
    // again using the `to_tp` helper function:
    auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
    auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");

    // Verify that the boundaries were parsed ok:
    if(time_lower_bound == std::chrono::system_clock::time_point{} ||
       time_upper_bound == std::chrono::system_clock::time_point{}) {
        std::cerr << "failed to parse boundaries\n";
        return 1;
    }

    std::ifstream is("data"); // whatever your file is called
    if(is) {
        std::vector<record> recs; // a vector with all the records

        // create your filter
        auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
            // Only copy those `record`s within the set boundaries.
            // You can add additional conditions here too.
            return r.Timestamp >= time_lower_bound &&
                   r.Timestamp <= time_upper_bound;
        };

        // Copy those records that pass the filter:
        std::copy_if(std::istream_iterator<record>(is),
                     std::istream_iterator<record>{}, std::back_inserter(recs),
                     filter);

        // .. post process `recs` here ...

        // print result
        for(auto& r : recs) std::cout << r;
    }
}
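
The comment inside the filter notes that further conditions can be added. As a minimal sketch of what that could look like for the word-in-a-column filtering mentioned in the question: `record_view` and `range_and_word_filter` below are hypothetical names made up for illustration, and a plain substring search is only an assumption about what the real matching needs.

```cpp
#include <chrono>
#include <string>

// Reduced stand-in for the `record` struct (only the fields used here).
// Hypothetical type, not part of the answer's code.
struct record_view {
    std::chrono::system_clock::time_point Timestamp;
    std::string Description;
};

// Hypothetical filter object: a time range plus a literal word that must
// occur in the Description column.
struct range_and_word_filter {
    std::chrono::system_clock::time_point lower;
    std::chrono::system_clock::time_point upper;
    std::string word; // empty means "do not filter by word"

    bool operator()(const record_view& r) const {
        // Reject anything outside the closed interval [lower, upper].
        if (r.Timestamp < lower || r.Timestamp > upper) return false;
        // A literal substring search needs no regex object at all.
        return word.empty() || r.Description.find(word) != std::string::npos;
    }
};
```

An object of this kind can be passed as the last argument of `std::copy_if` in place of the lambda, provided its parameter type is changed to the full record type.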

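I really appreciate your help! I just tried your solution. I can compile it, but I can't get any output into the recs vector. I can see that there is input coming from the ifstream. Now I'm wondering how the copying actually works with ifstreams, since the arguments are a begin/end pair, a back inserter and the copy_if condition. But I can't get it to copy anything, not even with a plain copy.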

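0 upvotes Ted Lyngmo 1/11/2022

@DoItWithFlow You're welcome! "...I can compile it..." - Did you need to make changes to it for it to compile? It compiles cleanly for me with `clang++ -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-padded`. `std::copy_if` doesn't know that it is reading from a stream. That is what the `std::istream_iterator`s take care of. If you'd like, I can take a look at your implementation of my suggestion and see if I can spot where the problem is. Put it on pastebin.com or similar and I'll check it out.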
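
To illustrate how `std::copy_if` and `std::istream_iterator` divide the work, here is a minimal sketch that uses plain `int`s instead of records. `read_even` is a hypothetical helper written only for this illustration.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Reads whitespace-separated ints from any std::istream and keeps the even ones.
// std::copy_if only sees a pair of iterators; std::istream_iterator<int>
// performs `is >> value` each time it is constructed or incremented.
std::vector<int> read_even(std::istream& is) {
    std::vector<int> out;
    std::copy_if(std::istream_iterator<int>(is),  // reads the first value
                 std::istream_iterator<int>{},    // end-of-stream sentinel
                 std::back_inserter(out),
                 [](int v) { return v % 2 == 0; });
    return out;
}
```

Feeding this a `std::istringstream` holding `1 2 3 4 x 6` yields only 2 and 4: as soon as an extraction fails, the iterator compares equal to the end sentinel and the copy stops without any error message. The same thing happens with a record type whose `operator>>` fails on the very first line, which leaves the output vector empty.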

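1 upvote DoItWithFlow 1/11/2022

I did not have to make any changes for it to compile. This pastebin is not my final implementation, since the final one has a lot of custom data types that conform to the SPS-Controller. (Not to worry) pastebin.com/wbMdyQpC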

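0 upvotes Ted Lyngmo 1/11/2022

@DoItWithFlow Every change you make to the types in the `record` class requires a corresponding change to the `operator>>` overload. Since you changed `int ID` to `std::string RecordID;`, you need to change `is >> RecordID;` to `std::getline(is, r.RecordID, ';');`. That also affects the format string on the next line: `r.Timestamp = to_tp(is, "%d.%m.%Y %H:%M;");` (the initial `;` is removed). I also noticed that you extended the timestamp format string with fields that are not in the example data you provided. With these two changes, you'll copy all records with `std::copy`.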
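
The point about keeping `operator>>` in step with the field types can be sketched with a reduced record. `text_id_record` and its `Rest` member are hypothetical and exist only to keep the example short; the real overload would go on to parse the timestamp and the remaining columns.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical reduced record: the ID is stored as text instead of int.
struct text_id_record {
    std::string RecordID;
    std::string Rest; // remainder of the line, left unparsed in this sketch
};

std::istream& operator>>(std::istream& is, text_id_record& r) {
    // std::getline consumes the ';' delimiter, so whatever parses the next
    // field must no longer expect a leading ';' (unlike after `is >> int`,
    // which stops in front of the ';').
    std::getline(is, r.RecordID, ';');
    std::getline(is, r.Rest);
    return is;
}
```

With the sample line from the question, `RecordID` becomes `523` and the stream is positioned directly at `19.05.2021 12:15`, which is why the timestamp format string must then start with `%d` rather than `;`.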

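0 upvotes DoItWithFlow 1/12/2022

Alright. I got it working... sorry for the basic question. But I had to find out that the ODK-Server I'm using does not work with your solution for some reason (unfortunately there is very little documentation about the ODK-Server). I got it to work completely with console output, but as soon as I build it into the dll, the ODK-Server gives me an exception at execution. I'll have to dig into this or find a different approach... I hope you have a nice day.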

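1 upvote A M 1/11/2022 #2

Ted has already given the answer. I worked on a solution at the same time, so let me show it in addition.

I created test data with 500k records, and all of the parsing finishes in under 3 seconds on my machine.

I also created classes.

Speed is gained by using `std::move`, by increasing the input buffer size and by calling `reserve` on the `std::vector`.

Please see the additional solution below. I omitted the filtering, since Ted has already shown that.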

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <ctime>
#include <vector>
#include <chrono>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <cstdlib> // std::strtol

constexpr size_t MaxLines = 600'000u;
constexpr size_t NumberOfLines = 500'000u;
const std::string fileName{ "test.csv" };

// Dummy routine for writing a test file
void createFile() {
    if (std::ofstream ofs{ fileName }; ofs) {
        std::time_t ttt = 0;
        for (size_t k = 0; k < NumberOfLines; ++k) {
            std::time_t time = static_cast<time_t>(ttt);
            ttt += 1000;
            ofs << k << ';'
#pragma warning(suppress : 4996)
                << std::put_time(std::localtime(&time), "%d.%m.%Y  %H:%M") << ';'
                << k << ';'
                << "UserID" << k << ';'
                << "Area" << k << ';'
                << "Description" << k << ';'
                << "Comment" << k << ';'
                << "Checksum" << k << '\n';
        }
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing\n\n";
}


// We will create a bigger input buffer for our stream
constexpr size_t ifStreamBufferSize = 100'000u;
static char buffer[ifStreamBufferSize];


// Object oriented Model. Class for one record
struct Record {

    // Data
    long id{};
    std::tm time{};
    long objectId{};
    std::string userId{};
    std::string area{};
    std::string description{};
    std::string comment{};
    std::string checkSum{};

    // Methods
    // Extractor operator
    friend std::istream& operator >> (std::istream& is, Record& r) {

        // Read one complete line
        if (std::string line; std::getline(is, line)) {

            // Here we will store the parts of the line after the split
            std::vector<std::string> parts{};

            // Convert line to istringstream for further extraction of line parts
            std::istringstream iss{ line };

            // One part of a line
            std::string part{};
            bool wrongData = false;

            // Split
            while (std::getline(iss, part, ';')) {

                // Check for faulty data (fields starting with '#')
                if (part[0] == '#') {
                    is.setstate(std::ios::failbit);
                    break;
                }
                // add part
                parts.push_back(std::move(part));
            }
            // If all was OK
            if (is) {
                // If we have enough parts
                if (parts.size() == 8) {

                    // Convert parts to target data in record
                    r.id = std::strtol(parts[0].c_str(), nullptr, 10);

                    std::istringstream ss{parts[1]};
                    ss >> std::get_time(& r.time, "%d.%m.%Y  %H:%M");
                    if (ss.fail()) 
                        is.setstate(std::ios::failbit);

                    r.objectId = std::strtol(parts[2].c_str(), nullptr, 10);

                    r.userId = std::move(parts[3]);

                    r.area = std::move(parts[4]);

                    r.description = std::move(parts[5]);

                    r.comment = std::move(parts[6]);

                    r.checkSum = std::move(parts[7]);
                }
                else is.setstate(std::ios::failbit);
            }
        }
        return is;
    }
    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Record& r) {
        return os << r.id << "   "
#pragma warning(suppress : 4996)
            << std::put_time(&r.time, "%d.%m.%Y  %H:%M") << "   "  
            << r.objectId << "   " << r.userId << "   " << r.area << "   " << r.description << "   " << r.comment << "   " << r.checkSum;
    }
};

// Data will hold all records
struct Data {

    // Data part
    std::vector<Record> records{};

    // Constructor will reserve space to avoid reallocation
    Data() { records.reserve(MaxLines); }

    // Simple extractor. Will call Record's extractor
    friend std::istream& operator >> (std::istream& is, Data& d) {

        // Set bigger file buffer. This is a time saver
        is.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
        std::copy(std::istream_iterator<Record>(is), {}, std::back_inserter(d.records));
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Data& d) {
        std::copy(d.records.begin(), d.records.end(), std::ostream_iterator<Record>(os, "\n"));
        return os;
    }

};

int main() {
    // createFile();

    auto start = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);

    if (std::ifstream ifs{ fileName }; ifs) {

        Data data;

        // Start time measurement
        start = std::chrono::system_clock::now();

        // Read and parse complete data
        ifs >> data;

        // End of time measurement
        elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
        std::cout << "\nReading and splitting. Duration: " << elapsed.count() << " ms\n";

        // Some debug output
        std::cout << "\n\nNumber of read records:  " << data.records.size() << "\n\n";
        for (size_t k{}; k < std::min<size_t>(10, data.records.size()); ++k)
            std::cout << data.records[k] << '\n';
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading\n\n";
}
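
One caveat about the buffer: the `pubsetbuf` call in `Data`'s extractor runs after the file has already been opened. When and how `std::basic_filebuf` honours a user-supplied buffer is implementation-defined, and as far as I know some standard library implementations (libstdc++ among them) ignore the call once a file is open. A minimal sketch that sets the buffer before opening follows; `open_with_buffer` is a hypothetical helper, not part of the answer above, and whether it actually changes throughput should be measured on the target platform.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: installs `buf` as the stream buffer storage BEFORE the
// file is opened, then opens `path` for reading.
// `buf` is owned by the caller and must outlive `ifs`.
// Returns true if the file could be opened.
bool open_with_buffer(std::ifstream& ifs, const std::string& path,
                      std::vector<char>& buf) {
    ifs.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    ifs.open(path);
    return ifs.is_open();
}
```

A caller would create a `std::vector<char>` of the desired size and a default-constructed `std::ifstream`, call `open_with_buffer`, and then read from the stream as before.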

And yes, I used "ctime".
