如何提高正则表达式数组的查询性能？-解网

问：

我的程序旨在从包含指纹列表的文件中获取输入，此文件有超过 20,000 条指纹记录，文件如下所示：

abc     Body="123"
think       Header="think"
xxx         Response=="xxx"
fff         Body!="abc"

etc.

我需要做的是将这个文件加载到一个向量中（读取文件并动态生成正则表达式），然后按字符串搜索向量：

// abc  Body="123"
struct Rule
{
    std::string name; // abc
    std::string type; // Body
    std::string op;   // =, ==, !=, do regex matching according to op
    std::string rule; // "123"
    std::shared_ptr<re2::RE2> reg;
}


std::vector<Rule> rules = load('fingerprint.txt');

load 函数的近似内容如下：

std::vector<Rule> load (const std::string& file)
{
    std::vector<Rule> res;
    std::ifstream file(filename);
    std::string line;

    while(std::getline(file,line))
    {
        Rule rule;
        // line = abc   Body="123"
        // rule.name = "abc";
        // rule.type = "Body";
        // rule.op   = "=";
        // rule.rule = "\"123\"";
        rule = parse_line(line);
        rule.reg.reset(new re2::RE2(rule.rule));
        res.push_back(rule);
    }
    return res;
}

以下是我的查询函数：

struct QueryObject
{
    std::string header;
    std::string body;
    std::string title;
    std::string response;
}

// std::vector<Rule> rules = load('fingerprint.txt');

std::string get_query_str(const Rule& rule, const QueryObject& obj)
{
    if(rule.type=="Header") return obj.header;
    if(rule.type=="Body") return obj.body;
    etc...
    
}


std::vector<std::string> query(const QueryObject& obj)
{
    std::vector<std::string> result;
    std::string query_str;
    for(size_t i = 0; i < rules.size(); ++i)
    {
        const Rule &rule = rules[i];
        query_str = get_query_str(rule,obj);
        if (rule.op == "=" 
            && RE2::PartialMatch(query_str, *rule.reg))
        {
            result.push_back(rule.name);
        }
        else if(rule.op == "!=" 
                && !RE2::PartialMatch(query_str, *rule.reg))
        {
            result.push_back(rule.name);
        }
        else if(rule.op == "==" && query_str == rule.rule)
        {
            result.push_back(rule.name);
        }

    }
    return result;
}

我尝试过使用多线程，但改进并不显着。伪代码如下：

// reload query function
std::vector<std::string> query(const QueryObject& obj, size_t begin, size_t end)
{
    for(size_t i = begin; i < end; ++i)
    {
        //...
    }
}
// std::vector<Rule> rules = load('fingerprint.txt');
QueryObject obj;
auto task1 = thread_pool.enque(query, obj,0,rules.size()/2);
auto tast2 = thread_pool.enque(query, obj,rules.size()/2+1, rules.size());

// use task1.get(), task2.get() to get the results of query()

我测试过这个函数，查询一条记录需要很长时间。如何提高效率？

单线程	2 个线程
6980毫秒	6647毫秒

任何帮助都非常感谢。英语不是我的母语;请原谅我的任何错误。

更新：

编译器： g++ （Ubuntu 11.4.0-1ubuntu1~22.04） 11.4.0 构建工具：xmake 编译器命令：

xmake f -p linux -a x86_64 -m release
xmake build -w my_app

xmake 为 makefile 生成的 cxxflags 可能包含以下内容：

fingerprint_CXXFLAGS=-m64 -fvisibility=hidden -fvisibility-inlines-hidden -O2 -isystem /home/hsiao/.xmake/packages/f/fmt/10.1.1/include  -fomit-frame-pointer -DNDEBUG

xmake.lua：

add_rules("mode.debug", "mode.release")
add_requires("fmt","re2","spdlog","libcurl","openssl","gtest","nlohmann_json")
add_rules("plugin.compile_commands.autoupdate", {outputdir = ".vscode"})


if is_mode("debug") then 
    add_defines("DEBUG")
    set_symbols("debug")
    set_optimize("none")
end 


if is_mode("release") then 
    set_strip("all")
    add_cxflags("-fomit-frame-pointer")
    add_mxflags("-fomit-frame-pointer")
    set_optimize("faster")
end 

target("analyze_fingerprint")
    set_kind("binary")
    add_files("src/*.cpp")
    add_packages("fmt","re2","spdlog","nlohmann_json")
    set_targetdir("bin")

测量执行时间的代码：

QueryObject obj;
obj.title="ruijie";

auto begin = std::chrono::system_clock::now();
auto result = query(obj);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double, std::milli> duration = end - begin;

C++ 数组字符串

如果规则不包含任何特殊的正则表达式字符，如“.”或“*”，那么一个简单的字符串：：find 可能会更快（或者不，你需要测试它）。此外，将所有相同类型和 op 的正则表达式组合成一个巨大的正则表达式并匹配一次，而不是在循环中单独匹配每个正则表达式，可能会更快。也就是说，如果您有和，则匹配而不是进行两个单独的匹配可能是有益的。同样，您需要对此进行测试。Body="123"Body="456""123|456"

答： 暂无答案

上一个：如何提高加扰代码的性能？

下一个：从字符串文字初始化 char* 向量的好方法是什么？

如何提高正则表达式数组的查询性能？

How to improve query performance of regex array?

评论