提问人:programmerHsiao 提问时间:10/31/2023 最后编辑:programmerHsiao 更新时间:10/31/2023 访问量:101
如何提高正则表达式数组的查询性能?
How to improve query performance of regex array?
问:
我的程序旨在从包含指纹列表的文件中获取输入, 此文件有超过 20,000 条指纹记录,文件如下所示:
abc Body="123"
think Header="think"
xxx Response=="xxx"
fff Body!="abc"
etc.
我需要做的是将这个文件加载到一个向量中(读取文件并动态生成正则表达式),然后按字符串搜索向量:
// abc Body="123"
struct Rule
{
std::string name; // abc
std::string type; // Body
std::string op; // =, ==, !=, do regex matching according to op
std::string rule; // "123"
std::shared_ptr<re2::RE2> reg;
}
std::vector<Rule> rules = load('fingerprint.txt');
load 函数的近似内容如下:
std::vector<Rule> load (const std::string& file)
{
std::vector<Rule> res;
std::ifstream file(filename);
std::string line;
while(std::getline(file,line))
{
Rule rule;
// line = abc Body="123"
// rule.name = "abc";
// rule.type = "Body";
// rule.op = "=";
// rule.rule = "\"123\"";
rule = parse_line(line);
rule.reg.reset(new re2::RE2(rule.rule));
res.push_back(rule);
}
return res;
}
以下是我的查询函数:
struct QueryObject
{
std::string header;
std::string body;
std::string title;
std::string response;
}
// std::vector<Rule> rules = load('fingerprint.txt');
std::string get_query_str(const Rule& rule, const QueryObject& obj)
{
if(rule.type=="Header") return obj.header;
if(rule.type=="Body") return obj.body;
etc...
}
std::vector<std::string> query(const QueryObject& obj)
{
std::vector<std::string> result;
std::string query_str;
for(size_t i = 0; i < rules.size(); ++i)
{
const Rule &rule = rules[i];
query_str = get_query_str(rule,obj);
if (rule.op == "="
&& RE2::PartialMatch(query_str, *rule.reg))
{
result.push_back(rule.name);
}
else if(rule.op == "!="
&& !RE2::PartialMatch(query_str, *rule.reg))
{
result.push_back(rule.name);
}
else if(rule.op == "==" && query_str == rule.rule)
{
result.push_back(rule.name);
}
}
return result;
}
我尝试过使用多线程,但改进并不显着。伪代码如下:
// reload query function
std::vector<std::string> query(const QueryObject& obj, size_t begin, size_t end)
{
for(size_t i = begin; i < end; ++i)
{
//...
}
}
// std::vector<Rule> rules = load('fingerprint.txt');
QueryObject obj;
auto task1 = thread_pool.enque(query, obj,0,rules.size()/2);
auto tast2 = thread_pool.enque(query, obj,rules.size()/2+1, rules.size());
// use task1.get(), task2.get() to get the results of query()
我测试过这个函数,查询一条记录需要很长时间。如何提高效率?
单线程 | 2 个线程 |
---|---|
6980毫秒 | 6647毫秒 |
任何帮助都非常感谢。英语不是我的母语;请原谅我的任何错误。
更新:
编译器: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 构建工具:xmake 编译器命令:
xmake f -p linux -a x86_64 -m release
xmake build -w my_app
xmake 为 makefile 生成的 cxxflags 可能包含以下内容:
fingerprint_CXXFLAGS=-m64 -fvisibility=hidden -fvisibility-inlines-hidden -O2 -isystem /home/hsiao/.xmake/packages/f/fmt/10.1.1/include -fomit-frame-pointer -DNDEBUG
xmake.lua:
add_rules("mode.debug", "mode.release")
add_requires("fmt","re2","spdlog","libcurl","openssl","gtest","nlohmann_json")
add_rules("plugin.compile_commands.autoupdate", {outputdir = ".vscode"})
if is_mode("debug") then
add_defines("DEBUG")
set_symbols("debug")
set_optimize("none")
end
if is_mode("release") then
set_strip("all")
add_cxflags("-fomit-frame-pointer")
add_mxflags("-fomit-frame-pointer")
set_optimize("faster")
end
target("analyze_fingerprint")
set_kind("binary")
add_files("src/*.cpp")
add_packages("fmt","re2","spdlog","nlohmann_json")
set_targetdir("bin")
测量执行时间的代码:
QueryObject obj;
obj.title="ruijie";
auto begin = std::chrono::system_clock::now();
auto result = query(obj);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double, std::milli> duration = end - begin;
答: 暂无答案
评论
shared_ptr
rules.size()/2+1
Body="123"
Body="456"
"123|456"