Asked by: hanshenrik | Asked: 2/24/2023 | Last edited by: hanshenrik | Updated: 2/24/2023 | Views: 84
Why does DOMDocument take 80 minutes when XMLParser takes 10 seconds to parse 120MB of XML?
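Q:
I have a large 120MB XML file, and I wrote 2 parsers for it: one uses the DOMDocument API and takes about 80 minutes to parse it, the other uses the XML Parser API and takes about 10 seconds. (At some point I also wrote a SimpleXML-based parser, but it was no faster than DOMDocument, I think it was even slower, and I didn't keep the code.)
That is an insane performance difference! Why?
CPU: mid-range 2018 Intel laptop CPU (Intel Core i7-8565U)
PHP version: PHP 8.2.1
Because large text files are difficult to host, I zstd-compressed the file and base64-encoded it down to 8.7MB. The original XML can be obtained with: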
wget 'https://gist.githubusercontent.com/divinity76/02dd2b9ab596bd1fd31d1c2e5a72075a/raw/895cfb598f6c5ebdf5837e593113c1c072e7cb5e/ellos.xml.zstd.b64' -O- | \
base64 -d | \
zstd -d > ellos.xml
Integrity can be verified with b3sum:
cat ellos.xml | b3sum
3d27fe1ed272b91dc25f2f81c2920dec091f0e536866d5dedd74bd74fca303bf -
Both versions should be run with something like:
ini_set("memory_limit", "3G");
The DOMDocument-based parser code is:
function ellos_xml_file_to_item_array(string $xml_file_path): array
{
    // due to the large size of the xml file,
    // which API is used to parse it can be the difference between
    // an hour of parsing, or 10 seconds of parsing.
    // DOMDocument parser: ~80 minutes
    // alex-oleshkevich/php-fast-xml-parser (based on XMLParser): ~10 seconds
    // however XMLParser is difficult to use,
    // php-fast-xml-parser has issues with PHP>=8.2 and UTF-8 ( https://github.com/alex-oleshkevich/php-fast-xml-parser/issues/9 )
    // and we don't have time to fix it, so we use the slower DOMDocument parser
    $domd = new DOMDocument();
    $domd->load($xml_file_path);
    $itemsNodeList = $domd->getElementsByTagName("item");
    $items_parsed = [];
    foreach ($itemsNodeList as $itemNode) {
        $item = [];
        foreach ($itemNode->childNodes as $childNode) {
            if ($childNode->nodeType !== XML_ELEMENT_NODE) {
                if ($childNode->nodeType === XML_TEXT_NODE && trim($childNode->nodeValue) === "") {
                    // whitespace "nodes"
                    continue;
                }
                // investigate other node types
                throw new LogicException(var_export(["UNEXPECTED NODE TYPE", $childNode->nodeType, $childNode->nodeName, $childNode->nodeValue], true));
            }
            $item[$childNode->nodeName] = $childNode->nodeValue;
        }
        if (false) {
            // sample:
            $item = array(
                'Discount_Percentage' => '30%',
                'g:id' => '1650866-02',
                'g:availability' => 'in stock',
                'g:condition' => 'new',
                'g:description' => 'Väst i lätt vadderad modell med randig quiltstickning. • Rak modell • Dragkedja med skavskydd vid hakan • Hög krage • Infällda sidfickor med dragkedja • Smal ribbmudd nertill fram • Längd ca 67 i stl M',
                'g:image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=1661',
                'g:link' => 'https://www.ellos.se/jack-jones/vast-jjluke-light-vest/1650866-02-L',
                'g:title' => 'Jack & Jones - Väst jjLuke Light Vest',
                'g:price' => '599.00 SEK',
                'g:gtin' => '5715095009566',
                'g:mpn' => 'Jack & Jones1650866',
                'g:brand' => 'Jack & Jones',
                'g:additional_image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=66,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=33,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=665,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=342,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=152',
                'g:color' => 'Grön',
                'g:gender' => 'Male',
                'g:item_group_id' => '1650866',
                'g:product_type' => 'Herr>Mode>Jackor & rockar>Västar',
                'g:sale_price' => '419.00 SEK',
                'g:custom_label_0' => '',
            );
        }
        $items_parsed[] = $item;
        //echo "parsed: " . count($items_parsed) . "\r";
    }
    //echo "deserialised to php array, time: " . (microtime(true) - $t) . "s\n";
    return $items_parsed;
}
The XML Parser-based parser code is:
function ellos_xml_file_to_item_array_optimized(string $xml_file_path): array
{
    // this is a highly optimized, but difficult to read/maintain, parser.
    // the speed difference is INSANE on 120MB document:
    // DOMDocument parser: ~80 MINUTES,
    // xml_parser: ~10 seconds
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    xml_parse_into_struct($parser, file_get_contents($xml_file_path), $vals, $index);
    xml_parser_free($parser);
    unset($parser);
    $items_parsed = [];
    $is_in_item = false;
    $current_item_level = null;
    $item = [];
    foreach ($vals as $key => $val) {
        $type = $val["type"];
        $tag = $val["tag"] ?? null;
        if (!$is_in_item && $tag !== "item") {
            continue;
        }
        $level = $val["level"];
        if (!$is_in_item && $tag === "item") {
            assert($type === "open");
            $current_item_level = $level;
            $is_in_item = true;
            continue;
        }
        if ($tag === "item" && $type === "close") {
            // finished with an item.
            if (false) {
                // sample:
                $item = array(
                    'Discount_Percentage' => '40%',
                    'g:id' => '1650894-04',
                    'g:availability' => 'in stock',
                    'g:condition' => 'new',
                    'g:description' => '• Vid, finstickad modell • Rund halsringning • Nerhasad axel • Ribbstickade kanter ',
                    'g:image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=1661',
                    'g:link' => 'https://www.ellos.se/moss-copenhagen/troja-femme-mohair-o-pullover/1650894-04-XSS',
                    'g:title' => 'Moss Copenhagen - Tröja Femme Mohair O Pullover',
                    'g:price' => '799.00 SEK',
                    'g:gtin' => '5712808579095',
                    'g:mpn' => 'Moss Copenhagen1650894',
                    'g:brand' => 'Moss Copenhagen',
                    'g:additional_image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=66,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=33,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=665,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=342,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=152',
                    'g:color' => 'Brun',
                    'g:gender' => 'Female',
                    'g:item_group_id' => '1650894',
                    'g:product_type' => 'Dam>Mode>Tröjor & koftor>Stickade tröjor',
                    'g:sale_price' => '479.00 SEK',
                    'g:custom_label_0' => '',
                );
            }
            $items_parsed[] = $item;
            //echo "parsed items: " . count($items_parsed) . "\n";
            $item = [];
            $is_in_item = false;
            $current_item_level = null;
            continue;
        }
        $item[$tag] = $val["value"] ?? "";
    }
    return $items_parsed;
}
FWIW:
$t = microtime(true);
$domd->load($xml_file_path);
echo "DOMDocument load time: " . (microtime(true) - $t) . "s\n";
DOMDocument load time: 2.3969311714172s
$t = microtime(true);
$itemsNodeList = $domd->getElementsByTagName("item");
echo "DOMDocument getElementsByTagName time: " . number_format(microtime(true) - $t, 9) . "s\n";die();
DOMDocument getElementsByTagName time: 0.000005960s
$t = microtime(true);
xml_parse_into_struct($parser, file_get_contents($xml_file_path), $vals, $index);
echo "xml_parse_into_struct time: " . number_format(microtime(true) - $t, 9) . "s\n";die();
xml_parse_into_struct time: 4.495072126s
xml_parse_into_struct time: 2.977376938s
xml_parse_into_struct time: 2.852163076s
xml_parse_into_struct time: 2.776274920s
xml_parse_into_struct time: 2.799090862s
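The timings above cover the load and lookup calls but not the foreach traversal that follows them in the DOMDocument parser. A minimal sketch of timing each phase separately is below. The time_phase helper and the tiny generated document are assumptions for illustration, not part of the original benchmark; for the real file you would call $domd->load($xml_file_path) instead of loadXML().

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not from the original post): runs $fn once and
// returns [result, elapsed seconds] using the monotonic hrtime() clock.
function time_phase(callable $fn): array
{
    $start = hrtime(true);
    $result = $fn();
    return [$result, (hrtime(true) - $start) / 1e9];
}

// Small stand-in document so the sketch runs anywhere.
$xml = "<rss><channel>" . str_repeat("<item><a>1</a><b>2</b></item>", 1000) . "</channel></rss>";

$domd = new DOMDocument();
[, $loadTime] = time_phase(fn() => $domd->loadXML($xml));
[$nodeList, $lookupTime] = time_phase(fn() => $domd->getElementsByTagName("item"));

// Phase 3: the nested foreach that the DOMDocument parser performs.
[$count, $walkTime] = time_phase(function () use ($nodeList): int {
    $n = 0;
    foreach ($nodeList as $itemNode) {
        foreach ($itemNode->childNodes as $childNode) {
            $n++;
        }
    }
    return $n;
});

printf(
    "load: %.6fs\nlookup: %.6fs\nwalk: %.6fs (%d child nodes)\n",
    $loadTime,
    $lookupTime,
    $walkTime,
    $count
);
```

Running the same three phases against ellos.xml should show which of them accounts for the 80 minutes.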
Answers:
0 votes
Pedro Pires
2/24/2023
#1
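Looking at the benchmark code, we can note the following:
- When you use DOMDocument, you call $domd->getElementsByTagName("item"), which I think means it traverses the whole document to find every element tagged "item" and returns them in $itemsNodeList. You then iterate over each item again to construct the data array.
- With XMLParser, you do both steps in the same pass, at the point where you check if ($tag === "item" && $type === "close") { //finished with an item. This reduces the time complexity.
The XMLParser code is less readable, but with a little refactoring, such as moving code into functions that construct the elements and adding some comments, it can be a perfectly good part of the codebase. Hope I could shed some light.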
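The single-pass idea described above can also be sketched with the event-based handlers of the same XML Parser extension, where each item is finished the moment its closing tag arrives. This is only an illustration under assumptions: parse_items_single_pass and the sample feed are hypothetical, and this is not the poster's parser, which uses xml_parse_into_struct.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: build one array per <item> in a single pass.
function parse_items_single_pass(string $xml): array
{
    $items = [];
    $item = null;   // array while inside an <item>, null otherwise
    $field = null;  // name of the child element currently open
    $text = "";     // character data collected for $field

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$item, &$field, &$text): void {
            if ($name === "item") {
                $item = [];
            } elseif ($item !== null) {
                $field = $name;
                $text = "";
            }
        },
        function ($p, string $name) use (&$items, &$item, &$field, &$text): void {
            if ($name === "item") {
                $items[] = $item; // finished with an item
                $item = null;
            } elseif ($item !== null && $name === $field) {
                $item[$name] = $text;
                $field = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use (&$field, &$text): void {
            if ($field !== null) {
                $text .= $data; // text may arrive in several chunks
            }
        }
    );

    if (xml_parse($parser, $xml, true) !== 1) {
        $message = xml_error_string(xml_get_error_code($parser)) ?? "XML parse error";
        throw new RuntimeException($message);
    }
    return $items;
}

// Made-up sample shaped like the feed in the question.
$xml = <<<'XML'
<rss xmlns:g="http://base.google.com/ns/1.0"><channel>
<item><Discount_Percentage>30%</Discount_Percentage><g:id>1650866-02</g:id><g:brand>Jack &amp; Jones</g:brand></item>
<item><g:id>1650894-04</g:id><g:custom_label_0></g:custom_label_0></item>
</channel></rss>
XML;

$items = parse_items_single_pass($xml);
var_export($items);
```

For a 120MB file, the same handlers could be fed with repeated xml_parse() calls on chunks read from the file, which avoids holding the whole document in memory at once.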
Comments
0 votes
hanshenrik
2/24/2023
FWIW $domd->getElementsByTagName("item") completes in 0.000005 seconds (5 microseconds), so that shouldn't matter ^^ - top post updated
0 votes
Pedro Pires
2/25/2023
Ouch, that rather undermines my answer