Why does DOMDocument use 80 minutes when XMLParser uses 10 seconds to parse 120MB of XML?

Asked by: hanshenrik  Asked: 2/24/2023  Last edited by: hanshenrik  Updated: 2/24/2023  Views: 84

Q:

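I have a big 120MB XML file, and I wrote 2 parsers for it: one uses the DOMDocument API and takes about 80 minutes to parse the file, the other uses the XML Parser API and takes about 10 seconds. (At some point I also wrote a SimpleXML-based parser, but it was no faster than DOMDocument, I think it was even slower, and I didn't keep the code.)

That is an insane performance difference! Why?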

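CPU: mid-range 2018 Intel laptop CPU (Intel Core i7-8565U)

PHP version: PHP 8.2.1

Why is there such an insane performance difference? Because large text files are awkward to host, I zstd-compressed the file and base64-encoded it, which brings it down to 8.7MB. The original XML can be obtained with: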

wget 'https://gist.githubusercontent.com/divinity76/02dd2b9ab596bd1fd31d1c2e5a72075a/raw/895cfb598f6c5ebdf5837e593113c1c072e7cb5e/ellos.xml.zstd.b64' -O- | \
base64 -d | \
zstd -d > ellos.xml

Integrity can be confirmed with b3sum:

cat ellos.xml | b3sum
3d27fe1ed272b91dc25f2f81c2920dec091f0e536866d5dedd74bd74fca303bf  -

Both versions should be run with something like:

ini_set("memory_limit", "3G");

The DOMDocument-based parser code is:

function ellos_xml_file_to_item_array(string $xml_file_path): array
{
    // due to the large size of the xml file,
    // which API is used to parse it can be the difference between
    // an hour of parsing, or 10 seconds of parsing.
    // DOMDocument parser: ~80 minutes
    // alex-oleshkevich/php-fast-xml-parser (based on XMLParser): ~10 seconds
    // however XMLParser is difficult to use,
    // php-fast-xml-parser has issues with PHP>=8.2 and UTF-8 ( https://github.com/alex-oleshkevich/php-fast-xml-parser/issues/9 )
    // and we don't have time to fix it, so we use the slower DOMDocument parser
    $domd = new DOMDocument();
    $domd->load($xml_file_path);
    $itemsNodeList = $domd->getElementsByTagName("item");
    $items_parsed = [];
    foreach ($itemsNodeList as $itemNode) {
        $item = [];
        foreach ($itemNode->childNodes as $childNode) {
            if ($childNode->nodeType !== XML_ELEMENT_NODE) {
                if ($childNode->nodeType === XML_TEXT_NODE && trim($childNode->nodeValue) === "") {
                    // whitespace "nodes"
                    continue;
                }
                // investigate other node types
                throw new LogicException(var_export(["UNEXPECTED NODE TYPE", $childNode->nodeType, $childNode->nodeName, $childNode->nodeValue], true));
            }
            $item[$childNode->nodeName] = $childNode->nodeValue;
        }
        if (false) {
            // sample:
            $item = array(
                'Discount_Percentage' => '30%',
                'g:id' => '1650866-02',
                'g:availability' => 'in stock',
                'g:condition' => 'new',
                'g:description' => 'Väst i lätt vadderad modell med randig quiltstickning.  •  Rak modell •  Dragkedja med skavskydd vid hakan •  Hög krage •  Infällda sidfickor med dragkedja •  Smal ribbmudd nertill fram •  Längd ca 67 i stl M',
                'g:image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=1661',
                'g:link' => 'https://www.ellos.se/jack-jones/vast-jjluke-light-vest/1650866-02-L',
                'g:title' => 'Jack & Jones - Väst jjLuke Light Vest',
                'g:price' => '599.00 SEK',
                'g:gtin' => '5715095009566',
                'g:mpn' => 'Jack & Jones1650866',
                'g:brand' => 'Jack & Jones',
                'g:additional_image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=66,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=33,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=665,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=342,http://assets.ellosgroup.com/i/ellos/ell_1650866-02_Fs?w=152',
                'g:color' => 'Grön',
                'g:gender' => 'Male',
                'g:item_group_id' => '1650866',
                'g:product_type' => 'Herr>Mode>Jackor & rockar>Västar',
                'g:sale_price' => '419.00 SEK',
                'g:custom_label_0' => '',
            );
        }
        $items_parsed[] = $item;
        //echo "parsed: " . count($items_parsed) . "\r";
    }
    //echo "deserialised to php array, time: " . (microtime(true) - $t) . "s\n";
    return $items_parsed;
}

The XML Parser-based parser code is:

function ellos_xml_file_to_item_array_optimized(string $xml_file_path): array
{
    // this is a highly optimized, but difficult to read/maintain, parser.
    // the speed difference is INSANE on 120MB document:
    // DOMDocument parser: ~80 MINUTES,
    // xml_parser: ~10 seconds
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    xml_parse_into_struct($parser, file_get_contents($xml_file_path), $vals, $index);
    xml_parser_free($parser);
    unset($parser);
    $items_parsed = [];
    $is_in_item = false;
    $current_item_level = null;
    $item = [];
    foreach ($vals as $key => $val) {
        $type = $val["type"];
        $tag = $val["tag"] ?? null;
        if (!$is_in_item && $tag !== "item") {
            continue;
        }
        $level = $val["level"];
        if (!$is_in_item && $tag === "item") {
            assert($type === "open");
            $current_item_level = $level;
            $is_in_item = true;
            continue;
        }
        if ($tag === "item" && $type === "close") {
            // finished with an item.
            if (false) {
                // sample:
                $item = array(
                    'Discount_Percentage' => '40%',
                    'g:id' => '1650894-04',
                    'g:availability' => 'in stock',
                    'g:condition' => 'new',
                    'g:description' => '•  Vid, finstickad modell •  Rund halsringning •  Nerhasad axel •  Ribbstickade kanter ',
                    'g:image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=1661',
                    'g:link' => 'https://www.ellos.se/moss-copenhagen/troja-femme-mohair-o-pullover/1650894-04-XSS',
                    'g:title' => 'Moss Copenhagen - Tröja Femme Mohair O Pullover',
                    'g:price' => '799.00 SEK',
                    'g:gtin' => '5712808579095',
                    'g:mpn' => 'Moss Copenhagen1650894',
                    'g:brand' => 'Moss Copenhagen',
                    'g:additional_image_link' => 'http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=66,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=33,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=665,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=342,http://assets.ellosgroup.com/i/ellos/ell_1650894-04_Fs?w=152',
                    'g:color' => 'Brun',
                    'g:gender' => 'Female',
                    'g:item_group_id' => '1650894',
                    'g:product_type' => 'Dam>Mode>Tröjor & koftor>Stickade tröjor',
                    'g:sale_price' => '479.00 SEK',
                    'g:custom_label_0' => '',
                );
            }
            $items_parsed[] = $item;
            //echo "parsed items: " . count($items_parsed) . "\n";
            $item = [];
            $is_in_item = false;
            $current_item_level = null;
            continue;
        }
        $item[$tag] = $val["value"] ?? "";
    }
    return $items_parsed;
}

FWIW:

    $t = microtime(true);
    $domd->load($xml_file_path);
    echo "DOMDocument load time: " . (microtime(true) - $t) . "s\n";

DOMDocument load time: 2.3969311714172s

    $t = microtime(true);
    $itemsNodeList = $domd->getElementsByTagName("item");
    echo "DOMDocument getElementsByTagName time: " . number_format(microtime(true) - $t, 9) . "s\n";die();

DOMDocument getElementsByTagName time: 0.000005960s

    $t = microtime(true);
    xml_parse_into_struct($parser, file_get_contents($xml_file_path), $vals, $index);
    echo "xml_parse_into_struct time: " . number_format(microtime(true) - $t, 9) . "s\n";die();

xml_parse_into_struct time: 4.495072126s
xml_parse_into_struct time: 2.977376938s
xml_parse_into_struct time: 2.852163076s
xml_parse_into_struct time: 2.776274920s
xml_parse_into_struct time: 2.799090862s
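
The figures above put DOMDocument::load at about 2.4 seconds and getElementsByTagName at a few microseconds, which leaves the foreach loop over the node list as the part that has not been timed. A minimal sketch for timing the three phases separately could look like the following. The helper names make_sample_feed(), timed() and time_dom_phases() are hypothetical, the urn:example:g namespace is a placeholder, and the synthetic 2000-item feed only stands in for ellos.xml, so absolute numbers will not match the 120MB file.

```php
<?php
// Hypothetical helper: builds a small synthetic feed in place of ellos.xml.
// The namespace URI is a placeholder, not the one used by the real feed.
function make_sample_feed(int $item_count): string
{
    $xml = '<rss xmlns:g="urn:example:g"><channel>' . "\n";
    for ($i = 0; $i < $item_count; ++$i) {
        $xml .= "<item><g:id>{$i}</g:id><g:title>Item {$i}</g:title></item>\n";
    }
    return $xml . "</channel></rss>\n";
}

// Runs one callable and returns [result, elapsed seconds].
function timed(callable $fn): array
{
    $t = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $t];
}

// Times load, lookup and iteration as three separate phases.
function time_dom_phases(string $xml): array
{
    $domd = new DOMDocument();
    [, $load] = timed(fn() => $domd->loadXML($xml));
    [$list, $lookup] = timed(fn() => $domd->getElementsByTagName("item"));
    [$fields, $iterate] = timed(function () use ($list) {
        $n = 0;
        foreach ($list as $itemNode) {
            foreach ($itemNode->childNodes as $childNode) {
                if ($childNode->nodeType === XML_ELEMENT_NODE) {
                    ++$n; // count element children, as the real parser would read them
                }
            }
        }
        return $n;
    });
    return ["load" => $load, "lookup" => $lookup, "iterate" => $iterate, "fields" => $fields];
}

$timings = time_dom_phases(make_sample_feed(2000));
foreach (["load", "lookup", "iterate"] as $phase) {
    echo $phase . ": " . number_format($timings[$phase], 6) . "s\n";
}
```

Running this with a few different item counts shows how the "iterate" phase grows relative to "load", which is the comparison the measurements in the post do not yet cover.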

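php performance xml parsing domdocument libxml2

Comments

0 votes CBroe 2/24/2023
DOMDocument needs to read the whole document at once, whereas the XML Parser API is stream-based, if I remember correctly.
0 votes hanshenrik 2/24/2023
@CBroe You can use the "XML Parser API" in a stream-based way, but that is not what I am doing here; I read the whole file in one go: xml_parse_into_struct($parser, file_get_contents($xml_file_path), $vals, $index);
0 votes rustyx 2/24/2023
Try timing only the parsing part: DOMDocument::load vs. xml_parse_into_struct.
0 votes hanshenrik 2/24/2023
@rustyx Thanks, added to the post: DOMDocument::load takes about 2.3 seconds, xml_parse_into_struct() takes about 4.5 seconds. The problem is not there.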
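
For reference, the stream-based use of the XML Parser API mentioned in the comments (which the question's code does not use) might look roughly like this. It is a sketch only: stream_items() and the chunk size are illustrative, it assumes each item contains a flat list of child elements as in the sample arrays, and it does not set XML_OPTION_SKIP_WHITE, so whitespace inside values is kept as-is.

```php
<?php
// Sketch: feed the file to the parser chunk by chunk and collect <item>
// elements through handlers, instead of calling xml_parse_into_struct().
// Assumes items hold a flat list of child elements (no deeper nesting).
function stream_items(string $xml_file_path, int $chunk_size = 65536): array
{
    $items = [];
    $item = null;  // null while outside an <item>
    $field = null; // name of the child element currently being read

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$item, &$field) {
            if ($name === "item") {
                $item = [];
            } elseif ($item !== null) {
                $field = $name;
                $item[$field] = "";
            }
        },
        function ($p, string $name) use (&$items, &$item, &$field) {
            if ($name === "item") {
                $items[] = $item; // finished with an item
                $item = null;
            }
            $field = null;
        }
    );
    // Character data may arrive in several pieces, so append rather than assign.
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use (&$item, &$field) {
            if ($item !== null && $field !== null) {
                $item[$field] .= $data;
            }
        }
    );

    $fh = fopen($xml_file_path, "rb");
    if ($fh === false) {
        throw new RuntimeException("cannot open " . $xml_file_path);
    }
    while (!feof($fh)) {
        $chunk = (string) fread($fh, $chunk_size);
        if (xml_parse($parser, $chunk, feof($fh)) !== 1) {
            $error = xml_error_string(xml_get_error_code($parser));
            fclose($fh);
            throw new RuntimeException("XML error: " . $error);
        }
    }
    fclose($fh);
    return $items;
}

// Usage on a tiny temporary file; the 16-byte chunk size is only there
// to exercise the chunked path.
$tmp = tempnam(sys_get_temp_dir(), "feed");
file_put_contents(
    $tmp,
    '<rss xmlns:g="urn:example:g"><channel><title>ignored</title>'
    . '<item><g:id>1</g:id><g:title>Jack &amp; Jones</g:title></item>'
    . '<item><g:id>2</g:id><g:title></g:title></item>'
    . '</channel></rss>'
);
$items = stream_items($tmp, 16);
unlink($tmp);
var_export($items);
```

Because only one chunk and the items collected so far are held in memory, this variant avoids both file_get_contents() on the whole file and the large $vals array.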

A:

0 votes Pedro Pires 2/24/2023 #1

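Looking at the benchmark code, we can notice the following:

  • When you use DOMDocument, you call $domd->getElementsByTagName("item"), which I think means it traverses the whole document to find all elements tagged "item" and returns them in $itemsNodeList; you then iterate over every item again to build the data array.
  • With XMLParser, you do both steps in the same pass (if ($tag === "item" && $type === "close") {), which reduces the time complexity.

The XMLParser code is less readable, but with a little refactoring, such as moving the code under //finished with an item. into a function that constructs the element and adding some comments, it could be a valuable part of the codebase. I hope I could shed some light.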
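
One possible shape for that refactor is sketched below. It is not the asker's code: collect_items() and parse_items() are hypothetical names, and the sketch only keeps child elements that xml_parse_into_struct() reports as "complete", which matches the flat item structure in the sample arrays.

```php
<?php
// Sketch of the suggested refactor: the loop over the
// xml_parse_into_struct() output lives in its own function.
function collect_items(array $vals): array
{
    $items = [];
    $item = null; // null while outside an <item>
    foreach ($vals as $val) {
        $tag = $val["tag"] ?? null;
        $type = $val["type"];
        if ($tag === "item") {
            if ($type === "open") {
                $item = [];        // a new item starts
            } elseif ($type === "close") {
                $items[] = $item;  // finished with an item
                $item = null;
            }
            continue;
        }
        if ($item !== null && $type === "complete") {
            // empty elements have no "value" key, so fall back to ""
            $item[$tag] = $val["value"] ?? "";
        }
    }
    return $items;
}

// Parses an XML string with the same parser options as the question's code.
function parse_items(string $xml): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
    if (xml_parse_into_struct($parser, $xml, $vals, $index) !== 1) {
        throw new RuntimeException(
            "XML error: " . xml_error_string(xml_get_error_code($parser))
        );
    }
    return collect_items($vals);
}

// Usage with a tiny inline document; the namespace URI is a placeholder.
$items = parse_items(
    '<rss xmlns:g="urn:example:g"><channel>'
    . '<item><g:id>1650866-02</g:id><g:custom_label_0></g:custom_label_0></item>'
    . '<item><g:id>1650894-04</g:id><g:color>Brun</g:color></item>'
    . '</channel></rss>'
);
var_export($items);
```

Keeping collect_items() free of parser setup means it can be tested against a hand-written $vals array without touching any XML.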

Comments

0 votes hanshenrik 2/24/2023
FWIW $domd->getElementsByTagName("item") completes in 0.000005 seconds (5 microseconds), so that should not matter ^^ - top post updated
0 votes Pedro Pires 2/25/2023
Ouch, that undermines my answer quite a bit.