Remove all attributes not in whitelist from all HTML tags

Question:

So far I have only managed to keep one attribute, but I am trying to keep both the class and id attributes in the HTML tags.

Code:

$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $string);

Output:

<div class="someClassName">Some text <a class="classLink">link</a> with only the class and id  attrtibutes.</div>

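I am trying to remove every attribute except class and id from every tag.

Using DOMDocument() for some reason adds extra p tags to the output, and I believe XPath is faster?

Tags: php, html, attributes, domdocument, sanitization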

Answer:

$html = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

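$dom = new DOMDocument();
// Parse as a fragment: no implied <html>/<body> wrappers, no default doctype
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Visit every element in the document
foreach ($xpath->query('//*') as $node) {
    // Walk the attributes backwards so a removal does not shift the ones still to check
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        // Drop anything that is not on the whitelist
        if (!in_array($attr->name, ['id', 'class'], true)) {
            $node->removeAttribute($attr->name);
        }
    }
}
echo $dom->saveHTML();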

Output:

<div id="one-id" class="someClassName">Some text <a id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>

(...actually, XPath isn't really needed here, because we are iterating over every node in the DOM anyway.)

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// '*' matches every element, so no XPath query is required
foreach ($dom->getElementsByTagName('*') as $node) {
    // Walk the attributes backwards so a removal does not shift the ones still to check
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        if (!in_array($attr->name, ['id', 'class'], true)) {
            $node->removeAttribute($attr->name);
        }
    }
}
echo $dom->saveHTML();
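
On the extra tags mentioned in the question: the two libxml flags passed to loadHTML() are what keep the output limited to the fragment. The sketch below parses the same markup with and without them, purely as an illustration; the variable names and the shortened sample string are made up for this comparison.

```php
<?php
$html = '<div id="one-id" title="x">Some text</div>';

// Default parsing: libxml treats the string as a whole document and
// adds a doctype plus <html>/<body> wrappers around it.
$full = new DOMDocument();
$full->loadHTML($html);
$fullOutput = $full->saveHTML();

// Fragment parsing: the same flags as in the snippets above.
$fragment = new DOMDocument();
$fragment->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fragmentOutput = $fragment->saveHTML();

var_dump(strpos($fullOutput, '<!DOCTYPE') !== false);  // bool(true)
var_dump(strpos($fullOutput, '<html>') !== false);     // bool(true)
var_dump(strpos($fragmentOutput, '<html>') !== false); // bool(false)
echo $fragmentOutput; // <div id="one-id" title="x">Some text</div>
```

One caveat, stated as an assumption to check against your own input: LIBXML_HTML_NOIMPLIED expects a single top-level element, so a fragment with several sibling elements at the root may come back restructured. Wrapping the fragment in one container element before parsing is a common workaround.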

Trying to parse valid HTML with regex will be one or more of the following:

Inaccurate/unreliable,
Convoluted/verbose,
Hard to read,
Hard to maintain

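Regex doesn't know the difference between a tag and text that merely looks like a tag. What if the HTML tags and attributes mix uppercase and lowercase? What if single quotes, double quotes, and/or backticks are used? What if an attribute has no assigned value (e.g. readonly or checked)? What if an attribute name ends with data-id or title? What if an attribute value contains a quote character that is escaped rather than HTML-encoded? What if text looks like an opening tag but isn't a tag at all?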

These are all legitimate reasons to avoid parsing valid HTML with regex.
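
To make one of these failure modes concrete, here is a minimal sketch that runs the pattern from the question and a DOM-based helper over the same input. The helper name keepOnlyAttributes, its $allowed parameter, and the sample string are hypothetical and only wrap the loop shown earlier; the input is valid HTML whose title value happens to contain a ">" character.

```php
<?php
// Hypothetical helper (not part of the original answer): keep only the
// attributes whose names appear in $allowed, using the DOM loop from above.
function keepOnlyAttributes(string $html, array $allowed = ['id', 'class']): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    foreach ($dom->getElementsByTagName('*') as $node) {
        // Walk backwards so removals do not shift the remaining indexes.
        for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
            $name = $node->attributes->item($i)->name;
            if (!in_array($name, $allowed, true)) {
                $node->removeAttribute($name);
            }
        }
    }
    // saveHTML() appends a newline; trim it for easier comparison.
    return trim($dom->saveHTML());
}

// A ">" inside a quoted attribute value is legal, but [^>]* stops at it.
$tricky = '<a title="a > b" id="x">link</a>';

// The pattern from the question, unchanged.
$viaRegex = preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $tricky);
$viaDom = keepOnlyAttributes($tricky);

echo $viaRegex, "\n"; // <a> b" id="x">link</a>
echo $viaDom, "\n";   // <a id="x">link</a>
```

The regex ends the tag at the first ">" it sees, so the rest of the title value and the id attribute leak into the text. The parser reads the quoted value as a whole and removes only the title attribute.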
