Asked by: th3_sh0w3r · Asked: 9/21/2023 · Updated: 9/22/2023 · Views: 58
PHP strip_tags: regex for replacing a lone < with its HTML entity
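Q:
I am using strip_tags to make sure every HTML tag is removed before a string is saved.
Now I have the problem that a single < without any closing tag is removed as well.
The idea now is to replace every single < with the matching HTML entity &lt;. I have a regex for this, but it only replaces the first occurrence. Any idea how I can adjust it?
This is the regex I have right now: preg_replace("/<([^>]*(<|$))/", "&lt;$1", $string);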
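I want this:
<p> Hello < 30 </p> < < < <!-- Test --> <> > > >
to first become this, via preg_replace(REGEX, REPLACE, $string):
<p> Hello &lt; 30 </p> &lt; &lt; &lt; <!-- Test --> &lt;&gt; &gt; &gt; &gt;
and then this, after strip_tags($string):
Hello &lt; 30 &lt; &lt; &lt; &lt;&gt; &gt; &gt; &gt;
Any idea how I can achieve this?
Maybe you even know a better way.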
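On the "only replaces the first occurrence" symptom: the capture group in the question's pattern consumes the text after the matched <, including any later <, so those characters cannot start a match of their own. A minimal sketch of the effect follows, together with an assumed variant that moves the same condition into a lookahead. The variant is only an illustration, not a complete solution: it still leaves <> untouched and treats "a < b > c" as a tag.

```php
<?php
// Sample text: three lone "<" characters and no real tags.
$string = 'a < b < c < d';

// Regex from the question: the capture group consumes everything up to the
// end of the string, so only one match (and one replacement) can happen.
$consuming = preg_replace('/<([^>]*(<|$))/', '&lt;$1', $string);

// Assumed variant: the same condition as a lookahead, which inspects the
// following text without consuming it, so every lone "<" can match.
$lookahead = preg_replace('/<(?=[^>]*(?:<|$))/', '&lt;', $string);

echo $consuming, "\n"; // a &lt; b < c < d
echo $lookahead, "\n"; // a &lt; b &lt; c &lt; d
```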
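A:
Your question is interesting, so I took the time to try to solve it. I think the only way is to do it in several steps:
The first step is to remove the HTML comments.
The next step is to match all HTML tags with a regular expression and rewrite them in another form, replacing the < and > characters with something else, such as [[ and ]] respectively.
After that, you can replace < with &lt; and > with &gt;.
Then we replace our [[tag attr="value"]] and [[/tag]] with the original HTML tags <tag attr="value"> and </tag>.
We can now use strip_tags() or a safer and more flexible library (such as HTMLPurifier) to strip the HTML tags we want.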
PHP code
Sorry, the syntax highlighting seems to be wrong because I use Nowdoc strings to make editing easier:
<?php
define('LINE_LENGTH', 60);
// A regular expression to find HTML tags.
define('REGEX_HTML_TAG', <<<'END_OF_REGEX'
~
<(?!!--) # Opening of a tag, but not for HTML comments.
(?<tagcontent> # Capture the text between the "<" and ">" chars.
\s*/?\s* # Optional spaces, optional slash (for closing tags).
(?<tagname>[a-z][a-z0-9]*\b) # The tag name (letters, then optional digits, e.g. h1).
(?<attributes> # The tag attributes. Handle a possible ">" in one of them.
(?:
(?<quote>["']).*?\k<quote>
| # Or
[^>] # Any char not being ">"
)*
)
\s*/?\s* # For self-closing tags such as <img .../>.
)
> # Closing of a tag.
~isx
END_OF_REGEX);
// A regular expression to find double-bracketed tags.
define('REGEX_BRACKETED_TAG', <<<'END_OF_REGEX'
~
\[\[ # Opening of a bracketed tag.
(?<tagcontent> # Capture the text between the brackets.
.*?
)
\]\] # Closing of a bracketed tag.
~xs
END_OF_REGEX);
$html = <<<'END_OF_HTML'
<p> Hello < 30 </p> < < < <!-- Test --> <> > > >
<p><span class="icon icon-print">print</SPAN></p>
<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" /><!-- with self-closing slash -->
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age"><!-- without self-closing slash -->
Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>
<nav class="floating-nav">
<ul class="menu">
<li><a href="/">Home</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</nav>
Test with spaces: <
textarea id="text"
name="text"
class="decorative"
>Some text< / textarea>
END_OF_HTML;
/**
* Just to print a title or a step of the operations.
*
* @param string $text The text to print.
* @param bool $is_a_step If set to false then no step counter will be printed
* and incremented.
* @return void
*/
function printTitle($text, $is_a_step = true) {
    static $counter = 1;
    if ($is_a_step) {
        print "\n\nSTEP $counter : $text\n";
        $counter++;
    } else {
        print "\n\n$text\n";
    }
    print str_repeat('=', LINE_LENGTH) . "\n\n";
}
printTitle('Input HTML:', false);
print $html;
printTitle('Strip out HTML comments');
// The "s" flag lets "." match newlines, so multi-line comments are removed too.
$output = preg_replace('/<!--.*?-->/s', '', $html);
print $output;
printTitle('replace all HTML tags by [[tag]]');
// preg_replace() cannot reference named groups in the replacement string
// (preg_replace_callback() can read them from its match array), so we use $1.
$output = preg_replace(REGEX_HTML_TAG, '[[$1]]', $output);
print $output;
printTitle('replace all < and > by &lt; and &gt;');
$output = htmlspecialchars($output, ENT_HTML5); // ENT_HTML5 will leave single and double quotes.
print $output;
printTitle('replace back [[tag]] by <tag>');
$output = preg_replace(REGEX_BRACKETED_TAG, '<$1>', $output);
print $output;
printTitle('Strip the HTML tags with strip_tags()');
$output = strip_tags($output);
print $output;
// It seems that strip_tags() doesn't always do its job,
// so let's see whether any HTML tags are left.
printTitle('Check what strip_tags() did');
if (preg_match_all(REGEX_HTML_TAG, $output, $matches, PREG_SET_ORDER)) {
    print "Oups! strip_tags() didn't clean up everything!\n";
    print "Found " . count($matches) . " occurrences:\n";
    foreach ($matches as $i => $match) {
        $indented_match = preg_replace('/\r?\n/', '$0 ', $match[0]);
        print '- match ' . ($i + 1) . " : $indented_match\n";
    }
    print "\n\nLet's try to do it ourselves, by replacing the matched tags by nothing.\n\n";
    $output = preg_replace(REGEX_HTML_TAG, '', $output);
    print $output;
}
else {
    print "Ok, no tag found.\n";
}
You can run it here: https://onlinephp.io/c/005a3
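For reuse, the steps of the script can also be wrapped in a single function. The sketch below is only one possible arrangement: the function name stripTagsKeepLoneBrackets and the single-line tag pattern are assumptions made for illustration, not part of the answer. It uses preg_replace_callback() so that the named group can be read directly. Two limitations carry over from the approach itself: a literal [[...]] in the input is turned into a tag, and existing entities such as &amp; get encoded a second time.

```php
<?php
// Hypothetical helper (not from the answer): escapes "<" and ">" that are not
// part of something that looks like an HTML tag, then strips the tags.
function stripTagsKeepLoneBrackets(string $html): string
{
    // Simplified single-line variant of the answer's tag pattern (assumption).
    $tagPattern = '~<(?!!--)(?<inner>\s*/?\s*[a-z][a-z0-9]*\b(?:(?<q>["\']).*?\k<q>|[^>])*)>~is';

    // Step 1: drop HTML comments.
    $out = preg_replace('/<!--.*?-->/s', '', $html);

    // Step 2: rewrite <tag ...> as [[tag ...]]; the callback reads the named group.
    $out = preg_replace_callback($tagPattern, function (array $m) {
        return '[[' . $m['inner'] . ']]';
    }, $out);

    // Step 3: escape the remaining "<", ">" and "&"; quotes are left alone.
    $out = htmlspecialchars($out, ENT_NOQUOTES | ENT_HTML5);

    // Step 4: restore the bracketed tags.
    $out = preg_replace('~\[\[(.*?)\]\]~s', '<$1>', $out);

    // Step 5: strip the tags, then remove whatever strip_tags() left behind.
    $out = strip_tags($out);
    return preg_replace($tagPattern, '', $out);
}

echo stripTagsKeepLoneBrackets('<p>Hello < 30</p> <> <b class=">">x</b>'), "\n";
// Hello &lt; 30 &lt;&gt; x
```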
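For the regular expressions, I used ~ instead of the usual / to delimit the pattern and the flags. This is just so that we can use the slash in the pattern without escaping it.
I also used the x flag (extended notation) so that I can put comments in my pattern and write it on several lines.
For readability and flexibility, I also used named capturing groups, such as (?<quote>), so that we don't depend on indexes, which could shift if we add other capturing groups. The backreference is then written \k<quote> instead of the indexed version \4.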
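A small illustration of these three notations together. The pattern below is not the one from the script; it is a made-up example that matches a single quoted attribute, and the group names name and value are chosen here for the demo.

```php
<?php
// Illustrative pattern (an assumption, not the answer's regex): matches one
// attribute such as href="/a/b" or title='x'.
$pattern = <<<'REGEX'
~
  (?<name>[a-z-]+)      # Attribute name.
  \s*=\s*
  (?<quote>["'])        # Opening quote, remembered as "quote".
  (?<value>.*?)         # Attribute value (lazy).
  \k<quote>             # The same quote character closes the value.
~ix
REGEX;

// The slashes in the value need no escaping because "~" is the delimiter,
// and the apostrophe does not end the value because it opened with '"'.
preg_match($pattern, '<a href="/it\'s/here">', $m);
echo $m['name'], ' => ', $m['value'], "\n"; // href => /it's/here
```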
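HTML5 seems to be rather permissive, since the > character can apparently be put in an attribute value without replacing it with &gt;. I suppose this was not allowed in the past and became "OK/accepted" to help users write readable pattern attributes on <input> fields. I added an example of a password field in which you are not allowed to use the < and > characters. It shows how to handle this in the regex, by accepting an attribute with a single-quoted or double-quoted value.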
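To make the difference concrete, here is a minimal sketch that compares a naive tag pattern with a quote-aware one on the abbr example from the test data. Both patterns are shortened for illustration and are not the exact regex of the script.

```php
<?php
$html = '<abbr data-symbol=">" title="Greater than">gt</abbr>';

// Naive pattern: stops at the first ">", which sits inside the attribute value.
preg_match('~<[^>]+>~', $html, $naive);

// Quote-aware pattern: a quoted value is consumed as a whole, so a ">"
// between quotes does not end the tag.
preg_match('~<(?:(["\']).*?\1|[^>])+>~s', $html, $aware);

echo $naive[0], "\n"; // <abbr data-symbol=">
echo $aware[0], "\n"; // <abbr data-symbol=">" title="Greater than">
```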
Output:
Input HTML:
============================================================
<p> Hello < 30 </p> < < < <!-- Test --> <> > > >
<p><span class="icon icon-print">print</SPAN></p>
<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" /><!-- with self-closing slash -->
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age"><!-- without self-closing slash -->
Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>
<nav class="floating-nav">
<ul class="menu">
<li><a href="/">Home</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</nav>
Test with spaces: <
textarea id="text"
name="text"
class="decorative"
>Some text< / textarea>
STEP 1 : Strip out HTML comments
============================================================
<p> Hello < 30 </p> < < < <> > > >
<p><span class="icon icon-print">print</SPAN></p>
<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" />
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age">
Shit should not happen with malformed HTML --> <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[><]).{8,}" name="password">
<abbr data-symbol=">" title="Greater than">gt</abbr>
<nav class="floating-nav">
<ul class="menu">
<li><a href="/">Home</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</nav>
Test with spaces: <
textarea id="text"
name="text"
class="decorative"
>Some text< / textarea>
STEP 2 : replace all HTML tags by [[tag]]
============================================================
[[p]] Hello < 30 [[/p]] < < < <> > > >
[[p]][[span class="icon icon-print"]]print[[/SPAN]][[/p]]
[[LABEL for="firstname"]]First name:[[/LABEL]]
[[input required type="text" id="firstname" name="firstname" /]]
[[label for="age"]]Age:[[/label]]
[[INPut required type="number" id="age" name="age"]]
Shit should not happen with malformed HTML --> [[p id="paragraph-58"]]Isn't closed
Or something not opened [[/div]]
Be carefull with ">" in tag attribute values (seems to be allowed unescaped):
[[input type="password" pattern="(?!.*[><]).{8,}" name="password"]]
[[abbr data-symbol=">" title="Greater than"]]gt[[/abbr]]
[[nav class="floating-nav"]]
[[ul class="menu"]]
[[li]][[a href="/"]]Home[[/a]][[/li]]
[[li]][[a href="/contact"]]Contact[[/a]][[/li]]
[[/ul]]
[[/nav]]
Test with spaces: [[
textarea id="text"
name="text"
class="decorative"
]]Some text[[ / textarea]]
STEP 3 : replace all < and > by &lt; and &gt;
============================================================
[[p]] Hello &lt; 30 [[/p]] &lt; &lt; &lt; &lt;&gt; &gt; &gt; &gt;
[[p]][[span class="icon icon-print"]]print[[/SPAN]][[/p]]
[[LABEL for="firstname"]]First name:[[/LABEL]]
[[input required type="text" id="firstname" name="firstname" /]]
[[label for="age"]]Age:[[/label]]
[[INPut required type="number" id="age" name="age"]]
Shit should not happen with malformed HTML --&gt; [[p id="paragraph-58"]]Isn't closed
Or something not opened [[/div]]
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
[[input type="password" pattern="(?!.*[&gt;&lt;]).{8,}" name="password"]]
[[abbr data-symbol="&gt;" title="Greater than"]]gt[[/abbr]]
[[nav class="floating-nav"]]
[[ul class="menu"]]
[[li]][[a href="/"]]Home[[/a]][[/li]]
[[li]][[a href="/contact"]]Contact[[/a]][[/li]]
[[/ul]]
[[/nav]]
Test with spaces: [[
textarea id="text"
name="text"
class="decorative"
]]Some text[[ / textarea]]
STEP 4 : replace back [[tag]] by <tag>
============================================================
<p> Hello &lt; 30 </p> &lt; &lt; &lt; &lt;&gt; &gt; &gt; &gt;
<p><span class="icon icon-print">print</SPAN></p>
<LABEL for="firstname">First name:</LABEL>
<input required type="text" id="firstname" name="firstname" />
<label for="age">Age:</label>
<INPut required type="number" id="age" name="age">
Shit should not happen with malformed HTML --&gt; <p id="paragraph-58">Isn't closed
Or something not opened </div>
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
<input type="password" pattern="(?!.*[&gt;&lt;]).{8,}" name="password">
<abbr data-symbol="&gt;" title="Greater than">gt</abbr>
<nav class="floating-nav">
<ul class="menu">
<li><a href="/">Home</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</nav>
Test with spaces: <
textarea id="text"
name="text"
class="decorative"
>Some text< / textarea>
STEP 5 : Strip the HTML tags with strip_tags()
============================================================
Hello &lt; 30 &lt; &lt; &lt; &lt;&gt; &gt; &gt; &gt;
print
First name:
Age:
Shit should not happen with malformed HTML --&gt; Isn't closed
Or something not opened
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
gt
Home
Contact
Test with spaces: <
textarea id="text"
name="text"
class="decorative"
>Some text< / textarea>
STEP 6 : Check what strip_tags() did
============================================================
Oups! strip_tags() didn't clean up everything!
Found 2 occurrences:
- match 1 : <
textarea id="text"
name="text"
class="decorative"
>
- match 2 : < / textarea>
Let's try to do it ourselves, by replacing the matched tags by nothing.
Hello &lt; 30 &lt; &lt; &lt; &lt;&gt; &gt; &gt; &gt;
print
First name:
Age:
Shit should not happen with malformed HTML --&gt; Isn't closed
Or something not opened
Be carefull with "&gt;" in tag attribute values (seems to be allowed unescaped):
gt
Home
Contact
Test with spaces: Some text
As you can see, strip_tags() does not handle spaces around the tag name, which I find completely unsafe! This is why I suggest using a library such as HTMLPurifier or a DOM parser.
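As a rough idea of the DOM parser route, the sketch below extracts the text of a fragment with PHP's DOMDocument. It assumes the DOM extension is available; the function name htmlToText and the wrapping <div> are illustrative choices. How a lone < is treated depends on the underlying libxml parser, so this is a starting point to test against your own input rather than a drop-in replacement. The returned text is decoded, so it still has to be escaped (for example with htmlspecialchars()) before it is printed into HTML, and non-ASCII input needs an explicit encoding hint.

```php
<?php
// Minimal sketch, assuming the DOM extension (ext-dom) is enabled.
// htmlToText() is a hypothetical helper name.
function htmlToText(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The wrapper <div> keeps loadHTML() from receiving an empty string.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // textContent concatenates the text nodes; the tags themselves are dropped.
    $root = $doc->documentElement;
    return $root === null ? '' : $root->textContent;
}

echo htmlToText('<p>Hello <b>world</b></p>'), "\n"; // Hello world
```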
Comments:
htmlentities