Replace all HTML tags except some using Regex in Python

Q:

I'm trying to replace all HTML tags, with two exceptions, using regex and Python.

Right now this is the regex I'm using:

<\/?(?!(?:em|strong)\b)[a-z](?:[^>\"']|\"[^\"]*\"|'[^']*')*>

This is the sample text I'm trying to "clean":

<p><em>test text</em></p>
<p><strong>test test test</strong></p>
<div>some other text</div>
<div></div>
<div></div>
<div>

The output is as follows:

test text
test test test
some other text

Even though it should be:

<em>test text</em>
<strong>test test test</strong>
some other text

The code I'm using to test everything is:

import re

text = '''<p><em>test text</em></p>
<p><strong>test test test</strong></p>
<div>some other text</div>
<div></div>
<div></div>
<div>'''

res = re.sub("<\/?(?!(?:em|strong)\b)[a-z](?:[^>\"']|\"[^\"]*\"|'[^']*')*>", '', text)

print(res.strip())

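I believe regex is the right tool here, because the text I have to clean comes from an API, so its format is not going to be extended in any way.

Any suggestions? Why is it replacing the "em" and "strong" tags even though there is a negative lookahead?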
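
One likely cause, offered as a sketch rather than a definitive diagnosis: the pattern above is written as an ordinary (non-raw) string literal, and in such a literal Python turns "\b" into the backspace character (0x08) before the regex engine ever sees it. The lookahead then looks for "em" or "strong" followed by a backspace, never finds it, and so the negative lookahead always succeeds. Assuming that is the problem, passing the pattern as a raw string keeps "\b" as a word boundary. The name TAG_RE below is only illustrative and is not part of the original code.

```python
import re

# In a non-raw string literal, "\b" is the backspace character (0x08),
# so the regex engine never receives the word-boundary escape.
assert "\b" == "\x08"
assert r"\b" == "\\b"  # raw literal: a backslash followed by the letter b

# Raw triple-quoted literal: backslashes reach the regex engine untouched,
# and the quote characters inside the pattern need no escaping.
# TAG_RE is a hypothetical name chosen for this sketch.
TAG_RE = re.compile(r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""")

text = '''<p><em>test text</em></p>
<p><strong>test test test</strong></p>
<div>some other text</div>
<div></div>
<div></div>
<div>'''

# Remove every tag except <em> and <strong>, then trim the trailing blank lines.
cleaned = TAG_RE.sub('', text).strip()
print(cleaned)
```

Because the "\b" now sits right after the alternation, a tag such as <embed> is still removed: "em" followed by "b" is not at a word boundary, so the lookahead does not protect it.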

python-3.x regex
