Split text on markup in Python

Question:

I have the following line of text:

<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>

Using Python, I want to split on the markup entities to get the following list:

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'stuff', '</pre>']

So far I have used:

import re

markup = re.compile(r"(<(?P<tag>[a-z]+).*>)(.*?)(<\/(?P=tag)>)")
text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
words = re.split(markup, text)

But it produces:

['<code>', 'code', 'stuff', '</code>', ' and stuff and $\\LaTeX$ ', '<pre class="mermaid">', 'pre', 'stuff', '</pre>']

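The problem is that the named group tag is added to the list because it is captured. I capture it only so that the backreference (?P=tag) can find the closest closing tag.

Assuming the code only processes one line at a time, how can I get rid of it in the resulting list?

Tags: python, regex, split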

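s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

# Start with one empty token so text before the first "<" has somewhere to go.
tokens = [""]

for ch in s:
    if ch == ">":
        # Finish the current tag, then open a fresh token for what follows.
        tokens[-1] += ch
        tokens.append("")
    elif ch == "<":
        # A "<" always starts a new token.
        tokens.append(ch)
    else:
        tokens[-1] += ch

# Drop the empty tokens left at the edges and between adjacent tags.
tokens = [t for t in tokens if t]
print(tokens)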

Output: ['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
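If staying with re.split is preferred, one possible sketch (not taken from any of the answers here) is to split on a single capturing group that matches a whole tag. Because no second group is needed for a backreference, nothing extra ends up in the result. The names TAG and split_on_tags are hypothetical, and the pattern assumes that no attribute value contains a ">" character.

```python
import re

# One capturing group around the whole tag: re.split keeps captured text,
# so every tag shows up in the result exactly once.
TAG = re.compile(r"(<[^>]+>)")


def split_on_tags(line):
    # re.split yields empty strings at the edges and between adjacent tags;
    # filter them out.
    return [piece for piece in TAG.split(line) if piece]


text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
print(split_on_tags(text))
```

This does not check that an opening tag has a matching closing tag; it only separates tags from the text between them.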

3 upvotes ApaxPhoenix 8/9/2023 #2

You can use xml, a module designed for XML files, a format closely related to HTML. Note that it only accepts well-formed markup.

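import xml.etree.ElementTree as ET

text = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

# Wrap the fragment in a dummy root so it parses as a single XML document.
root = ET.fromstring(f'<root>{text}</root>')

result = []

# Text before the first child element is stored on the root itself.
if root.text:
    result.append(root.text)

# Only direct children are visited; nested tags are not expanded.
for element in root:
    # Rebuild the opening tag, including its attributes.
    attrs = ''.join(f' {k}="{v}"' for k, v in element.attrib.items())
    result.append(f'<{element.tag}{attrs}>')
    if element.text:
        result.append(element.text)
    result.append(f'</{element.tag}>')
    # The tail is the text between this element's end and the next tag.
    if element.tail:
        result.append(element.tail)

print(result)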

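2 upvotes Luatic 8/9/2023 #3

RegEx is not suited for parsing HTML. However, it is often sufficient for tokenization. Using re.finditer, tokenization becomes a one-liner:

list(map(lambda x: x.group(0), re.finditer(r"(?:<(?:.*?>)?)|[^<]+", s)))

Explanation:

Only non-capturing groups (?:...) are used; we don't need specific captures here.
Match either a "tag" with <(?:.*?>)? (possibly invalid, that is just a lone < symbol; it is identified only by its opening < and runs up to the next >) or plain text with [^<]+.

This handles your test case

s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'

correctly, producing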

['<code>', 'stuff', '</code>', ' and stuff and $\\LaTeX$ and ', '<pre class="mermaid">', 'stuff', '</pre>']
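The one-liner above assumes that re is imported and s is defined. A self-contained sketch of the same pattern follows; the function name tokenize_regex is hypothetical. Since the pattern has no capturing groups, re.findall returns the full matches directly, so it is used here in place of re.finditer with group(0).

```python
import re

# Either a "tag" (a "<" optionally followed by everything up to the next ">")
# or a run of plain text containing no "<".
TOKEN = re.compile(r"(?:<(?:.*?>)?)|[^<]+")


def tokenize_regex(line):
    # With no capturing groups in the pattern, findall returns whole matches.
    return TOKEN.findall(line)


s = r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'
print(tokenize_regex(s))

# A lone "<" with no closing ">" becomes a one-character token.
print(tokenize_regex('a < b'))
```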

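But note that a full-fledged HTML tokenizer needs a more complex regular grammar to correctly handle attributes such as onclick = "console.log(1 < 2)". You are better off using an off-the-shelf library to do the markup parsing (or even just the tokenization) for you.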
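As one illustration of the library route, here is a minimal sketch built on html.parser.HTMLParser from the standard library. The class name MarkupTokenizer and the function tokenize are hypothetical, and the answer itself does not name a specific library. Start tags are taken verbatim via get_starttag_text(), while end tags are rebuilt from the tag name, so their original casing and spacing are not preserved.

```python
from html.parser import HTMLParser


class MarkupTokenizer(HTMLParser):
    """Collect start tags, end tags, and text runs in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written,
        # attributes included.
        self.tokens.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Keep self-closing tags such as <br/> as a single token.
        self.tokens.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the lowercased tag name is passed in, so rebuild the tag.
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.append(data)


def tokenize(line):
    parser = MarkupTokenizer()
    parser.feed(line)
    parser.close()  # flush any text still buffered after the last tag
    return parser.tokens


print(tokenize(r'<code>stuff</code> and stuff and $\LaTeX$ and <pre class="mermaid">stuff</pre>'))
print(tokenize('<button onclick="console.log(1 < 2)">go</button>'))
```

Because the parser understands quoted attribute values, the whole button start tag stays in one token. Note that character references such as &amp; in text are decoded by default, so text tokens may differ from the raw input.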
