Need assistance with a regex pattern in Python: parsing complex HTML structures

Q:

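I'm trying to parse a complex HTML structure with Python's re module, but I've hit a wall with my regex pattern. Here's what I'm trying to do:

I have HTML text with nested elements, and I want to extract the content of the innermost tags. However, I can't get my regex pattern right. This is the code I'm using: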

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>(.*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

print(result)

I expected this code to return the content of the innermost elements, like this:

['Innermost Content 1', 'Innermost Content 2']

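But it doesn't work as expected. What is wrong with my regex pattern, and how can I fix it to get the desired result? Any help would be greatly appreciated!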

python regex parsing web-scraping html-parsing

import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

pattern = r'<div>([^<]*?)<\/div>'
result = re.findall(pattern, html_text, re.DOTALL)

result = [content.strip() for content in result if content.strip()]

print(result)
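
To see what changed, here is a small side-by-side sketch of the two patterns on the same input (the names `lazy_any` and `no_tags` are made up for this illustration). The lazy `.*?` starts at the outermost `<div>` and stops at the first `</div>`, so its first capture still contains the nested opening tags. `[^<]*?` cannot cross a `<`, so it only matches where plain text sits directly between `<div>` and `</div>`.

```python
import re

html_text = """
<div>
    <div>
        <div>
            Innermost Content 1
        </div>
    </div>
    <div>
        Innermost Content 2
    </div>
</div>
"""

# Original pattern: lazy, but '.' is still allowed to run across '<div>' tags.
lazy_any = re.findall(r'<div>(.*?)</div>', html_text, re.DOTALL)

# Revised pattern: the capture may not contain '<', so it cannot span a tag.
# [^<] already matches newlines, so re.DOTALL is not needed here.
no_tags = re.findall(r'<div>([^<]*?)</div>', html_text)

print('<div>' in lazy_any[0])        # True: nested tags leaked into the capture
print([s.strip() for s in no_tags])  # ['Innermost Content 1', 'Innermost Content 2']
```

This only holds as long as the innermost elements contain plain text and the tags are bare `<div>` with no attributes.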

0 votes LetzerWille 9/2/2023 #3

You can use re.split():

print([st.strip() for st in re.split(r'<div>\n?|</div>\n?|\n', html_text) if not st.isspace() and st])

['Innermost Content 1', 'Innermost Content 2']
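
Both regex approaches depend on the exact shape of the sample markup. If the real pages have attributes on the tags or other elements mixed in, a parser is likely to hold up better. Below is a minimal sketch using the standard library's `html.parser`, assuming "innermost" means a `<div>` that contains no child `<div>`. The class `InnermostDivText`, the helper `innermost_div_text`, and the `sample` string are hypothetical names and data for illustration, not part of the question.

```python
from html.parser import HTMLParser


class InnermostDivText(HTMLParser):
    """Collect text from <div> elements that contain no child <div>."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # one entry per open <div>: [text_chunks, has_child_div]
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.open_divs:
                # The enclosing <div> has a child <div>, so it is not innermost.
                self.open_divs[-1][1] = True
            self.open_divs.append([[], False])

    def handle_data(self, data):
        # Attribute text to the nearest open <div>, if any.
        if self.open_divs:
            self.open_divs[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.open_divs:
            chunks, has_child_div = self.open_divs.pop()
            text = ''.join(chunks).strip()
            if not has_child_div and text:
                self.found.append(text)


def innermost_div_text(html):
    parser = InnermostDivText()
    parser.feed(html)
    parser.close()
    return parser.found


sample = ("<div><div class='a'><div>Innermost Content 1</div></div>"
          "<div>Innermost Content 2</div></div>")
print(innermost_div_text(sample))  # ['Innermost Content 1', 'Innermost Content 2']
```

Unlike the patterns above, this tolerates attributes such as `class='a'` on the tags, at the cost of a few more lines.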
