How do I make BeautifulSoup ignore any indents in original HTML when getting text

Asked by: clel Asked: 9/27/2023 Last edited by: clel Updated: 9/27/2023 Views: 124

Q:

I guess I basically want the opposite of what the prettify() function does.

Given HTML code (an excerpt) such as:

      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>

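How do I get the text out of it without the line breaks and indentation, while looping over the tree recursively so that nested tags are covered as well?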

After parsing and processing, the result should look like this:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.

Next paragraph with more text.

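Also, for further processing, it would be good to get the contents of the italic tags separately in Python.

This means (simplified; in reality, I want to call pylatex functions to write a document):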

result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        result += subchild.string

Obviously, this should also work for more complex examples:

      <p>
        Test text with something in it
        Test text with <i>something <b>in
        <em>it</em> test</b></i>
        <i>and <em>italic</em> text</i> inside that text.
        Test text with something in it.
      </p>

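Most of this is not complicated, but how do I correctly handle line breaks and whitespace in nested text?

Browsers seem to render this correctly.

If BeautifulSoup cannot do this, another Python library would be fine too.

I am surprised that BeautifulSoup does not handle this by default, and I have not found any function that does what I want.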

python html parsing beautifulsoup

Comments


A:

0 votes Andrej Kesely 9/27/2023 #1

You can use .get_text() with the right parameters (strip=True and separator=):

import re

from bs4 import BeautifulSoup

html_text = """\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""

soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    return re.sub(r"\s{2,}", " ", t)


# replace all <i></i> with \textit{ ... }
for i in soup.select("i"):
    i.replace_with("\\textit{{{}}}".format(i.text))

for p in soup.select("p"):
    print(my_get_text(p))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
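
One caveat, sketched on plain strings: the pattern \s{2,} only collapses runs of two or more whitespace characters, so a lone newline between lines that are not indented would survive. Assuming every run of whitespace should become a single space (roughly what a browser does for normal text), splitting and re-joining is a simple alternative. The helper name collapse_whitespace is made up for this illustration and is not part of BeautifulSoup.

```python
import re


def collapse_whitespace(text):
    """Turn every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() without arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so re-joining yields single spaces.
    return " ".join(text.split())


indented = "Test text\n        with indentation"
unindented = "Test text\nwithout indentation"

# The regex from the answer handles the indented case ...
print(re.sub(r"\s{2,}", " ", indented))
# ... but leaves a lone newline in place, so this prints on two lines.
print(re.sub(r"\s{2,}", " ", unindented))
# Splitting and re-joining collapses both cases.
print(collapse_whitespace(indented))
print(collapse_whitespace(unindented))
```

Using r"\s+" in the regular expression instead of r"\s{2,}" would have the same effect on the lone newline; the split/join form additionally trims both ends.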

EDIT: Using recursion:

import re

from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""


soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    return re.sub(r"\s{2,}", " ", t)


def get_text(tag):
    s = []
    for c in tag.contents:
        match c:
            case NavigableString():
                if c := my_get_text(c):
                    s.append(c)
            case Tag() if c.name == "p":
                yield from get_text(c)
            case Tag() if c.name == "i":
                s.append("\\textit{{{}}}".format(c.text))
    if s:
        yield s


for t in get_text(soup):
    print(" ".join(t))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.

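Comments

0 votes clel 9/27/2023
Thanks! I need to see whether this helps with my use case. I may have simplified the example too much. What if I want to process the italic text separately?
0 votes Andrej Kesely 9/27/2023
@clel Are you converting HTML to Markdown? Maybe this will help: pypi.org/project/markdownify? But I am sure there are more libraries.
0 votes clel 9/27/2023
Unfortunately, I am not converting to Markdown; that was just a simple example. The goal is to convert to LaTeX, although a generic answer would probably be best for the sake of versatility.
0 votes clel 9/27/2023
p.get_text(strip=True, separator=" ") does not work for multi-line indented text. At least when I tested it, it only worked for the first line of a paragraph, not for several.
1 vote clel 9/27/2023
Thanks! I will have a look tomorrow.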

1 vote PierXuY 9/27/2023 #2

You can do this with lxml. Compared with BeautifulSoup, it gives you more freedom in some respects:

import re
import textwrap
from lxml import etree

html_str = """
  <p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
  </p>
  <p>
    Next paragraph with more text.
  </p>
"""
root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {"i": lambda x: "\\textit{%s}" % x}
# tags that do not require additional line breaks
not_lb_tags = ["i"]

result = ""
for elem in root.iterdescendants():
    tag = elem.tag
    if result and tag not in not_lb_tags:
        result += "\n"

    # not None, then remove indent from paragraph text, replace \n with white space and strip
    if (text := elem.text) and (
        text := textwrap.dedent(text).replace("\n", " ").strip()
    ):
        # add spaces to separate text
        if tag in not_lb_tags:
            result += " "
        # convert excess space into a single
        text = re.sub(r"\s{2,}", " ", text)
        if tag in handle_tag_dict:
            text = handle_tag_dict[tag](text)
        result += text

    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail

print(result)

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
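
The reason the loop above treats elem.text and elem.tail separately is the ElementTree data model that lxml follows: text that comes after a child's closing tag is stored on the child as its tail, not on the parent. A minimal illustration, assuming the standard library's xml.etree.ElementTree as a stand-in for lxml (it exposes the same text and tail attributes) and a well-formed one-line snippet:

```python
import xml.etree.ElementTree as ET

# One paragraph with text before and after an inline element.
p = ET.fromstring("<p>before <i>italic</i> after</p>")
i = p[0]  # the <i> child

print(repr(p.text))  # text of <p> up to its first child
print(repr(i.text))  # text inside <i>
print(repr(i.tail))  # text following </i>, stored on <i> itself
print(repr(p.tail))  # nothing follows </p> here
```

So a traversal that only reads elem.text would silently drop " after"; it has to be picked up from the tail of the i element.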

Using recursion to handle nested structures:

import re
import textwrap
from lxml import etree

html_str = """
  <p>
    Test text with something in it
    Test text with <i>something <b>in
    <em>it</em> test</b></i>
    <i>and <em>italic</em> text</i> inside that text.
    Test text with something in it.
  </p>
    <p>
        Next paragraph with more text.
  </p>
"""
root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {
    "i": lambda x: "\\textit{%s}" % x,
    "b": lambda x: "\\textbf{%s}" % x,
}

# https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements#Elements
# https://github.com/gawel/pyquery/blob/0a7cbf0c21132727d0ebf6fb5b78120d4a037221/pyquery/text.py#L5
INLINE_TAGS = {
    'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'br', 'button', 'cite',
    'code', 'dfn', 'em', 'i', 'img', 'input', 'kbd', 'label', 'map',
    'object', 'q', 'samp', 'script', 'select', 'small', 'span', 'strong',
    'sub', 'sup', 'textarea', 'time', 'tt', 'var'
}

def get_text(elem: etree._Element) -> str:
    result = ""
    tag = elem.tag

    if text := elem.text:
        # remove indent from paragraph text, replace \n with white space and strip
        text = textwrap.dedent(text).replace("\n", " ").strip()
        # convert excess space into a single
        text = re.sub(r"\s{2,}", " ", text)
    else:
        text = ""

    # Recurse into the direct children; iterating the element replaces the
    # deprecated getchildren() call and does nothing when there are none.
    for child in elem:
        text += get_text(child)

    if tag in handle_tag_dict:
        text = handle_tag_dict[tag](text)

    if tag in INLINE_TAGS:
        result += " " + text
    else:
        result += "\n" + text

    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail

    return result


ret = get_text(root).lstrip()
print(ret)

Prints:

Test text with something in it Test text with \textit{something \textbf{in it test}} \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
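
For comparison, the same idea can be sketched without third-party packages: emit the LaTeX wrappers as tags open and close, and normalize whitespace only once the whole paragraph has been assembled, so that spaces around nested tags are kept where the source has them. This is a rough sketch using the standard library's html.parser, not a drop-in replacement for the code above. The class name ParagraphToLatex and the tag-to-command mapping (including em to \emph) are assumptions made for illustration, and it expects every <p> to be closed explicitly.

```python
import re
from html.parser import HTMLParser

# Hypothetical mapping from inline HTML tags to LaTeX commands (an assumption).
LATEX_COMMANDS = {"i": "textit", "em": "emph", "b": "textbf"}


class ParagraphToLatex(HTMLParser):
    """Collect one whitespace-normalized string per <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraphs
        self._parts = None    # pieces of the current paragraph, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        elif self._parts is not None and tag in LATEX_COMMANDS:
            self._parts.append("\\%s{" % LATEX_COMMANDS[tag])

    def handle_endtag(self, tag):
        if self._parts is None:
            return  # ignore anything outside a paragraph
        if tag == "p":
            # Collapse whitespace only now, over the assembled paragraph.
            text = re.sub(r"\s+", " ", "".join(self._parts)).strip()
            self.paragraphs.append(text)
            self._parts = None
        elif tag in LATEX_COMMANDS:
            self._parts.append("}")

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


html_text = """\
      <p>
        Test text with <i>something <b>in
        <em>it</em> test</b></i>
        <i>and <em>italic</em> text</i> inside that text.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""

parser = ParagraphToLatex()
parser.feed(html_text)
parser.close()
for paragraph in parser.paragraphs:
    print(paragraph)
```

Tags that are not in the mapping contribute only their text, so adding another command is a matter of extending LATEX_COMMANDS.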

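Comments

0 votes clel 9/27/2023
I think BeautifulSoup might have logic similar to iterdescendants(). At first I thought lxml might be different here. Unfortunately, this approach currently fails when tags are nested more deeply than in the example. I have added a more complex example to the question.
0 votes PierXuY 9/27/2023
That probably requires recursing level by level.
0 votes clel 9/27/2023
Exactly what I thought. This may already be partly handled in the other answer.
0 votes PierXuY 9/27/2023
Yes. It may be a bit cumbersome, but I think it is doable by combining the current solutions. Good luck.
0 votes PierXuY 9/27/2023
I have added a solution that handles nested structures recursively; it may work for you and is offered for reference.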

0 votes pkExec 9/27/2023 #3

I would use PyQuery and its text() method.

from pyquery import PyQuery

doc = PyQuery("""\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
""")
print(doc.text())
#Just italics:
print([i.text() for i in doc.items('i')])

By default, the text() method squashes newlines, which is what you want. If you want to keep the line breaks, pass the parameter squash_space=False.

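Comments

0 votes clel 9/27/2023
I played with it a bit. Unfortunately, I have not yet found a way to process the items inside the i element and build modified text from them. It might work using the contents() method; not sure.
0 votes pkExec 9/27/2023
If you feel my answer helped with your original question, you can accept it. You might consider asking a new question about what you tried with the italic items that did not work.
0 votes clel 9/27/2023
But that is already part of my original question, and your answer does not cover that aspect at the moment.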