How do I make BeautifulSoup ignore any indents in original HTML when getting text

Question:

I think I basically want the opposite of what the prettify() function does.

When one has HTML code (an excerpt) like:

      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>

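How does one get the text inside it without the line breaks and indents, while looping recursively over the tree so that nested tags are covered as well?

The result after parsing and processing should look like this: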

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.

Next paragraph with more text.

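Also, for further processing, it would be good to get the content of the italic tags separately in Python.

That means (simplified; in reality, I want to call pylatex functions to write the document):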

result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        # .string is None for tags with several children, so fall back to ""
        result += subchild.string or ""

Obviously, this should also work for more complex examples:

      <p>
        Test text with something in it
        Test text with <i>something <b>in
        <em>it</em> test</b></i>
        <i>and <em>italic</em> text</i> inside that text.
        Test text with something in it.
      </p>

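Most of this is not complicated, but how does one correctly handle the line breaks and whitespace of nested text?

Browsers seem to render this correctly.

If BeautifulSoup can't do this, another Python library would be fine as well.

I'm shocked that this isn't handled by default in BeautifulSoup, and I haven't found any function that does what I want either.

Tags: python, html, parsing, beautifulsoup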

Answer:

import re

from bs4 import BeautifulSoup

html_text = """\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""

soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    return re.sub(r"\s{2,}", " ", t)


# replace all <i></i> with \textit{ ... }
for i in soup.select("i"):
    i.replace_with("\\textit{{{}}}".format(i.text))

for p in soup.select("p"):
    print(my_get_text(p))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
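
As a point of comparison, here is a minimal sketch of an alternative normalization step. It assumes it is acceptable to collapse every run of whitespace in a paragraph to a single space, and it uses plain get_text() followed by str.split() and str.join() instead of the separator/regex combination above. The shortened sample HTML is only for illustration.

```python
from bs4 import BeautifulSoup

html_text = """\
      <p>
        Test text with something in it
        <i>and italic text</i> inside that text.
      </p>
"""

soup = BeautifulSoup(html_text, "html.parser")
p = soup.find("p")

# Plain get_text() keeps the newlines and indentation of the source.
raw = p.get_text()

# split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace, so joining with " " normalizes the text.
normalized = " ".join(raw.split())

print(repr(raw))
print(normalized)
```

One difference worth noting: because no separator is inserted between strings, markup such as <b>foo</b>bar stays "foobar", whereas get_text(separator=" ") would produce "foo bar".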

EDIT: Using recursion:

import re

from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
"""


soup = BeautifulSoup(html_text, "html.parser")


def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    return re.sub(r"\s{2,}", " ", t)


def get_text(tag):
    s = []
    for c in tag.contents:
        match c:
            case NavigableString():
                if c := my_get_text(c):
                    s.append(c)
            case Tag() if c.name == "p":
                yield from get_text(c)
            case Tag() if c.name == "i":
                s.append("\\textit{{{}}}".format(c.text))
    if s:
        yield s


for t in get_text(soup):
    print(" ".join(t))

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
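
The generator above only handles <i> directly inside <p>. For the more deeply nested example from the question, a fully recursive variant might look like the following sketch. The names LATEX_COMMANDS, render and paragraph_text are hypothetical, and the mapping of em to \emph is an assumption that the question does not specify.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical mapping from HTML tag names to LaTeX commands (an assumption).
LATEX_COMMANDS = {"i": "\\textit", "b": "\\textbf", "em": "\\emph"}

html_text = """\
      <p>
        Test text with something in it
        Test text with <i>something <b>in
        <em>it</em> test</b></i>
        <i>and <em>italic</em> text</i> inside that text.
        Test text with something in it.
      </p>
"""


def render(node):
    """Return the raw text of node, wrapping mapped tags in LaTeX commands."""
    if isinstance(node, NavigableString):
        return str(node)
    # Recurse into every child so arbitrarily deep nesting is covered.
    inner = "".join(render(child) for child in node.children)
    command = LATEX_COMMANDS.get(node.name)
    return f"{command}{{{inner}}}" if command else inner


def paragraph_text(p):
    # Collapse whitespace once, after the whole paragraph has been rendered.
    return " ".join(render(p).split())


soup = BeautifulSoup(html_text, "html.parser")
for p in soup.find_all("p"):
    print(paragraph_text(p))
```

Whitespace is normalized only at the paragraph level, so a space just inside a tag boundary (for example <i> text </i>) would end up inside the braces. HTML comments are also NavigableString instances and would need to be filtered out if the input contains them.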

Answer:

import re
import textwrap
from lxml import etree

html_str = """
  <p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
  </p>
  <p>
    Next paragraph with more text.
  </p>
"""
root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {"i": lambda x: "\\textit{%s}" % x}
# tags that do not require additional line breaks
not_lb_tags = ["i"]

result = ""
for elem in root.iterdescendants():
    tag = elem.tag
    if result and tag not in not_lb_tags:
        result += "\n"

    # not None, then remove indent from paragraph text, replace \n with white space and strip
    if (text := elem.text) and (
        text := textwrap.dedent(text).replace("\n", " ").strip()
    ):
        # add spaces to separate text
        if tag in not_lb_tags:
            result += " "
        # convert excess space into a single
        text = re.sub(r"\s{2,}", " ", text)
        if tag in handle_tag_dict:
            text = handle_tag_dict[tag](text)
        result += text

    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail

print(result)

Prints:

Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
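
The loop above reads both elem.text and elem.tail because lxml has no separate text nodes: the text that follows a closing tag is stored on that element as its tail. A minimal sketch with a made-up one-line document shows where each piece ends up:

```python
from lxml import etree

root = etree.HTML("<p>before <i>italic</i> after</p>")
p = root.find(".//p")
i = p.find("i")

print(repr(p.text))  # 'before '  -> text before the first child of <p>
print(repr(i.text))  # 'italic'   -> text inside <i>
print(repr(i.tail))  # ' after'   -> text after </i>, stored on the <i> element
print(repr(p.tail))  # None       -> nothing follows </p>
```

Skipping tail would silently drop everything that follows an inline element inside its parent.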

Using recursion to handle nested structures:

import re
import textwrap
from lxml import etree

html_str = """
  <p>
    Test text with something in it
    Test text with <i>something <b>in
    <em>it</em> test</b></i>
    <i>and <em>italic</em> text</i> inside that text.
    Test text with something in it.
  </p>
    <p>
        Next paragraph with more text.
  </p>
"""
root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {
    "i": lambda x: "\\textit{%s}" % x,
    "b": lambda x: "\\textbf{%s}" % x,
}

# https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements#Elements
# https://github.com/gawel/pyquery/blob/0a7cbf0c21132727d0ebf6fb5b78120d4a037221/pyquery/text.py#L5
INLINE_TAGS = {
    'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'br', 'button', 'cite',
    'code', 'dfn', 'em', 'i', 'img', 'input', 'kbd', 'label', 'map',
    'object', 'q', 'samp', 'script', 'select', 'small', 'span', 'strong',
    'sub', 'sup', 'textarea', 'time', 'tt', 'var'
}

def get_text(elem: etree._Element) -> str:
    result = ""
    tag = elem.tag

    if text := elem.text:
        # remove indent from paragraph text, replace \n with white space and strip
        text = textwrap.dedent(text).replace("\n", " ").strip()
        # convert excess space into a single
        text = re.sub(r"\s{2,}", " ", text)
    else:
        text = ""

    # iterating an lxml element yields its direct children
    # (getchildren() is deprecated)
    for child in elem:
        text += get_text(child)

    if tag in handle_tag_dict:
        text = handle_tag_dict[tag](text)

    if tag in INLINE_TAGS:
        result += " " + text
    else:
        result += "\n" + text

    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail

    return result


ret = get_text(root).lstrip()
print(ret)

Prints:

Test text with something in it Test text with \textit{something \textbf{in it test}} \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.

Answer:

from pyquery import PyQuery

doc = PyQuery("""\
      <p>
        Test text with something in it
        Test text with something in it
        <i>and italic text</i> inside that text.
        Test text with something in it.
      </p>
      <p>
        Next paragraph with more text.
      </p>
""")
print(doc.text())
#Just italics:
print([i.text() for i in doc.items('i')])

By default, the text() method squashes newlines, which is what you want. If you want to keep the line breaks, use the parameter squash_space=False.
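
For readers who want to stay with BeautifulSoup, the "just italics" step has a close counterpart. The sketch below assumes the same whitespace rule as the rest of the thread (any run of whitespace becomes one space) and uses a shortened version of the nested example from the question.

```python
from bs4 import BeautifulSoup

html_text = """\
      <p>
        Test text with <i>something <b>in
        <em>it</em> test</b></i>
        <i>and <em>italic</em> text</i> inside that text.
      </p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# get_text() concatenates all nested strings of each <i>;
# split()/join() then collapses the newlines and indentation.
italics = [" ".join(i.get_text().split()) for i in soup.find_all("i")]
print(italics)
```

Note that find_all("i") also returns <i> tags nested inside other <i> tags, so such content would appear twice in the list.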
