Asked by: clel Asked: 9/27/2023 Last edited by: clel Updated: 9/27/2023 Views: 124
How do I make BeautifulSoup ignore any indents in original HTML when getting text
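Q:
I guess I basically want the opposite of what the prettify() function does.
Given HTML code (an excerpt) such as:
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
How can I get the text out of it without the line breaks and indentation, while looping recursively over the tree so that nested tags are covered as well?
The result after parsing and processing should look like this: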
Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
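In addition, for further processing, it would be best to get the contents of the italic tags separately in Python.
That means (simplified; in reality, I want to call pylatex functions to write the document):
result = ""
for child in soup.children:
    for subchild in child.children:
        # Some processing
        result += subchild.string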
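Obviously, this should also work for more complex examples:
<p>
    Test text with something in it
    Test text with <i>something <b>in
    <em>it</em> test</b></i>
    <i>and <em>italic</em> text</i> inside that text.
    Test text with something in it.
</p>
Most of this is not complicated, but how do I correctly handle the line breaks and whitespace of nested text?
Browsers seem to render this correctly.
If BeautifulSoup cannot do this, another Python library would be fine too.
I am surprised that BeautifulSoup does not handle this by default, and I have not found any function that does what I want either.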
Answers:
0 votes
Andrej Kesely
9/27/2023
#1
You can use .get_text() with the right parameters (strip=True and separator=):
import re
from bs4 import BeautifulSoup

html_text = """\
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    # collapse every whitespace run (line breaks included) into one space
    return re.sub(r"\s+", " ", t)

# replace all <i></i> with \textit{ ... }
for i in soup.select("i"):
    i.replace_with("\\textit{{{}}}".format(i.text))

for p in soup.select("p"):
    print(my_get_text(p))
Prints:
Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
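The regular expression in my_get_text() is needed because strip=True only trims each text node at its two ends; line breaks and indentation inside a node are left alone. The following minimal sketch illustrates this on a made-up two-line paragraph (the sample HTML and the variable names are illustrative, not from the answer) and shows a regex-free alternative based on str.split():

```python
from bs4 import BeautifulSoup

html_text = """\
<p>
    first line
    second line
</p>
"""

p = BeautifulSoup(html_text, "html.parser").p

# strip=True trims each text node at its ends only, so the line break
# and the indentation between the two lines are still present.
stripped = p.get_text(strip=True, separator=" ")

# str.split() without arguments splits on any run of whitespace, so
# joining the pieces leaves exactly one space between words.
normalized = " ".join(p.get_text().split())

print(repr(stripped))
print(repr(normalized))
```

The split/join idiom behaves like re.sub(r"\s+", " ", text).strip() and avoids the regular expression.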
Edit: using recursion:
import re
from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """\
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

def my_get_text(tag):
    t = tag.get_text(strip=True, separator=" ")
    # collapse every whitespace run (line breaks included) into one space
    return re.sub(r"\s+", " ", t)

# the match statement requires Python 3.10 or newer
def get_text(tag):
    s = []
    for c in tag.contents:
        match c:
            case NavigableString():
                # skip text nodes that contain only whitespace
                if c := my_get_text(c):
                    s.append(c)
            case Tag() if c.name == "p":
                yield from get_text(c)
            case Tag() if c.name == "i":
                s.append("\\textit{{{}}}".format(c.text))
    if s:
        yield s

for t in get_text(soup):
    print(" ".join(t))
Prints:
Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
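The recursive version above only descends into p and wraps i. For arbitrarily nested inline tags, one possible generalization is to build the raw string recursively and normalize whitespace once per paragraph at the end, which is roughly how a browser collapses whitespace. This is only a sketch: the names LATEX_COMMANDS, to_latex and paragraphs, as well as the tag-to-command mapping, are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical mapping from HTML tag names to LaTeX command names.
LATEX_COMMANDS = {"i": "textit", "em": "emph", "b": "textbf"}


def to_latex(node):
    """Return the raw text of node, wrapping mapped tags in LaTeX commands."""
    if isinstance(node, NavigableString):
        return str(node)  # keep whitespace as-is; it is normalized later
    inner = "".join(to_latex(child) for child in node.children)
    command = LATEX_COMMANDS.get(node.name)
    return f"\\{command}{{{inner}}}" if command else inner


def paragraphs(soup):
    """Yield one whitespace-normalized line per <p> element."""
    for p in soup.find_all("p"):
        # split() drops line breaks and indentation, join() re-adds single spaces
        yield " ".join(to_latex(p).split())


html_text = """\
<p>
    Test text with <i>something <b>in
    <em>it</em> test</b></i>
    inside that text.
</p>
"""

lines = list(paragraphs(BeautifulSoup(html_text, "html.parser")))
for line in lines:
    print(line)
```

Normalizing only at the paragraph level keeps the spaces between a tag and its neighbouring text intact, which per-node stripping would lose. HTML comments are NavigableString subclasses too, so a real converter would probably want to filter them out.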
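Comments
0 votes
clel
9/27/2023
Thanks! I need to see whether this helps for my use case. I may have simplified the example too much. What should I do if I want to process the italic text separately?
0 votes
Andrej Kesely
9/27/2023
@clel Are you converting HTML to Markdown? Maybe this would help: pypi.org/project/markdownify? But I am sure there are more libraries.
0 votes
clel
9/27/2023
Unfortunately, I am not converting to Markdown; that was just a simple example. The goal is to convert to LaTeX, although a generic answer would probably be best for versatility.
0 votes
clel
9/27/2023
p.get_text(strip=True, separator=" ")
does not work for multi-line indented text. At least when I tested it, it only worked for the first line of a paragraph, not for several lines.
1 vote
clel
9/27/2023
Thanks! I will take a look tomorrow.
1 vote
PierXuY
9/27/2023
#2
You can use lxml for this. Compared with BeautifulSoup, it gives you more freedom in some respects: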
import re
import textwrap
from lxml import etree

html_str = """
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""

root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {"i": lambda x: "\\textit{%s}" % x}
# tags that do not require additional line breaks
not_lb_tags = ["i"]

result = ""
for elem in root.iterdescendants():
    tag = elem.tag
    if result and tag not in not_lb_tags:
        result += "\n"
    # if not None, remove the indent from the paragraph text, replace \n with a space and strip
    if (text := elem.text) and (
        text := textwrap.dedent(text).replace("\n", " ").strip()
    ):
        # add a space to separate inline text from what precedes it
        if tag in not_lb_tags:
            result += " "
        # collapse runs of whitespace into a single space
        text = re.sub(r"\s{2,}", " ", text)
        if tag in handle_tag_dict:
            text = handle_tag_dict[tag](text)
        result += text
    # the tail is the text that follows the element's closing tag
    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail

print(result)
Prints:
Test text with something in it Test text with something in it \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
Using recursion to handle nested structures:
import re
import textwrap
from lxml import etree

html_str = """
<p>
    Test text with something in it
    Test text with <i>something <b>in
    <em>it</em> test</b></i>
    <i>and <em>italic</em> text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
"""

root = etree.HTML(html_str)

# customize special element processing logic
handle_tag_dict = {
    "i": lambda x: "\\textit{%s}" % x,
    "b": lambda x: "\\textbf{%s}" % x,
}

# https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements#Elements
# https://github.com/gawel/pyquery/blob/0a7cbf0c21132727d0ebf6fb5b78120d4a037221/pyquery/text.py#L5
INLINE_TAGS = {
    'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'br', 'button', 'cite',
    'code', 'dfn', 'em', 'i', 'img', 'input', 'kbd', 'label', 'map',
    'object', 'q', 'samp', 'script', 'select', 'small', 'span', 'strong',
    'sub', 'sup', 'textarea', 'time', 'tt', 'var'
}

def get_text(elem: etree._Element) -> str:
    result = ""
    tag = elem.tag
    if text := elem.text:
        # remove the indent from the paragraph text, replace \n with a space and strip
        text = textwrap.dedent(text).replace("\n", " ").strip()
        # collapse runs of whitespace into a single space
        text = re.sub(r"\s{2,}", " ", text)
    else:
        text = ""
    # iterating over an element yields its direct children
    for child in elem:
        text += get_text(child)
    if tag in handle_tag_dict:
        text = handle_tag_dict[tag](text)
    # inline elements continue the current line, block elements start a new one
    if tag in INLINE_TAGS:
        result += " " + text
    else:
        result += "\n" + text
    if (tail := elem.tail) and (
        tail := textwrap.dedent(tail).replace("\n", " ").strip()
    ):
        result += " "
        tail = re.sub(r"\s{2,}", " ", tail)
        result += tail
    return result

ret = get_text(root).lstrip()
print(ret)
Prints:
Test text with something in it Test text with \textit{something \textbf{in it test}} \textit{and italic text} inside that text. Test text with something in it.
Next paragraph with more text.
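The question also asks for the contents of the italic tags on their own. With lxml, one way to sketch this is to iterate over the i elements and join their text with itertext(), which covers nested children but not the text that follows the closing tag. The sample HTML and the variable name italics below are made up for illustration.

```python
from lxml import etree

html_str = """
<p>
    Some <i>italic
    <b>and bold</b> text</i> here.
</p>
"""

root = etree.HTML(html_str)

# itertext() yields the element's own text plus the text and tails of its
# descendants; the tail of the <i> element itself (" here.") is excluded.
italics = [" ".join("".join(i.itertext()).split()) for i in root.iter("i")]

print(italics)
```

Each entry is already whitespace-normalized, so it can be passed on to further processing directly.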
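Comments
0 votes
clel
9/27/2023
I think similar logic might also be possible with BeautifulSoup. At first I thought iterdescendants() might behave differently in lxml. Unfortunately, this approach now fails when tags are nested deeper than in the example. I have added a more complex example to the question.
0 votes
PierXuY
9/27/2023
This probably requires recursing level by level.
0 votes
clel
9/27/2023
Exactly my thought. This may already be partly handled in the other answer.
0 votes
PierXuY
9/27/2023
Yes, it may be a bit cumbersome, but I think it is feasible by combining the current solutions. Good luck.
0 votes
PierXuY
9/27/2023
I have added a solution that handles nested structures recursively; it may work for you, for reference.
0 votes
pkExec
9/27/2023
#3
I would use pyquery and its text() method.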
from pyquery import PyQuery

doc = PyQuery("""\
<p>
    Test text with something in it
    Test text with something in it
    <i>and italic text</i> inside that text.
    Test text with something in it.
</p>
<p>
    Next paragraph with more text.
</p>
""")

print(doc.text())

# Just italics:
print([i.text() for i in doc.items('i')])
By default, the text() method squashes newlines, which is what you want.
If you want to keep the line breaks, pass the parameter squash_space=False.
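Comments
0 votes
clel
9/27/2023
I played around with it a bit. Unfortunately, I have not yet found a way to process the items inside the i elements and build modified text from them. It might work using the contents() method; not sure.
0 votes
pkExec
9/27/2023
If you found my answer helpful for your original question, you can accept it. You might consider asking a new question about what you tried with the italic items that did not work.
0 votes
clel
9/27/2023
But that was already part of my original question, and your answer does not cover that aspect yet.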