Q:

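I am trying to use BS4 to scrape the public filings of earnings releases. There is a keyword, "reconciliation(s)", so I tried to search for it with a regex. I found that all of the reconciliation keywords should sit inside some div tag, so I set up my search like this:

for text_tag in soup.find_all('div', text=re.compile('(reconciliation)|(reconciliations)', re.IGNORECASE), recursive=True):

But somehow there are cases where the returned result is empty. This is one of them: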

         <div style="clear:both;max-width:100%;position:relative;">
          <p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
           <font style="font-size:11pt;">
            PENSKE AUTOMOTIVE GROUP, INC.
           </font>
          </p>
          <p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
           <font style="font-size:11pt;">
            Consolidated Non-GAAP Reconciliations
           </font>
          </p>

The word "Reconciliations" is under the div tag.

I haven't tried anything else, because I don't know where to start fixing it...

python web-scraping beautifulsoup python-re

# imports
from bs4 import BeautifulSoup

# filter: keep only <div> tags whose combined text mentions the keyword
def contains_reconciliation(tag):
    return tag.name == 'div' and 'reconciliation' in tag.text.lower()

# search results (`soup` is the BeautifulSoup object parsed from the filing)
result_divs = soup.find_all(contains_reconciliation)

# do something with the results
for div in result_divs:
    print(div)

Based on the HTML you posted, the text you are searching for belongs to a `font` element rather than a `div`.

# imports
from bs4 import BeautifulSoup

# filter: keep only <font> tags whose text mentions the keyword
def contains_reconciliation(tag):
    return tag.name == 'font' and 'reconciliation' in tag.text.lower()

# search results (`soup` is the BeautifulSoup object parsed from the filing)
result_fonts = soup.find_all(contains_reconciliation)

# do something with the results
for font in result_fonts:
    # you can get the enclosing div if that's what you are looking for
    div = font.find_parent('div')
    print(div)
    print(font)
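
A likely reason the original regex call comes back empty: when `text=` (named `string=` in newer Beautiful Soup releases) is combined with a tag name, the pattern is tested against each tag's `.string`, and `.string` is `None` for a tag that has more than one child. A minimal sketch on a trimmed copy of the posted HTML (the `html_text` and `pattern` names are only for illustration):

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down version of the HTML from the question (illustrative only)
html_text = """
<div>
  <p><font>PENSKE AUTOMOTIVE GROUP, INC.</font></p>
  <p><font>Consolidated Non-GAAP Reconciliations</font></p>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"reconciliations?", re.IGNORECASE)

div = soup.find("div")
# The div has several children, so .string is None
# and a string filter has nothing to match against.
div_string = div.string

# Same filter, two tag names: the div is skipped, the font is found
div_hits = soup.find_all("div", string=pattern)
font_hits = soup.find_all("font", string=pattern)

print(div_string)
print(len(div_hits), len(font_hits))
```

Matching on `font` and then calling `find_parent('div')` keeps the regex while still ending up at the surrounding div.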

from bs4 import BeautifulSoup

html_text = """\
         <div style="clear:both;max-width:100%;position:relative;">
          <p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
           <font style="font-size:11pt;">
            PENSKE AUTOMOTIVE GROUP, INC.
           </font>
          </p>
          <p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
           <font style="font-size:11pt;">
            Consolidated Non-GAAP Reconciliations
           </font>
          </p>
         </div>"""

soup = BeautifulSoup(html_text, "html.parser")


for text_tag in soup.find_all(
    lambda tag: tag.name == "div" and "reconciliation" in tag.get_text().lower()
):
    print(text_tag)

Prints:

<div style="clear:both;max-width:100%;position:relative;">
<p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
<font style="font-size:11pt;">
            PENSKE AUTOMOTIVE GROUP, INC.
           </font>
</p>
<p style="font-family:'Times New Roman','Times','serif';font-size:10pt;line-height:1.19;text-align:center;margin:0pt;">
<font style="font-size:11pt;">
            Consolidated Non-GAAP Reconciliations
           </font>
</p>
</div>
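
Note that the `get_text()` check above is true for every ancestor `div` as well, so on a full filing the outer wrapper divs would probably match too. One possible alternative, sketched here on a hypothetical nested `html_text` (the `id` attributes are invented for the example), is to match the text nodes themselves and climb to the nearest enclosing `div` with `find_parent`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical nested markup, only for illustration
html_text = """
<div id="outer">
  <div id="inner">
    <p><font>Consolidated Non-GAAP Reconciliations</font></p>
  </div>
  <div id="other"><p><font>Balance Sheet</font></p></div>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"reconciliations?", re.IGNORECASE)

nearest_divs = []
# With no tag name given, find_all(string=...) returns the matching text nodes
for text_node in soup.find_all(string=pattern):
    div = text_node.find_parent("div")
    # Compare by identity so the same div is not collected twice
    if div is not None and not any(div is seen for seen in nearest_divs):
        nearest_divs.append(div)

for div in nearest_divs:
    print(div.get("id"))
```

On this sample only the inner div is collected, whereas the lambda filter would return both the outer and the inner one.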

Not able to position the html tag with bs4 and re
