Extract sentence from HTML using Python

Asked by: Los  Asked: 12/19/2022  Updated: 12/19/2022  Views: 70

Q:

I used Python (BeautifulSoup) to extract a component of interest from an HTML file. My code:

import pandas as pd
import numpy as np
from lxml import html
from html.parser import HTMLParser
from bs4 import BeautifulSoup


HTMLFile = open("/home/kospsych/Desktop/projects/dark_web/file", "r")

index = HTMLFile.read()
S = BeautifulSoup(index, 'lxml')

Tag = S.select_one('.inner')


print(Tag)

This prints the following result:

<div class="inner" id="msg_550811">Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)?<br/><br/>I'm regularly on Agora but I want to join the Abraxas club as well.<br/><br/>Mindful-Shaman</div>

and its type is:

<class 'bs4.element.Tag'>

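I would like to somehow remove the div tag and the br tags and get just a single string containing the sentences above. How can I do this efficiently?

python-3.x beautifulsoup html-parsing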

A:

2 upvotes Andrej Kesely 12/19/2022 #1

You can use the .text property or the .get_text() method:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """<div class="inner" id="msg_550811">Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)?<br/><br/>I'm regularly on Agora but I want to join the Abraxas club as well.<br/><br/>Mindful-Shaman</div>""",
    "html.parser",
)

Tag = soup.select_one(".inner")
print(Tag.get_text(strip=True, separator=" "))

Prints:

Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)? I'm regularly on Agora but I want to join the Abraxas club as well. Mindful-Shaman
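
To see why the separator and strip arguments matter here, the following minimal sketch compares the options on a shortened, made-up snippet (the HTML string and the id msg_1 are placeholders, not the asker's real file). It assumes bs4 is installed and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the <div class="inner"> in the question.
html = (
    '<div class="inner" id="msg_1">First sentence.<br/><br/>'
    "Second sentence.<br/><br/>Signature</div>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.select_one(".inner")

# .text is equivalent to get_text() with no arguments: the text pieces are
# concatenated directly, so the <br/> boundaries vanish without any spacing.
plain = tag.text

# separator=" " puts a space between the pieces; strip=True trims whitespace
# from each piece and skips pieces that are empty after trimming.
joined = tag.get_text(separator=" ", strip=True)

# stripped_strings yields the individual pieces, which is handy if each
# sentence should be kept as a separate item.
parts = list(tag.stripped_strings)

print(repr(plain))
print(repr(joined))
print(parts)
```

If one string is all that is needed, get_text(separator=" ", strip=True) is probably the simplest choice; stripped_strings is worth considering when the pieces should stay separate.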