How to parse HTML elements?

Asked by: JJH  Asked: 10/9/2022  Last edited by: HedgeHog  Updated: 10/9/2022  Views: 1015

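Q:

I am trying to extract the items listed under "Categories" from a list of GitHub web pages.

In the sample code I can identify the block of text that needs to be parsed, but when I parse the text, the output looks like this: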

['\n\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n ', '\n\n  Continuous integration\n\n\n  Security\n\n']

The output I am looking for is:

[Continuous integration, Security]

How can I change the get_text() line of code to get the final result?

from bs4 import BeautifulSoup
import requests

websites = ['https://github.com/marketplace/actions/yq-portable-yaml-processor','https://github.com/marketplace/actions/TruffleHog-OSS']

for links in websites:
    URL = requests.get(links)
    detailsoup = BeautifulSoup(URL.content, "html.parser")

    categories = detailsoup.findAll('div', {'class': 'ml-n1 clearfix'})
    print(categories)
    categoriesList = [categories.get_text() for categories in categories]
    print(categoriesList)

    # keep only the element at index 1 & maintain type as list
    categoriesList = categoriesList[1:2]
    if not categoriesList:
        categoriesList.insert(0, 'Error')
Tags: python-3.x, web-scraping, beautifulsoup, html-parsing

Comments

0 votes  furas  10/10/2022
get_text() has strip=True, but you can always use a for-loop with .strip() (and other functions) to modify the list.
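
The for-loop approach from this comment can be sketched in plain Python. This is only an illustration: the `raw` list is a shortened stand-in for the output shown in the question, and the names `raw` and `cleaned` are made up for the example.

```python
# Shortened stand-in for the list that get_text() produced in the question.
raw = ['\n\n\n \n\n ', '\n\n  Continuous integration\n\n\n  Security\n\n']

cleaned = []
for block in raw:
    # Split each block into lines, strip the whitespace, skip empty lines.
    for line in block.splitlines():
        line = line.strip()
        if line:
            cleaned.append(line)

print(cleaned)  # ['Continuous integration', 'Security']
```

This cleans up the text after the fact, so it works even when the selected element is broader than needed.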

A:

1 vote  HedgeHog  10/9/2022  #1

Simply add the parameter strip=True:

categoriesList = [category.get_text(strip=True) for category in categories]
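
One thing to watch for: when strip=True is applied to a container that holds several items, the stripped pieces are joined with an empty separator. A minimal sketch, assuming a simplified, hypothetical version of the page markup (the real GitHub HTML may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure is an assumption.
html = """
<div class="ml-n1 clearfix">
  <a class="topic-tag" href="#">
    Continuous integration
  </a>
  <a class="topic-tag" href="#">
    Security
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div")

# strip=True strips every text node, then joins them with "" by default.
merged = container.get_text(strip=True)
# Passing a separator keeps the items distinguishable.
separated = container.get_text("|", strip=True)
# stripped_strings yields each non-empty, stripped text node on its own.
pieces = list(container.stripped_strings)

print(merged)     # Continuous integrationSecurity
print(separated)  # Continuous integration|Security
print(pieces)     # ['Continuous integration', 'Security']
```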

Also, try to select your elements more specifically:

categories = detailsoup.find_all('a', {'class': 'topic-tag'})

In newer code, avoid the old findAll() syntax and use find_all() or select() with CSS selectors instead. For more information, take a minute to check the documentation.
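
For comparison, the CSS-selector variant might look like the following sketch. The inline HTML is again a hypothetical, simplified stand-in for the page, and the selector `div.clearfix a.topic-tag` is an assumption based on the class names used in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure is an assumption.
html = """
<div class="ml-n1 clearfix">
  <a class="topic-tag" href="#"> Continuous integration </a>
  <a class="topic-tag" href="#"> Security </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector.
tags = soup.select("div.clearfix a.topic-tag")
names = [tag.get_text(strip=True) for tag in tags]

# select_one() returns only the first match (or None if nothing matches).
first = soup.select_one("a.topic-tag")

print(names)                       # ['Continuous integration', 'Security']
print(first.get_text(strip=True))  # Continuous integration
```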

from bs4 import BeautifulSoup
import requests

websites = ['https://github.com/marketplace/actions/yq-portable-yaml-processor','https://github.com/marketplace/actions/TruffleHog-OSS']

for links in websites:
    URL = requests.get(links)
    detailsoup = BeautifulSoup(URL.content, "html.parser")

    categories = detailsoup.find_all('a', {'class': 'topic-tag'})
    categoriesList = [category.get_text(strip=True) for category in categories]
    print(categoriesList)
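
If it helps, the parsing step can be pulled into a small function so it can be tried on an HTML string without hitting the network. This is only a sketch: `extract_categories` is a hypothetical helper name, and the 'Error' fallback simply mirrors the one in the question.

```python
from bs4 import BeautifulSoup


def extract_categories(html):
    """Return the text of every a.topic-tag element, or ['Error'] if none are found."""
    soup = BeautifulSoup(html, "html.parser")
    names = [tag.get_text(strip=True)
             for tag in soup.find_all('a', {'class': 'topic-tag'})]
    # Same fallback as in the question: never return an empty list.
    return names or ['Error']


print(extract_categories('<a class="topic-tag"> Security </a>'))  # ['Security']
print(extract_categories('<p>no tags here</p>'))                  # ['Error']
```

Inside the loop above, the function could then be called as extract_categories(URL.content).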