How to Extract Content Inside Each <a href> Tag?

Question:

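I am working on a project that involves extracting some data from a website. Specifically, I want to extract the name of each category along with its description.

I have considered using a Python web scraping library such as BeautifulSoup, but I am not sure how to follow each category link to get the information I need.

The website lists a number of category names, and each one has its own page containing parameters and a description. I don't know how to programmatically "click" each link in order to scrape the data.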

import requests
from bs4 import BeautifulSoup
URL = "https://docs.derivative.ca/Category:CHOPs"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-pages")
print(results.prettify())
chop_elements = results.find_all("div", class_="mw-content-ltr")
for chop_element in chop_elements:
    print(chop_element, end="\n"*2)

<li><a href="/Analyze_CHOP" title="Analyze CHOP">Analyze CHOP</a></li>
<li><a href="/Angle_CHOP" title="Angle CHOP">Angle CHOP</a></li>
<li><a href="/Attribute_CHOP" title="Attribute CHOP">Attribute CHOP</a></li>
<li><a href="/Audio_Band_EQ_CHOP" title="Audio Band EQ CHOP">Audio Band EQ CHOP</a></li>
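
Each of these list items already carries the two pieces a scraper needs: the link text and the `href`. A minimal sketch of reading them, assuming the markup looks like the four lines above. `SAMPLE_HTML`, `BASE_URL`, and `extract_links` are illustrative names, not part of the original post, and a real run would parse `page.content` instead of the inline sample.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the output above; a real run would use page.content.
SAMPLE_HTML = """
<ul>
<li><a href="/Analyze_CHOP" title="Analyze CHOP">Analyze CHOP</a></li>
<li><a href="/Angle_CHOP" title="Angle CHOP">Angle CHOP</a></li>
<li><a href="/Attribute_CHOP" title="Attribute CHOP">Attribute CHOP</a></li>
<li><a href="/Audio_Band_EQ_CHOP" title="Audio Band EQ CHOP">Audio Band EQ CHOP</a></li>
</ul>
"""

# Assumed site root, taken from the URL in the question.
BASE_URL = "https://docs.derivative.ca"


def extract_links(html, base_url=BASE_URL):
    """Return (link text, absolute URL) pairs for every <a href> directly inside an <li>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.select("li > a[href]")
    ]


if __name__ == "__main__":
    for name, url in extract_links(SAMPLE_HTML):
        print(name, url)
```

The relative `href` values are joined with the site root, so each resulting URL can be passed straight to `requests.get` to "click" the link.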

Website: https://docs.derivative.ca/Category:CHOPs

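What is the best way to follow each category link and extract the data I need? I am not entirely sure what I am doing, or whether I have inspected the HTML structure correctly, so I am looking for guidance on how to approach this.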

python web-scraping beautifulsoup html-parsing

Answer:

import requests
from bs4 import BeautifulSoup

rootUrl = 'https://docs.derivative.ca'
req = requests.get(rootUrl + '/Category:CHOPs')
req.raise_for_status()  # stop here if the category page could not be fetched

## getting the links
soup = BeautifulSoup(req.content, 'html.parser')
# keep only the groups that contain an <h3> followed by a sibling <ul>
groups = soup.select('div.mw-category-group:has(h3~ul)')
chops = [{
    'group': g.h3.get_text(strip=True),    # heading of the group
    'category': a.get_text(strip=True),    # link text
    'link': rootUrl + a['href']            # absolute URL built from the relative href
} for g in groups for a in g.select('li>a[href]')]

cLen = len(chops)
print('found', cLen, 'categories with links')

## getting the descriptions
for i, c in enumerate(chops):
    # print(f'scraping {i+1} of {cLen}: {repr(c["category"])} from {c["link"]}')
    cReq = requests.get(c['link'])
    try:
        cReq.raise_for_status()
    except requests.HTTPError:
        continue  # skip pages that return an HTTP error status
    cSoup = BeautifulSoup(cReq.content, 'html.parser')

    # first <p> sibling after the <h2> that contains <span id="Summary">
    summary_p1 = cSoup.select_one('h2:has(span#Summary)~p')
    if summary_p1:
        chops[i]['description'] = summary_p1.get_text(strip=True)
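
The description step relies on the selector `h2:has(span#Summary)~p`, which can be tried offline before hitting the site. A minimal sketch under stated assumptions: `SAMPLE_PAGE` is a hypothetical fragment whose layout is inferred from the selector rather than copied from a real CHOP page, and `extract_summary` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with placeholder text; the real page layout is
# assumed from the selector used in the answer, not verified here.
SAMPLE_PAGE = """
<div>
<p>Intro text before the heading.</p>
<h2><span id="Summary">Summary</span></h2>
<p>Placeholder summary text.</p>
<p>A second paragraph that is not picked up.</p>
<h2><span id="Parameters">Parameters</span></h2>
<p>Parameter text.</p>
</div>
"""


def extract_summary(html):
    """Return the text of the first <p> sibling after the Summary heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns the first match in document order, so only the
    # paragraph directly following the Summary <h2> is used.
    first_paragraph = soup.select_one("h2:has(span#Summary) ~ p")
    return first_paragraph.get_text(strip=True) if first_paragraph else None


if __name__ == "__main__":
    print(extract_summary(SAMPLE_PAGE))
```

Pages without a matching heading yield `None` here; in the loop above the equivalent case leaves the `'description'` key unset, so reading it later with `c.get('description')` avoids a `KeyError`.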
