
Retrieving all hrefs in anchor tags

Asked by: Paul Corcoran · Asked: 3/18/2023 · Updated: 3/18/2023 · Views: 27

Q:

import warnings
import numpy as np
from datetime import datetime
import json
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

url = "https://understat.com/league/EPL/2022"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)

Unfortunately, this code finds no results. The desired results are the hrefs in this part of the web page:

<a class="match-info" data-isresult="true" href="match/18265">

Any ideas?

python beautifulsoup

Comments

0 · sytech · 3/18/2023
Are you sure this is actually in the response? print(response.text) may show you otherwise. Keep in mind that requests.get does not execute JavaScript code.

A:

1 · Andrej Kesely · 3/18/2023 · #1

The data you see on the page is encoded inside a <script> element, so beautifulsoup doesn't see it. To decode it and load it into a pandas DataFrame you can use this example:

import re
import json
import requests
import pandas as pd


url = "https://understat.com/league/EPL/2022"
html_doc = requests.get(url).text

# Extract the JSON string that the page assigns to datesData
data = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
# Decode \xNN hex escapes back into the characters they encode
data = re.sub(r'\\x([\dA-F]{2})', lambda g: chr(int(g.group(1), 16)), data)
data = json.loads(data)

all_data = []
for d in data:
    all_data.append({
        'Team 1': d['h']['title'],
        'Team 2': d['a']['title'],
        'Goals': f'{d["goals"]["h"]} - {d["goals"]["a"]}',
        'Date': d['datetime'],
        'xG': [d['xG']['h'], d['xG']['a']],
        'forecast': list(d.get('forecast', {}).values())
    })

df = pd.DataFrame(all_data)
print(df)

Prints:

                      Team 1                   Team 2        Goals                 Date                     xG                  forecast
0             Crystal Palace                  Arsenal        0 - 2  2022-08-05 19:00:00     [1.20637, 1.43601]  [0.2864, 0.2912, 0.4224]
1                     Fulham                Liverpool        2 - 2  2022-08-06 11:30:00     [1.26822, 2.34111]  [0.1225, 0.2133, 0.6642]
2                Bournemouth              Aston Villa        2 - 0  2022-08-06 14:00:00   [0.588341, 0.488895]   [0.3213, 0.4397, 0.239]
3                      Leeds  Wolverhampton Wanderers        2 - 1  2022-08-06 14:00:00     [0.88917, 1.10119]  [0.2798, 0.3166, 0.4036]
4           Newcastle United        Nottingham Forest        2 - 0  2022-08-06 14:00:00     [1.8591, 0.235825]  [0.8023, 0.1695, 0.0282]
5                  Tottenham              Southampton        4 - 1  2022-08-06 14:00:00     [1.6172, 0.386546]  [0.7002, 0.2209, 0.0789]
6                    Everton                  Chelsea        0 - 1  2022-08-06 16:30:00    [0.541983, 1.92315]    [0.06, 0.1717, 0.7683]
7          Manchester United                 Brighton        1 - 2  2022-08-07 13:00:00      [1.42103, 1.7289]      [0.281, 0.269, 0.45]
8                  Leicester                Brentford        2 - 2  2022-08-07 13:00:00   [0.455695, 0.931067]  [0.1615, 0.3491, 0.4894]

...and so on.
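The hex-unescaping step above can be illustrated on a tiny synthetic payload (the escaped string below is made up for illustration; Understat's real datesData payload is far larger):

```python
import re
import json

# Understat stores its JSON with \xNN hex escapes (e.g. \x22 for '"').
# This synthetic string stands in for the real datesData payload.
escaped = r"\x7B\x22goals\x22:2\x7D"

# Replace each \xNN escape with the character it encodes,
# exactly as the re.sub call in the answer does.
decoded = re.sub(r'\\x([\dA-Fa-f]{2})',
                 lambda m: chr(int(m.group(1), 16)),
                 escaped)

print(decoded)  # {"goals":2}
print(json.loads(decoded))  # {'goals': 2}
```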

Comments

1 · Paul Corcoran · 3/18/2023
Fantastic answer and a great explanation.

1 · sytech · 3/18/2023 · #2

The problem is that those anchors on the page are generated by JavaScript. They are not part of the response retrieved by requests.get.

You can use a browser tool, selenium, to fetch the page and render its full contents, as an alternative to using requests. Because selenium controls a browser, it will execute the JS that renders the HTML you expect.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = ...
driver.get(url)

rendered_html = driver.page_source

soup = BeautifulSoup(rendered_html, "html.parser")

for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)

Alternatively, you can inspect the content of the page returned by requests.get (as opposed to the rendered HTML) and base your parsing on what is actually there. If the JS issues additional server requests to render the page, you would have to account for those as well.
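That raw-response approach can be sketched as follows, reusing the regex from the first answer. The HTML below is a synthetic stand-in for what requests.get(url).text would return, and the "id" field name is an assumption for illustration:

```python
import re
import json

# Synthetic stand-in for the raw page source; the real page embeds a much
# larger JSON.parse('...') payload, and the "id" field here is assumed.
html_doc = """
<script>
var datesData = JSON.parse('[{"id":"18265","isResult":true}]');
</script>
"""

# Pull the embedded JSON out of the <script> tag instead of running a browser.
raw = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
matches = json.loads(raw)

for m in matches:
    print(f"match/{m['id']}")  # the href each rendered anchor would carry
```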