Asked by: Paul Corcoran Asked: 3/18/2023 Updated: 3/18/2023 Views: 27
Retrieving all hrefs in anchor tags
Q:
import warnings
import numpy as np
from datetime import datetime
import json
import requests  # needed for requests.get below
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

url = "https://understat.com/league/EPL/2022"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Look for the match anchors in the HTML returned by the server
for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)
Unfortunately, this code finds no results. The desired result is the hrefs in this part of the web page:
<a class="match-info" data-isresult="true" href="match/18265">
Any ideas?
A:
1 upvote
Andrej Kesely
3/18/2023
#1
The data you see on the page is encoded inside a <script> element, so beautifulsoup doesn't see it. To decode it and put it into a pandas DataFrame, you can use the following example:
import re
import json
import requests
import pandas as pd

url = "https://understat.com/league/EPL/2022"
html_doc = requests.get(url).text

# The match data is embedded in a <script> element as datesData = JSON.parse('...')
data = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc).group(1)
# The payload is written as \xNN escapes; turn each one back into its character
data = re.sub(r'\\x([\dA-F]{2})', lambda g: chr(int(g.group(1), 16)), data)
data = json.loads(data)

# Build one row per match
all_data = []
for d in data:
    all_data.append({
        'Team 1': d['h']['title'],
        'Team 2': d['a']['title'],
        'Goals': f'{d["goals"]["h"]} - {d["goals"]["a"]}',
        'Date': d['datetime'],
        'xG': [d['xG']['h'], d['xG']['a']],
        'forecast': list(d.get('forecast', {}).values())
    })

df = pd.DataFrame(all_data)
print(df)
Prints:
   Team 1             Team 2                   Goals  Date                 xG                    forecast
0  Crystal Palace     Arsenal                  0 - 2  2022-08-05 19:00:00  [1.20637, 1.43601]    [0.2864, 0.2912, 0.4224]
1  Fulham             Liverpool                2 - 2  2022-08-06 11:30:00  [1.26822, 2.34111]    [0.1225, 0.2133, 0.6642]
2  Bournemouth        Aston Villa              2 - 0  2022-08-06 14:00:00  [0.588341, 0.488895]  [0.3213, 0.4397, 0.239]
3  Leeds              Wolverhampton Wanderers  2 - 1  2022-08-06 14:00:00  [0.88917, 1.10119]    [0.2798, 0.3166, 0.4036]
4  Newcastle United   Nottingham Forest        2 - 0  2022-08-06 14:00:00  [1.8591, 0.235825]    [0.8023, 0.1695, 0.0282]
5  Tottenham          Southampton              4 - 1  2022-08-06 14:00:00  [1.6172, 0.386546]    [0.7002, 0.2209, 0.0789]
6  Everton            Chelsea                  0 - 1  2022-08-06 16:30:00  [0.541983, 1.92315]   [0.06, 0.1717, 0.7683]
7  Manchester United  Brighton                 1 - 2  2022-08-07 13:00:00  [1.42103, 1.7289]     [0.281, 0.269, 0.45]
8  Leicester          Brentford                2 - 2  2022-08-07 13:00:00  [0.455695, 0.931067]  [0.1615, 0.3491, 0.4894]
...and so on.
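Since the question asked for the match hrefs rather than a DataFrame, the same decoding steps can be pointed at that goal. The following is only a minimal sketch, run on a made-up inline HTML string so it works offline. The helper names are hypothetical, and the "id" key and the "match/<id>" href shape are assumptions inferred from the anchor shown in the question, not something confirmed against the live page.

```python
import json
import re


def decode_hex_escapes(text):
    """Turn literal \\xNN sequences into the characters they encode."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


def extract_dates_data(html_doc):
    """Pull the JSON passed to JSON.parse('...') for datesData out of the page."""
    match = re.search(r"datesData\s*=\s*JSON\.parse\('(.*?)'\)", html_doc)
    if match is None:
        raise ValueError("datesData not found in page source")
    return json.loads(decode_hex_escapes(match.group(1)))


def match_hrefs(matches):
    # Assumption: each match dict has an "id" key and hrefs look like "match/<id>".
    return [f"match/{m['id']}" for m in matches]


# Made-up sample standing in for the real page source.
sample_html = (
    "<script>var datesData = JSON.parse('"
    r"\x5B\x7B\x22id\x22\x3A\x2218265\x22\x7D\x5D"
    "');</script>"
)

hrefs = match_hrefs(extract_dates_data(sample_html))
print(hrefs)
```

With the real page, html_doc from the answer's code would take the place of sample_html, provided the embedded data really carries such an id field.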
Comments
1 upvote
Paul Corcoran
3/18/2023
Fantastic answer and a great explanation.
1 upvote
sytech
3/18/2023
#2
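The problem is that those anchors on the page are generated by JavaScript. They are not part of the response retrieved by requests.get.
You can use a browser tool like selenium to get the page and render its full contents, as an alternative to using requests. Because selenium controls a browser, it will execute the JS that renders the HTML you expect.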
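from bs4 import BeautifulSoup  # needed to parse the rendered HTML
from selenium import webdriver
import time

driver = webdriver.Chrome()
url = ...
driver.get(url)

# HTML as the browser sees it, after it has executed the page's JavaScript
rendered_html = driver.page_source
driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")
for link in soup.find_all("a", class_="match-info"):
    href = link.get("href")
    print(href)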
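Alternatively, you can inspect the contents of the page source returned by requests.get (rather than the rendered HTML) and rework your parsing according to that content. If the JS makes additional server requests to render the page, you'll have to account for that too.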
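One rough way to do that inspection is to check whether the anchors exist in the raw source at all and, if not, list which variables the page assigns via JSON.parse('...'), the pattern the first answer relies on. This is a sketch on a made-up HTML string. Both helper names are hypothetical, and a real page may embed its data differently.

```python
import re


def has_static_anchor(page_source, css_class):
    """Rough check: is there an <a> tag with this class in the raw source?"""
    pattern = (
        r"<a\b[^>]*\bclass=\"[^\"]*\b" + re.escape(css_class) + r"\b[^\"]*\""
    )
    return re.search(pattern, page_source) is not None


def embedded_json_vars(page_source):
    """Return names of variables assigned as name = JSON.parse('...')."""
    return re.findall(r"(\w+)\s*=\s*JSON\.parse\('", page_source)


# Made-up sample standing in for response.text.
raw_source = (
    '<div id="app"></div>'
    "<script>var datesData = JSON.parse('\\x5B\\x5D');"
    " var teamsData = JSON.parse('\\x7B\\x7D');</script>"
)

print(has_static_anchor(raw_source, "match-info"))
print(embedded_json_vars(raw_source))
```

If the anchor check comes back false but a promising variable name shows up, the data is likely embedded in a script and can be decoded without a browser.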