Asked by: Paul Corcoran Asked: 3/8/2023 Last edited by: GeorgePaul Corcoran Updated: 3/8/2023 Views: 41
Extracting hrefs or specific tag from a page
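Q:
I have tried several approaches, but this site is proving difficult to scrape with bs4.
I'm trying to extract the href values shown in the screenshot for one of the matches. The idea is to pull all the href values on the page into a list. Nothing is being returned; the ideal result is a list containing all the hrefs, e.g. //www.premierleague.com/match/74911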
import warnings
import numpy as np
from datetime import datetime
import requests
from bs4 import BeautifulSoup
warnings.filterwarnings('ignore')
# set up an empty list to store results; errors collects any matches that don't scrape.
dataframe = []
errors = []
url = "https://www.premierleague.com/results"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
matches = {}
soup.find_all("div", {"class": "competitionContainer"})
A:
1 upvote
Andrej Kesely
3/8/2023
#1
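The data you see on the page is loaded from an external source via JavaScript, so it is not in the HTML that requests downloads. (You can open Web Developer Tools -> Network tab in your browser and start scrolling down the page; you should see the Ajax requests there.)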
import json
import requests

api_url = "https://footballapi.pulselive.com/football/fixtures"

params = {
    "comps": "1",
    "compSeasons": "489",
    "teams": "127,1,2,130,131,4,6,7,34,9,26,10,11,12,23,15,20,21,25,38",
    "page": "1",
    "pageSize": "40",
    "sort": "desc",
    "statuses": "C",
    "altIds": "true",
}

headers = {
    'Origin': 'https://www.premierleague.com',
}

page = 0
while True:
    # overrides the "page" value set in params above
    params['page'] = page
    data = requests.get(api_url, params=params, headers=headers).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for c in data['content']:
        team1, team2 = c['teams'][0]['team']['name'], c['teams'][1]['team']['name']
        print(f'{team1:<30} {team2:<30} https://www.premierleague.com/match/{int(c["id"])}')

    # assuming the page index is zero-based (the loop starts at 0),
    # the last page is numPages - 1, so stop once it has been fetched
    if page + 1 >= data['pageInfo']['numPages']:
        break

    page += 1
Prints:
...
Chelsea                        Tottenham Hotspur              https://www.premierleague.com/match/74925
Nottingham Forest              West Ham United                https://www.premierleague.com/match/74928
Brentford                      Manchester United              https://www.premierleague.com/match/74923
Arsenal                        Leicester City                 https://www.premierleague.com/match/74921
Brighton & Hove Albion         Newcastle United               https://www.premierleague.com/match/74924
Manchester City                Bournemouth                    https://www.premierleague.com/match/74927
Southampton                    Leeds United                   https://www.premierleague.com/match/74929
Wolverhampton Wanderers        Fulham                         https://www.premierleague.com/match/74930
Aston Villa                    Everton                        https://www.premierleague.com/match/74922
West Ham United                Manchester City                https://www.premierleague.com/match/74920
Leicester City                 Brentford                      https://www.premierleague.com/match/74916
Manchester United              Brighton & Hove Albion         https://www.premierleague.com/match/74919
Everton                        Chelsea                        https://www.premierleague.com/match/74913
Bournemouth                    Aston Villa                    https://www.premierleague.com/match/74912
Leeds United                   Wolverhampton Wanderers        https://www.premierleague.com/match/74915
Newcastle United               Nottingham Forest              https://www.premierleague.com/match/74917
Tottenham Hotspur              Southampton                    https://www.premierleague.com/match/74918
Fulham                         Liverpool                      https://www.premierleague.com/match/74914
Crystal Palace                 Arsenal                        https://www.premierleague.com/match/74911
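Since the question asks for a list of links rather than printed lines, the same loop could collect the URLs instead. The following is only a minimal sketch: `match_urls` and `sample` are hypothetical names introduced here, and the sample payload is a hand-written stand-in that contains just the `content` and `id` keys the loop above reads, not a real API response.

```python
MATCH_URL = "https://www.premierleague.com/match/{}"


def match_urls(data: dict) -> list:
    """Build match-page URLs from one page of the fixtures JSON."""
    # Each entry in "content" carries an "id"; int() mirrors the cast
    # used in the print statement above.
    return [MATCH_URL.format(int(c["id"])) for c in data.get("content", [])]


# Hypothetical, hand-written payload (assumption: only the keys used above)
sample = {
    "pageInfo": {"numPages": 1},
    "content": [{"id": 74911.0}, {"id": 74912.0}],
}

print(match_urls(sample))
```

Inside the `while` loop, something like `all_urls.extend(match_urls(data))` (with `all_urls = []` defined before the loop) would then take the place of the `print` call.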