Asked by: Paul Corcoran Asked: 6/20/2023 Updated: 6/20/2023 Views: 42
Retrieving hrefs from the MLS page
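Q:
I'm currently trying to retrieve the relevant match links (hrefs) from this page. I can't seem to find them straight away using Selenium or BeautifulSoup. I know they may come from an API, but I can't figure out how to find them under the section with class mls-l-module mls-l-module--match-list.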
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from time import sleep, time
import pandas as pd
import warnings
import numpy as np
from datetime import datetime
import json
from bs4 import BeautifulSoup
warnings.filterwarnings('ignore')
base_url = 'https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20'
# create an empty list to store urls.
urls = []
option = Options()
option.headless = False
driver = webdriver.Chrome("##########",options=option)
driver.get(base_url)
# click the cookie pop up
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div[2]/div/div[1]/div/div[2]/div/button[2]'))).click()
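The output should be a list of the URLs on this page. I will then loop to the next page and collect all the href links for the matches. Perhaps using Selenium to render the page for BeautifulSoup is the better option.
Answers:
1 upvote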
Andrej Kesely
6/20/2023
#1
As mentioned in the comments, you can bypass selenium entirely and use their Ajax API directly:
import requests
params = {
"culture": "en-us",
"dateFrom": "2023-02-19",
"dateTo": "2023-02-27",
"competition": "98",
"matchType": "Regular",
"excludeSecondaryTeams": "true",
}
api_url = 'https://sportapi.mlssoccer.com/api/matches'
base_url = 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/'
data = requests.get(api_url, params=params).json()
for m in data:
    h, a = m['home']['fullName'], m['away']['fullName']
    print(f'{h:<30} {a:<30} {base_url + m["slug"]}/')
Prints:
Nashville SC                   New York City Football Club    https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/nshvsnyc-02-25-2023/
Atlanta United                 San Jose Earthquakes           https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atlvssj-02-25-2023/
Charlotte FC                   New England Revolution         https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cltvsne-02-25-2023/
FC Cincinnati                  Houston Dynamo FC              https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cinvshou-02-25-2023/
D.C. United                    Toronto FC                     https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dcvstor-02-25-2023/
Inter Miami CF                 CF Montréal                    https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/miavsmtl-02-25-2023/
Orlando City                   New York Red Bulls             https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/orlvsrbny-02-25-2023/
Philadelphia Union             Columbus Crew                  https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/phivsclb-02-25-2023/
Austin FC                      St. Louis CITY SC              https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atxvsstl-02-25-2023/
FC Dallas                      Minnesota United               https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dalvsmin-02-25-2023/
Vancouver Whitecaps FC         Real Salt Lake                 https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/vanvsrsl-02-25-2023/
Seattle Sounders FC            Colorado Rapids                https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/seavscol-02-26-2023/
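Since the question mentions looping on to the next page, the same request could be repeated over consecutive date windows. What follows is only a minimal sketch under stated assumptions: `date_windows`, `build_params`, `match_url` and `collect_match_urls` are hypothetical helper names, and it assumes the endpoint accepts any `dateFrom`/`dateTo` pair and keeps returning a `slug` field as in the answer above.

```python
from datetime import date, timedelta

API_URL = "https://sportapi.mlssoccer.com/api/matches"
BASE_URL = "https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/"


def date_windows(start, end, days=7):
    """Yield (date_from, date_to) pairs that cover start..end inclusive."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def build_params(date_from, date_to):
    """Query parameters copied from the answer, with the dates swapped in."""
    return {
        "culture": "en-us",
        "dateFrom": date_from.isoformat(),
        "dateTo": date_to.isoformat(),
        "competition": "98",
        "matchType": "Regular",
        "excludeSecondaryTeams": "true",
    }


def match_url(slug):
    """Build the public match page URL from a slug returned by the API."""
    return f"{BASE_URL}{slug}/"


def collect_match_urls(start, end):
    """Fetch every window and return the match URLs (needs network access)."""
    import requests  # imported here so the pure helpers above work without it

    urls = []
    for date_from, date_to in date_windows(start, end):
        response = requests.get(
            API_URL, params=build_params(date_from, date_to), timeout=30
        )
        response.raise_for_status()
        # Assumption: each item carries a "slug" key, as in the answer above.
        urls.extend(match_url(m["slug"]) for m in response.json())
    return urls
```

Calling `collect_match_urls(date(2023, 2, 20), date(2023, 3, 5))` would issue two requests, one per seven-day window. Whether the API tolerates longer windows in a single call is not confirmed here.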
Comments
0 upvotes
Paul Corcoran
6/20/2023
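That code does it. Out of curiosity, when I scroll down the page with the dev tools Network tab open, I can't find sportsapi anywhere in there. How did you find it?
0 upvotes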
Andrej Kesely
6/20/2023
@PaulCorcoran Try clicking the drop-down menus such as "All Seasons", "Regular Season", etc., and watch the Network tab in the developer tools.
1 upvote
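Once the request shows up in the Network tab, its URL can be copied and split into the endpoint and the `params` dictionary used in the answer. A small sketch using only the standard library: `split_api_url` is a hypothetical helper, and the query string below is reconstructed from the answer's parameters for illustration, not copied from the browser.

```python
from urllib.parse import parse_qsl, urlsplit


def split_api_url(url):
    """Split a copied request URL into (endpoint, params)."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = dict(parse_qsl(parts.query))
    return endpoint, params


# Assumed example URL, rebuilt from the parameters shown in the answer.
copied = (
    "https://sportapi.mlssoccer.com/api/matches"
    "?culture=en-us&dateFrom=2023-02-19&dateTo=2023-02-27"
    "&competition=98&matchType=Regular&excludeSecondaryTeams=true"
)

endpoint, params = split_api_url(copied)
print(endpoint)
print(params)
```

The resulting pair can be passed straight to `requests.get(endpoint, params=params)`.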
Paul Corcoran
6/20/2023
Got it, thanks for the guidance.
1 upvote
undetected Selenium
6/20/2023
#2
To print the links, you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20")
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.mls-c-match-list__match a")))])
Using XPATH:
driver.get("https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20")
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'mls-c-match-list__match')]//a")))])
Console output:
['https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/nshvsnyc-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atlvssj-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cltvsne-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cinvshou-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dcvstor-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/miavsmtl-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/orlvsrbny-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/phivsclb-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atxvsstl-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dalvsmin-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/vanvsrsl-02-25-2023']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
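If the rendered HTML is preferred over live element lookups, the string from `driver.page_source` can be parsed after the waits above have finished. The sketch below uses the standard library's `html.parser` in place of BeautifulSoup so that it has no extra dependencies. `MatchLinkParser` and `extract_match_links` are hypothetical names, and filtering on `/matches/` in the href is an assumption based on the URLs in the console output, not a documented rule of the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MatchLinkParser(HTMLParser):
    """Collect unique anchor hrefs that look like match pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: match pages contain "/matches/" in their path.
        if href and "/matches/" in href:
            url = urljoin(self.base_url, href)  # resolve relative hrefs
            if url not in self.links:  # keep first occurrence only
                self.links.append(url)


def extract_match_links(html, base_url="https://www.mlssoccer.com"):
    """Return match URLs found in an HTML string, in document order."""
    parser = MatchLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

With Selenium, this would be called as `extract_match_links(driver.page_source)` once the match list is visible.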
Comments
1 upvote
Paul Corcoran
6/20/2023
Thank you
0 upvotes
undetected Selenium
6/20/2023
@PaulCorcoran Perhaps using selenium to render the page for soup is a better option
Comments
https://sportapi.mlssoccer.com/api/matches?...