Retrieving hrefs from the MLS page

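Asked by: Paul Corcoran  Asked: 6/20/2023  Updated: 6/20/2023  Views: 42

Question:

I am currently trying to retrieve the match links (hrefs) from this page, but I can't seem to locate them with Selenium or BeautifulSoup. I know they probably come from an API, but I can't figure out how to find them under the section with the class mls-l-module mls-l-module--match-list.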

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from time import sleep, time
import pandas as pd
import warnings
import numpy as np
from datetime import datetime
import json
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

base_url = 'https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20'

# create an empty list to store urls.
urls = []

option = Options()
option.headless = False
driver = webdriver.Chrome("##########",options=option)
driver.get(base_url)

# click the cookie pop up
WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div[2]/div/div[1]/div/div[2]/div/button[2]'))).click()

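The output should be a list of the match URLs on this page. I will then loop to the next page and collect all of the match href links there as well. Perhaps using Selenium to render the page for BeautifulSoup is the better option.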

python selenium-webdriver web-scraping beautifulsoup

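Comments

1 upvote sudden_appearance 6/20/2023
Instead of using Selenium, you can use the XHR request to the API. Just inspect the requests in the DevTools Network tab: https://sportapi.mlssoccer.com/api/matches?...

Answers:

1 upvote Andrej Kesely 6/20/2023 #1

As stated in the comments, you can bypass selenium completely and use their Ajax API directly: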

import requests

params = {
    "culture": "en-us",
    "dateFrom": "2023-02-19",
    "dateTo": "2023-02-27",
    "competition": "98",
    "matchType": "Regular",
    "excludeSecondaryTeams": "true",
}

api_url = 'https://sportapi.mlssoccer.com/api/matches'
base_url = 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/'

data = requests.get(api_url, params=params).json()

for m in data:
    h, a = m['home']['fullName'], m['away']['fullName']
    print(f'{h:<30} {a:<30} {base_url + m["slug"]}/')

Prints:

Nashville SC                   New York City Football Club    https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/nshvsnyc-02-25-2023/
Atlanta United                 San Jose Earthquakes           https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atlvssj-02-25-2023/
Charlotte FC                   New England Revolution         https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cltvsne-02-25-2023/
FC Cincinnati                  Houston Dynamo FC              https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cinvshou-02-25-2023/
D.C. United                    Toronto FC                     https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dcvstor-02-25-2023/
Inter Miami CF                 CF Montréal                    https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/miavsmtl-02-25-2023/
Orlando City                   New York Red Bulls             https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/orlvsrbny-02-25-2023/
Philadelphia Union             Columbus Crew                  https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/phivsclb-02-25-2023/
Austin FC                      St. Louis CITY SC              https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atxvsstl-02-25-2023/
FC Dallas                      Minnesota United               https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dalvsmin-02-25-2023/
Vancouver Whitecaps FC         Real Salt Lake                 https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/vanvsrsl-02-25-2023/
Seattle Sounders FC            Colorado Rapids                https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/seavscol-02-26-2023/
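
The question also mentions looping on to later pages. A minimal sketch of how the same API call might be repeated over consecutive date windows is shown below. It assumes the endpoint accepts any dateFrom/dateTo range with the parameters used above, which the thread does not confirm, and it is not known whether dateTo is inclusive, so duplicates are dropped at the end. The helper names (date_windows, window_params, collect_urls, fetch_with_requests) are hypothetical, and the 2023 season is hard-coded in the base URL as in the answer.

```python
from datetime import date, timedelta

API_URL = "https://sportapi.mlssoccer.com/api/matches"
MATCH_BASE = "https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/"


def date_windows(start, end, days=7):
    """Yield consecutive (date_from, date_to) pairs covering start..end inclusive."""
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=days - 1), end)
        yield cur, stop
        cur = stop + timedelta(days=1)


def window_params(date_from, date_to):
    """Build the query parameters from the answer for one date window."""
    return {
        "culture": "en-us",
        "dateFrom": date_from.isoformat(),
        "dateTo": date_to.isoformat(),
        "competition": "98",
        "matchType": "Regular",
        "excludeSecondaryTeams": "true",
    }


def collect_urls(start, end, fetch_json):
    """Collect match URLs for every window; fetch_json(url, params) returns a list of matches."""
    urls = []
    for date_from, date_to in date_windows(start, end):
        matches = fetch_json(API_URL, window_params(date_from, date_to))
        urls.extend(f"{MATCH_BASE}{m['slug']}/" for m in matches)
    # Drop duplicates while keeping first-seen order (windows may overlap on the API side).
    return list(dict.fromkeys(urls))


def fetch_with_requests(url, params):
    """Default fetcher; requests is imported here so the helpers above work without it."""
    import requests

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

Calling collect_urls(date(2023, 2, 19), date(2023, 3, 5), fetch_with_requests) would issue one request per week. Passing the fetcher in as an argument keeps the date and URL logic testable without network access.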

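Comments

0 upvotes Paul Corcoran 6/20/2023
That's the code. Out of curiosity, when I scroll down the page with the DevTools Network tab open, I can't find sportsapi anywhere; how did you find it?
0 upvotes Andrej Kesely 6/20/2023
@PaulCorcoran Try clicking the dropdown menus such as "All seasons", "Regular season", etc., and watch the Network tab in the developer tools.
1 upvote Paul Corcoran 6/20/2023
Got it, thanks for the guidance.

1 upvote undetected Selenium 6/20/2023 #2

To print the links, you can use either of the following locator strategies:

  • Using CSS_SELECTOR: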

    driver.get("https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20")
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.mls-c-match-list__match a")))])
    
  • Using XPATH:

    driver.get("https://www.mlssoccer.com/schedule/scores#competition=mls-regular-season&club=all&date=2023-02-20")
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'mls-c-match-list__match')]//a")))])
    
  • Console output:

    ['https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/nshvsnyc-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atlvssj-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cltvsne-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/cinvshou-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dcvstor-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/miavsmtl-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/orlvsrbny-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/phivsclb-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/atxvsstl-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/dalvsmin-02-25-2023', 'https://www.mlssoccer.com/competitions/mls-regular-season/2023/matches/vanvsrsl-02-25-2023']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
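
The question's closing idea, letting Selenium render the page and parsing the resulting HTML separately, can be sketched as follows. This is only an illustration: it uses the standard library's html.parser in place of BeautifulSoup so that it has no third-party dependency, it assumes the match containers are div elements carrying the class mls-c-match-list__match (taken from the selectors above), and MatchLinkParser and extract_match_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MatchLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside divs that carry the match class."""

    def __init__(self, base_url, match_class="mls-c-match-list__match"):
        super().__init__()
        self.base_url = base_url
        self.match_class = match_class
        self.depth = 0  # div nesting depth inside a match container (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter a match container, or go one level deeper inside one.
            if self.depth or self.match_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_match_links(html, base_url):
    """Return the unique match links found in html, in document order."""
    parser = MatchLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.links))
```

Once the wait on the match elements has succeeded, the rendered markup could be passed in as extract_match_links(driver.page_source, driver.current_url). With BeautifulSoup, the same CSS selector used above ("div.mls-c-match-list__match a") should work with soup.select.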
    

Comments

1 upvote Paul Corcoran 6/20/2023
Thank you.
0 upvotes undetected Selenium 6/20/2023
@PaulCorcoran Perhaps using Selenium to render the page for BeautifulSoup is the better option.