Find each gameId by looping through a very large list of URLs and keeping those that exist

Asked by Michael on 10/14/2023 · Updated 10/14/2023 · Viewed 40 times

Q:

I am trying to get a list of every gameId used in the boxscore URLs here:

https://www.espn.com/nhl/boxscore/_/gameId/

Each URL ends with a specific gameId, for example:

https://www.espn.com/nhl/boxscore/_/gameId/4014559236

The problem I have is that I do not know the range or the number of gameIds. At the start of the 2023-2024 season they appear to begin at 4014559236 and increment. However, at, say, the start of the 2007-2008 season, they begin at 271009021.

I would like to get them from as far back as possible.

I am using code I found that lets me specify some gameIds, check whether each URL exists, and, if it does, output the gameId.

The code here only uses a few gameIds from the start of the 2023-2024 season:

library(httr)
library(purrr)
library(RCurl)
library(dplyr)   # filter() and the %>% pipe

urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/",4014559236:4014559240)

# http_error() is TRUE when the request fails, so FALSE marks URLs that exist
safe_url_logical <- map(urls, http_error)
temp <- cbind(unlist(safe_url_logical), unlist(urls))
colnames(temp) <- c("logical","url")
temp <- as.data.frame(temp)
safe_urls <- temp %>% 
  dplyr::filter(logical=="FALSE")
dead_urls <- temp %>% 
  dplyr::filter(logical=="TRUE")

# re-check each surviving URL and keep only those that really exist
df_exist <- list()

for (i in 1:nrow(safe_urls)) {
  url <- as.character(safe_urls$url[i])
  exist <- url.exists(url)
  if (exist) df_exist <- rbind(df_exist, url)
}

urls <- df_exist

# strip everything up to the last "/" to recover the gameId
game_ids <- sub('.*\\/', '', urls)
print(game_ids)
[1] "401559238" "401559239" "401559240"

However, if I specify a range from, say, 271009021 to 4014559236, that is an enormous number of values and URLs to check.

Is there an alternative approach that would be faster and more efficient?
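
One thing that should help at least somewhat is running the existence checks in parallel rather than one at a time. A rough sketch using furrr on top of the same !http_error() test as above (the worker count and the small id range are just placeholders):

library(httr)
library(future)
library(furrr)

plan(multisession, workers = 8)            # arbitrary number of parallel R workers

ids  <- 4014559236:4014559240              # placeholder range from the question
urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", ids)

# TRUE where the request does NOT error, i.e. the boxscore page exists
exists_flag <- future_map_lgl(urls, ~ !http_error(.x))

game_ids <- sub(".*/", "", urls[exists_flag])
print(game_ids)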

I would also like to get the date of each game, though I have not been able to find it yet.

r purrr httr rcurl



A:

1 upvote · Dave2e · 10/14/2023 · #1

You could start from each team's schedule for each year, for example https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (the Ducks for the 2022-23 season), and extract the gameIDs from the "Result" column.

Here is the code for that:

library(rvest)   # read_html(), html_elements(), html_attr()

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# get the main schedule table
schedule <- page %>% html_elements("table")

# for each row, take the third column ("Result") and find its "a" subnode,
# then pull the link to the game page out of that subnode
linkstogames <- schedule %>% html_elements(xpath = ".//tr //td[3] //a") %>%
                    html_attr("href")


 [1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
 [3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
 [5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
 [7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
 [9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293"
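
If you also need the date of each game, it sits in the same schedule table, so you can pull it from the same rows the links come from. A rough sketch, assuming the date is the first column of each row that links to a game (worth checking against the live page, since ESPN can change the layout):

library(rvest)

url  <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# rows of the schedule table that actually link to a game
game_rows <- page %>% html_elements(xpath = ".//tr[.//td[3]//a]")

# gameId: take the link in the third column and strip everything up to the last "/"
links    <- game_rows %>% html_element(xpath = ".//td[3]//a") %>% html_attr("href")
game_ids <- sub(".*/", "", links)

# date: assumed to be the first column of the same row
game_dates <- game_rows %>% html_element(xpath = ".//td[1]") %>% html_text2()

data.frame(gameId = game_ids, date = game_dates)

Looping that over the team abbreviation and the season year in the URL would then cover every season ESPN still publishes a schedule for, without guessing gameId ranges.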

Comments

0 upvotes · Michael · 10/15/2023
That's a good starting point. I'll give it a try.