Find each gameId by looping through a very large list of URLs and keeping those that exist

Asked by Michael on 10/14/2023 · Updated 10/14/2023 · Viewed 40 times

Q:

I am trying to get a list of every gameId used in the boxscore URLs here:

https://www.espn.com/nhl/boxscore/_/gameId/

Each URL ends with a specific gameId, for example:

https://www.espn.com/nhl/boxscore/_/gameId/4014559236

The problem I have is that I do not know the range or the number of gameIds. At the start of the 2023-2024 season they appear to begin at 4014559236 and increment. However, at, say, the start of the 2007-2008 season, they begin at 271009021.

I would like to get them from as far back as possible.

I am using code I found that lets me specify some gameIds, check whether each URL exists, and, if it does, output the gameId.

The code here only uses a few gameIds from the start of the 2023-2024 season:

library(httr)
library(purrr)
library(RCurl)
library(dplyr)   # filter() and the %>% pipe

urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/",4014559236:4014559240)

# http_error() is TRUE when the request fails, so FALSE marks URLs that exist
safe_url_logical <- map(urls, http_error)
temp <- cbind(unlist(safe_url_logical), unlist(urls))
colnames(temp) <- c("logical","url")
temp <- as.data.frame(temp)
safe_urls <- temp %>% 
  dplyr::filter(logical=="FALSE")
dead_urls <- temp %>% 
  dplyr::filter(logical=="TRUE")

# re-check each surviving URL and keep only those that really exist
df_exist <- list()

for (i in 1:nrow(safe_urls)) {
  url <- as.character(safe_urls$url[i])
  exist <- url.exists(url)
  if (exist) df_exist <- rbind(df_exist, url)
}

urls <- df_exist

# strip everything up to the last "/" to recover the gameId
game_ids <- sub('.*\\/', '', urls)
print(game_ids)
[1] "401559238" "401559239" "401559240"

However, if I specify a range from, say, 271009021 to 4014559236, that is an enormous number of values and URLs to check.

Is there an alternative approach that would be faster and more efficient?
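
One thing that should help at least somewhat is running the existence checks in parallel rather than one at a time. A rough sketch using furrr on top of the same !http_error() test as above (the worker count and the small id range are just placeholders):

library(httr)
library(future)
library(furrr)

plan(multisession, workers = 8)            # arbitrary number of parallel R workers

ids  <- 4014559236:4014559240              # placeholder range from the question
urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", ids)

# TRUE where the request does NOT error, i.e. the boxscore page exists
exists_flag <- future_map_lgl(urls, ~ !http_error(.x))

game_ids <- sub(".*/", "", urls[exists_flag])
print(game_ids)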

I would also like to get the date of each game, though I have not been able to find it yet.

r purrr httr rcurl



A:

1 upvote · Dave2e · 10/14/2023 · #1

You could start from each team's schedule for each year, for example https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (the Ducks for the 2022-23 season), and extract the gameIDs from the "Result" column.

Here is the code for that:

library(rvest)   # read_html(), html_elements(), html_attr()

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# get the main schedule table
schedule <- page %>% html_elements("table")

# for each row, take the third column ("Result") and find its "a" subnode,
# then pull the link to the game page out of that subnode
linkstogames <- schedule %>% html_elements(xpath = ".//tr //td[3] //a") %>%
                    html_attr("href")


 [1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
 [3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
 [5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
 [7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
 [9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293"
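
If you also need the date of each game, it sits in the same schedule table, so you can pull it from the same rows the links come from. A rough sketch, assuming the date is the first column of each row that links to a game (worth checking against the live page, since ESPN can change the layout):

library(rvest)

url  <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# rows of the schedule table that actually link to a game
game_rows <- page %>% html_elements(xpath = ".//tr[.//td[3]//a]")

# gameId: take the link in the third column and strip everything up to the last "/"
links    <- game_rows %>% html_element(xpath = ".//td[3]//a") %>% html_attr("href")
game_ids <- sub(".*/", "", links)

# date: assumed to be the first column of the same row
game_dates <- game_rows %>% html_element(xpath = ".//td[1]") %>% html_text2()

data.frame(gameId = game_ids, date = game_dates)

Looping that over the team abbreviation and the season year in the URL would then cover every season ESPN still publishes a schedule for, without guessing gameId ranges.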

Comments

0 upvotes · Michael · 10/15/2023
That's a good starting point. I'll give it a try.