Asked by: Michael  Asked: 10/14/2023  Updated: 10/14/2023  Views: 40
Find each gameId by looping through a very large list of URLs and keeping those that exist
Q:
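I am trying to get a list of the gameId for every boxscore URL from here:
https://www.espn.com/nhl/boxscore/_/gameId/
Each URL ends with a specific gameId, for example:
https://www.espn.com/nhl/boxscore/_/gameId/4014559236
The problem is that I don't know the range or the number of gameIds. At the start of the 2023-2024 season they seem to start at 4014559236 and increase by 1. However, at the start of, say, the 2007-2008 season, they start at 271009021.
I want to get them going back as far as possible.
I used code I found that lets me specify some gameIds, checks whether each URL exists and, if it does, outputs the gameId.
My code here only checks a few gameIds from the start of the 2023-2024 season: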
library(httr)
library(purrr)

urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", 4014559236:4014559240)

# http_error() is TRUE when a URL returns an HTTP error status (>= 400)
url_is_dead <- map_lgl(urls, http_error)
temp <- data.frame(logical = url_is_dead, url = urls)

# split the candidates into working and dead URLs (requires dplyr to be installed)
safe_urls <- dplyr::filter(temp, !logical)
dead_urls <- dplyr::filter(temp, logical)

# the gameId is whatever follows the last slash of each working URL
game_ids <- sub(".*/", "", safe_urls$url)
print(game_ids)
[1] "401559238" "401559239" "401559240"
However, if I specify the range from, say, 271009021 to 4014559236, there is a huge number of IDs and URLs to check.
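For a rough sense of scale, the arithmetic can be sketched as follows. The two endpoints are the gameIds quoted above; the 0.2 seconds per request is an assumed figure for illustration, not a measurement.

```r
first_id <- 271009021   # start of the 2007-2008 season, as quoted above
last_id  <- 4014559236  # start of the 2023-2024 season, as quoted above

# Number of candidate gameIds if every integer in the range is tried
n_candidates <- last_id - first_id + 1

# Assumed average time per HTTP check, in seconds (hypothetical, not measured)
secs_per_request <- 0.2

# Estimated wall-clock time for a sequential scan, in days
est_days <- n_candidates * secs_per_request / 86400

format(n_candidates, big.mark = ",")
round(est_days)
```

That is roughly 3.7 billion candidate URLs, which at the assumed rate would take more than twenty years to check one after another.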
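Is there another way to do this that is faster and more efficient?
I would also like to get the date of each game, though I haven't been able to find it yet.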
A:
1 upvote
Dave2e
10/14/2023
#1
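You could start from each team's schedule for each year, for example https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (the Ducks for the 2022-23 season), and extract the gameId from the links in the "Result" column.
Here is the code for it: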
library(rvest)

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)
# get the main table
schedule <- page %>% html_elements("table")
# for each row, take the third column and find its "a" subnode,
# then extract the link to the game stats from that subnode
linkstogames <- schedule %>%
  html_elements(xpath = ".//tr//td[3]//a") %>%
  html_attr("href")
linkstogames
[1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
[3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
[5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
[7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
[9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293"
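To go from these links to bare gameIds across several teams and seasons, something along the following lines might work. It is only a sketch: the helper names (schedule_url, extract_game_id, get_game_ids) are made up for illustration, the team abbreviations and seasons are placeholder inputs, and it assumes every schedule page has the same layout as the Ducks page above.

```r
# Build the schedule URL for one team abbreviation and one season
# (URL pattern copied from the example page above)
schedule_url <- function(team, season) {
  paste0("https://www.espn.com/nhl/team/schedule/_/name/", team, "/season/", season)
}

# Keep only links that contain a gameId and return the numeric part
extract_game_id <- function(links) {
  links <- links[grepl("/gameId/[0-9]+", links)]
  sub(".*/gameId/([0-9]+).*", "\\1", links)
}

# Scrape one schedule page with rvest (same XPath as above) and return its gameIds
get_game_ids <- function(team, season) {
  page <- rvest::read_html(schedule_url(team, season))
  tables <- rvest::html_elements(page, "table")
  anchors <- rvest::html_elements(tables, xpath = ".//tr//td[3]//a")
  extract_game_id(rvest::html_attr(anchors, "href"))
}

# Placeholder inputs (assumptions): extend to every team and season of interest
teams <- c("ana", "bos")
seasons <- 2021:2022
grid <- expand.grid(team = teams, season = seasons, stringsAsFactors = FALSE)

# Not run here because it needs network access:
# all_ids <- unique(unlist(Map(get_game_ids, grid$team, grid$season)))
```

Every game shows up on the schedules of both teams involved, hence the unique() in the commented-out call. This replaces billions of existence checks with one request per team per season; adding a short Sys.sleep() between requests would keep the load on the site low.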
Comments
0 upvotes
Michael
10/15/2023
That's a good starting point. I'll give it a try.