Asked by: JustB · Asked: 8/29/2023 · Last edited by: JustB · Updated: 8/29/2023 · Views: 84
Web scraping using RSelenium
Q:
I am trying to retrieve the URLs of all hospitals in each state by scraping the web. For example, https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama or https://guide.prod.iam.aha.org/guide/searchResults?query=Alaska use JavaScript to dynamically populate the page, changing its content and CSS styling. The search results for a state may span one or more pages; when they span multiple pages, a pagination div is added to the page with JavaScript.
The results can then be navigated with the pagination div at the bottom of the page, or viewed on a single page. When there is only one page of results, however, the pagination div is not rendered at all. When the pagination div is present, the anchor tags with class name "nav-link" can be clicked to reach the last results page. Once the last results page is displayed, the li tags with class names "pagination-last" and "pagination-next" gain an additional class name, "disabled".
Given this, I want to loop through the results to collect and combine all the hospital profile URLs for each state; each URL is attached to a hospital's "Profile" button.
My code below works for states whose results span multiple pages, but not for single-page results. I need it to work in both cases, whether or not the pagination div is present.
library(tidyverse)
library(rvest)
library(RSelenium)
library(wdman)
library(netstat)

selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = 'chrome',
                          chromever = "116.0.5845.98",
                          verbose = F,
                          port = free_port())
remDr <- remote_driver$client
collect_hospital_urls <- function(state_url){
  remDr$navigate(state_url)

  preferred_class <- "disabled"
  all_profile_urls <- list()

  while (TRUE) {
    # collect all hospital profile links on the current page
    profile_tags <- remDr$findElements(using = "css", value = "a[_ngcontent-c10]")

    # extract href attributes
    profile_href_attributes <- sapply(profile_tags, function(tag) {
      tag$getElementAttribute("href")[[1]]
    })

    # combine with the profiles collected from previous pages
    all_profile_urls <- append(all_profile_urls, profile_href_attributes)

    # find the pagination li element by its class name
    li_pagination_next <- remDr$findElement(using = "class name", "pagination-next")

    # check if the preferred class name is present
    if (preferred_class %in% as.character(str_split(unlist(li_pagination_next$getElementAttribute("class")), "\\s+", simplify = T))) {
      print("Preferred class found!")
      break
    } else {
      # click on the link to load the next page of results
      next_button <- remDr$findElement(using = 'link text', 'Next')
      next_button$clickElement()
      print("Oj")
      # wait for some time to allow the new content to load
      Sys.sleep(2)
    }
  }

  all_profile_urls <- all_profile_urls |> unlist()
}
x <- collect_hospital_urls(state_url)
I look forward to your help.
I tried iterating over the results with a while loop, but it fails when there is only one page and the pagination div is not rendered.
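For reference, a minimal sketch of one way to make the pagination check safe on single-page results: findElements() (plural) returns an empty list instead of raising an error when nothing matches, so the div's absence can be detected before checking for "disabled". The helper name is hypothetical; the selectors are the ones used in the code above.

# returns TRUE only when a pagination control exists and its "next" item
# is not yet disabled; findElements() yields an empty list (no error)
# when the element is absent, i.e. on single-page results
has_next_page <- function(remDr) {
  pagination <- remDr$findElements(using = "class name", "pagination-next")
  if (length(pagination) == 0) return(FALSE)  # no pagination div at all
  classes <- unlist(pagination[[1]]$getElementAttribute("class"))
  !grepl("\\bdisabled\\b", classes)
}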
A:
If RSelenium is not a strict requirement, I would use the GraphQL API and JSON responses instead.
The request payload can be found through the browser's developer tools. It's a bit challenging for the super-lazy approach of copying the request as cURL and passing it through httr2::curl_translate(), as the latter struggles with the escape sequences in the JSON string. But for rapid prototyping we can still get a working request through https://curlconverter.com/r/ , just to see whether anything blocks a no-JS approach (Cloudflare anti-scraping measures, for example). In this case it works fine, and we can then test whether it keeps working without passing cookies and extra headers.
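For illustration, this is roughly what the copy-as-cURL route looks like; the command below is a trimmed-down placeholder (the real one copied from the network tab carries the full JSON payload, cookies, and headers):

library(httr2)
# translate a command copied from the browser's network tab ("Copy as cURL")
# into httr2 code; prints a request() pipeline that can be pasted into a script
curl_translate('curl "https://guide.prod.iam.aha.org/guide/graphql" -X POST -H "Content-Type: application/json"')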
With {httr2}:
library(dplyr, warn.conflicts = FALSE)
library(httr2)
library(purrr)
# graphql query extracted from POST request payload
graphql_payload <- jsonlite::parse_json(
'{
"query": "query Search($searchTerm: String, $searchType: String, $state: String, $cities: [String], $counties: [String], $pageNum: Int!) {\\n search(searchTerm: $searchTerm, searchType: $searchType, state: $state, city: $cities, county: $counties, pageNum: $pageNum) {\\n total\\n start\\n pageLength\\n result {\\n index\\n resultType\\n orgDisplayName\\n systemDisplayName\\n region\\n personDisplayName\\n title\\n address\\n ahaId\\n ahaMember\\n affiliateUnitOf\\n __typename\\n }\\n facet {\\n name\\n facetValue\\n __typename\\n }\\n error\\n __typename\\n }\\n}\\n",
"variables": {
"searchTerm": "Alabama",
"searchType": "all",
"pageNum": 1
},
"operationName": "Search"
}')
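As a quick sanity check (purely illustrative), the parsed payload is an ordinary nested R list, so the search variables can be modified in place before each request:

# the query variables jsonlite::parse_json() produced: searchTerm,
# searchType and pageNum, which the helper below overwrites per request
str(graphql_payload$variables)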
# set search term and page number in graphql query, make request
graphql_search <- function(graphql, term, page = 1){
  graphql$variables$searchTerm <- term
  graphql$variables$pageNum <- page

  request("https://guide.prod.iam.aha.org/guide/graphql") %>%
    req_body_json(graphql) %>%
    req_perform() %>%
    resp_body_json()
}
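A quick smoke test of the helper (assumes network access to the endpoint); total and pageLength are requested in the GraphQL query above and drive the pagination logic that follows:

# fetch the first result page for one state and peek at the paging fields
first_page <- graphql_search(graphql_payload, "Alabama", 1)
str(first_page$data$search[c("total", "start", "pageLength")])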
# execute hospital search, calculate last page number from the first response;
# if there are more pages, fetch those as well
hosp_search <- function(term, graphql = graphql_payload){
  results <- graphql_search(graphql, term, 1) %>% pluck("data", "search") %>% list()
  last_page_n <- ceiling(as.numeric(results[[1]]$total) / as.numeric(results[[1]]$pageLength))

  if (last_page_n > 1){
    results_cont <- map(2:last_page_n, \(page) graphql_search(graphql, term, page) %>% pluck("data", "search"))
    results <- c(results, results_cont)
  }
  results
}
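The last-page arithmetic is plain ceiling division; with hypothetical counts:

# e.g. 45 results at 20 per page: pages 2 and 3 are fetched after page 1
ceiling(45 / 20)
#> [1] 3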
# execute search,
# pluck "result" elements from the returned list (each returned page is a list item),
# convert the resulting list of named lists to a data.frame / tibble with bind_rows() and
# generate urls from the ahaId field
hosp_search("Alaska") %>%
  map("result") %>%
  bind_rows() %>%
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)) %>%
  select(orgDisplayName, url)
Result:
#> # A tibble: 4 × 2
#> orgDisplayName url
#> <chr> <chr>
#> 1 Alaska Regional Hospital https://guide.prod.iam.aha.org/guide/hospita…
#> 2 Alaska Native Medical Center https://guide.prod.iam.aha.org/guide/hospita…
#> 3 Alaska Psychiatric Institute https://guide.prod.iam.aha.org/guide/hospita…
#> 4 Providence Alaska Medical Center https://guide.prod.iam.aha.org/guide/hospita…
Created on 2023-08-29 with reprex v2.0.2
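Since the original goal was every hospital across several states, a hypothetical extension (the names states and all_hospitals are mine) simply maps the search over a vector of state names and stacks the results:

# loop the search over several states and combine everything into one tibble
states <- c("Alabama", "Alaska")
all_hospitals <- states %>%
  map(\(s) hosp_search(s) %>% map("result") %>% bind_rows() %>% mutate(state = s)) %>%
  bind_rows() %>%
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId))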