Need to use rvest to scrape dynamic content

Q:

I have to scrape data from an auction website called Unicorn Auctions.

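When I try to do this with rvest, all I can get is the auction title and URL, but I also need each auction's start and end dates. When I looked for the CSS class that holds them, all I found was the following line of code:

[screenshot; image not available]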

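From what I found on Stack Overflow, I am afraid I will have to use RSelenium to scrape it. But my boss wants it done with rvest only. He says it is possible, but I could not find any useful YouTube video or article.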

I don't want anyone to just hand me the solution; I only need some help!

R Web-scraping Dynamic Rvest Rselenium

A:

library(rvest)
library(stringr)
library(dplyr)

upcoming <- 
  read_html("https://bid.unicornauctions.com/") |>
  # the auction data is embedded in a <script> as a JavaScript assignment
  html_element(xpath = "//script[contains(text(),'viewVars =')]") |>
  html_text() |>
  # strip the assignment prefix and trailing semicolon so the rest parses as JSON
  str_remove("^\\s+viewVars =") |>
  str_remove(";\\s+$") |>
  jsonlite::fromJSON() |>
  # extract result_page from the parsed list
  purrr::pluck("upcomingAuctions", "result_page") |>
  as_tibble()

select(upcoming, title, contains("time")) |> glimpse()
#> Rows: 4
#> Columns: 9
#> $ title                    <chr> "November 'No Reserves' Unicorn Auction 2023"…
#> $ time_start               <chr> "2023-11-06T01:00:00Z", "2023-11-13T01:00:00Z…
#> $ time_start_live_auction  <lgl> NA, NA, NA, NA
#> $ time_start_proxy_bidding <lgl> NA, NA, NA, NA
#> $ timezone                 <chr> "America/Chicago", "America/Chicago", "Americ…
#> $ effective_end_time       <chr> "2023-11-13T00:00:00Z", "2023-11-20T00:00:00Z…
#> $ extended_end_time        <lgl> NA, NA, NA, NA
#> $ realtime_server_url      <lgl> NA, NA, NA, NA
#> $ is_times_the_money       <lgl> FALSE, FALSE, FALSE, FALSE

Created on 2023-11-12 with reprex v2.0.2
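
rvest does not run JavaScript, so anything the browser renders after the page loads is missing from the HTML it downloads. The answer above works because the page appears to ship its auction data inside a script tag, as a JavaScript object assigned to viewVars. Below is a minimal offline sketch of the same idea. The HTML string is made up for illustration and only mimics that structure, and showing the times in the zone named by the timezone column is an assumption about how the site intends them to be read.

```r
library(rvest)
library(stringr)

# Hypothetical page: the visible markup has no dates; they live in a JS variable.
html <- '
<html><body>
  <div class="auction"><a href="/auctions/1">Demo Auction</a></div>
  <script>
    viewVars = {"upcomingAuctions": {"result_page": [
      {"title": "Demo Auction",
       "time_start": "2023-11-06T01:00:00Z",
       "effective_end_time": "2023-11-13T00:00:00Z",
       "timezone": "America/Chicago"}
    ]}};
  </script>
</body></html>'

json_text <- read_html(html) |>
  # find the script whose text contains the assignment
  html_element(xpath = "//script[contains(text(), 'viewVars =')]") |>
  html_text() |>
  # drop the "viewVars =" prefix and the trailing ";" so only JSON remains
  str_remove("^\\s*viewVars =") |>
  str_remove(";\\s*$")

# fromJSON() simplifies the array of objects into a data frame
auctions <- jsonlite::fromJSON(json_text)$upcomingAuctions$result_page

# The timestamps are ISO 8601 strings in UTC (trailing "Z"); parse them as date-times
iso <- "%Y-%m-%dT%H:%M:%SZ"
auctions$start <- as.POSIXct(auctions$time_start, format = iso, tz = "UTC")
auctions$end   <- as.POSIXct(auctions$effective_end_time, format = iso, tz = "UTC")

# Assumption: display the times in the zone given by the timezone column
format(auctions$start, tz = auctions$timezone[1], usetz = TRUE)
```

On a real page, it helps to open the page source in the browser and search for a known value, such as an auction title, to see which script tag and variable name hold the data.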
