Asked by: youconmeconwecon · Asked: 10/25/2023 · Last edited by: youconmeconwecon · Updated: 10/28/2023 · Views: 28
Creating loop using Rvest to scrape text into dataframe from multiple urls
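Q:
I'm having trouble creating a loop that uses rvest to scrape text from a list of URLs. I can extract the text for a single URL, but calling multiple URLs through a loop isn't working for me. I've searched past posts but couldn't get any of the solutions to work. Any guidance is much appreciated. Here is an example of a URL I'm trying to pull text from; the text I want is the "last sold, date, and price" line beneath the photo: https://www.redfin.com/CA/Los-Angeles/1501-Rollins-Dr-90063/home/6954819
I tried using a function...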
library(tidyverse)
library(rvest)
urls <- read_csv('C:\\Users\\XXXXX\\URLFile.csv',show_col_types = FALSE)
lst <- lapply(urls, function(x) {
pg <- read_html(x)
html_nodes(pg, "div.ListingStatusBannerSection")
html_text()
})
and a for loop...
library(tidyverse)
library(rvest)
urls <- read_csv('C:\\Users\\XXXXX\\URLFile.csv',show_col_types = FALSE)
for(i in length(urls)) {
link = paste0(urls[i])
page <- read_html(link[i]) %>%
text %>%
html_nodes(page[i], "div.ListingStatusBannerSection") %>%
html_text()
}
df <- data.frame(urls,webpage)
A:
0 votes
margusl
10/28/2023
#1
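lapply(urls, function(x){...}) applies the function to every column of the urls tibble, while you probably want it to iterate over the values in a single column. Also, html_text() has no input: the previous expression should be followed by a pipe.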
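To make the column-versus-value point concrete, here is a minimal sketch that needs no network access. It assumes a base data.frame as a stand-in for the tibble; the name demo_urls and the example.com links are placeholders, not part of the original post.

```r
# Stand-in for the urls tibble: a one-column data frame (placeholder links).
demo_urls <- data.frame(link = c("https://example.com/a", "https://example.com/b"))

# Iterating over the data frame itself visits each COLUMN once,
# so x is the whole character vector of links.
by_column <- lapply(demo_urls, function(x) length(x))

# Iterating over demo_urls$link visits each VALUE in that column,
# so x is a single link.
by_value <- lapply(demo_urls$link, function(x) length(x))

length(by_column)  # one element, for the single column "link"
length(by_value)   # one element per link
```

In the first call the function receives both links at once, which is not what read_html() expects; in the second it receives one link at a time.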
library(rvest)
library(dplyr, warn.conflicts = FALSE)
urls <- tribble(
~link,
"https://www.redfin.com/CA/Los-Angeles/1501-Rollins-Dr-90063/home/6954819",
"https://www.redfin.com/CA/Los-Angeles/1517-N-Eastern-Ave-90063/home/6955418"
)
urls
#> # A tibble: 2 × 1
#> link
#> <chr>
#> 1 https://www.redfin.com/CA/Los-Angeles/1501-Rollins-Dr-90063/home/6954819
#> 2 https://www.redfin.com/CA/Los-Angeles/1517-N-Eastern-Ave-90063/home/6955418
lapply(urls$link, function(x) {
read_html(x) %>% html_nodes("div.ListingStatusBannerSection") %>% html_text()
})
#> [[1]]
#> [1] "LAST SOLD ON SEP 22, 2023 FOR $590,000"
#>
#> [[2]]
#> [1] "SOLD ON JUL 21, 2023"
If that single element is all you need, you can get by with just rowwise() and mutate():
urls %>%
rowwise() %>%
mutate(status =
read_html(link) %>%
html_nodes("div.ListingStatusBannerSection") %>%
html_text()) %>%
ungroup()
#> # A tibble: 2 × 2
#> link status
#> <chr> <chr>
#> 1 https://www.redfin.com/CA/Los-Angeles/1501-Rollins-Dr-90063/home/69548… LAST …
#> 2 https://www.redfin.com/CA/Los-Angeles/1517-N-Eastern-Ave-90063/home/69… SOLD …
(Presumably you are aware that scraping goes against https://www.redfin.com/about/terms-of-use )
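The for loop from the question can probably be repaired along the same lines. The following is only a sketch under stated assumptions: extract_status(), scrape_status() and the fetch argument are hypothetical names introduced here for illustration, and the input is assumed to be a character vector of links such as urls$link.

```r
library(rvest)

# Hypothetical helper: pull the banner text out of an already parsed page.
extract_status <- function(page) {
  page %>%
    html_nodes("div.ListingStatusBannerSection") %>%
    html_text()
}

# Assumed input: a character vector of links, e.g. urls$link.
# fetch defaults to read_html; it is a parameter only so the loop can be
# tried with a stand-in parser instead of live requests.
scrape_status <- function(links, fetch = read_html) {
  status <- character(length(links))   # preallocate one slot per link
  for (i in seq_along(links)) {        # seq_along() visits 1..n; length() alone yields only n
    # Keep the first matching node; indexing an empty result with [1] gives NA.
    status[i] <- extract_status(fetch(links[i]))[1]
  }
  data.frame(link = links, status = status)
}
```

Called as scrape_status(urls$link), this would return one row per link with the banner text in status. Compared with the loop in the question, the result of each iteration is stored, the index runs over every link rather than only the last one, and the data frame is built from objects that actually exist.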