提问人:ronzenith 提问时间:9/18/2022 最后编辑:ronzenith 更新时间:9/19/2022 访问量:218
如何使用 Rvest 抓取带有嵌套列的 HTML 表格?
How to scrape HTML table with nested column with Rvest?
问:
我在抓取带有嵌套列的 HTML 表格时遇到了一个大问题。
该表来自香港入境事务处。
屏幕截图如下所示:
我试图用 rvest 来做,但结果很混乱。
library(rvest)
library(tidyverse)
library(stringr)
library(dplyr)
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
url_data %>%
read_html()
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
read_html() %>% html_element(css = css_selector) %>% html_table()
immiTable
我的目标是提取第一行(即机场)并将其绘制到饼图上,并生成整个表的数据帧并将其保存到 excel 中。
我意识到取消嵌套表和报废嵌套表的教材相当稀缺。因此,我需要你的指导。非常感谢您的帮助。
答:
3赞
Rui Barradas
9/18/2022
#1
这是一种方法。标头格式使事情复杂化,但下面的代码有效。它提取整个表,而不仅仅是第一行。
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
library(ggplot2)
})
url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"
page <- url_data %>% read_html()
page %>%
html_elements("[headers='Arrival']") %>%
html_text() %>%
paste("Arrival", .) -> col_names
page %>%
html_elements("[headers='Departure']") %>%
html_text() %>%
paste("Departure", .) %>%
c(col_names, .) -> col_names
page %>%
html_elements("[headers='Control_Point']") %>%
html_text() -> row_names
page %>%
html_elements("[class='hRight']") %>%
html_text() %>%
sub(",", "", .) %>%
as.numeric() %>%
matrix(nrow = length(row_names), byrow = TRUE) %>%
as.data.frame() %>%
setNames(col_names) %>%
`row.names<-`(row_names) -> final
final
#> Arrival Hong Kong Residents
#> Airport 4258
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 333
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 3404
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 7995
#> Arrival Mainland Visitors Arrival Other Visitors
#> Airport 1488 422
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 28 39
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 348 37
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 1864 498
#> Arrival Total Departure Hong Kong Residents
#> Airport 6168 3775
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 400 243
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 3789 1301
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 10357 5319
#> Departure Mainland Visitors
#> Airport 1154
#> Express Rail Link West Kowloon 0
#> Hung Hom 0
#> Lo Wu 0
#> Lok Ma Chau Spur Line 0
#> Heung Yuen Wai 0
#> Hong Kong-Zhuhai-Macao Bridge 194
#> Lok Ma Chau 0
#> Man Kam To 0
#> Sha Tau Kok 0
#> Shenzhen Bay 524
#> China Ferry Terminal 0
#> Harbour Control 0
#> Kai Tak Cruise Terminal 0
#> Macau Ferry Terminal 0
#> Total 1872
#> Departure Other Visitors Departure Total
#> Airport 315 5244
#> Express Rail Link West Kowloon 0 0
#> Hung Hom 0 0
#> Lo Wu 0 0
#> Lok Ma Chau Spur Line 0 0
#> Heung Yuen Wai 0 0
#> Hong Kong-Zhuhai-Macao Bridge 15 452
#> Lok Ma Chau 0 0
#> Man Kam To 0 0
#> Sha Tau Kok 0 0
#> Shenzhen Bay 28 1853
#> China Ferry Terminal 0 0
#> Harbour Control 0 0
#> Kai Tak Cruise Terminal 0 0
#> Macau Ferry Terminal 0 0
#> Total 358 7549
创建于 2022-09-18 with reprex v2.0.2
要绘制饼图,请绘制条形图,然后更改为极坐标。ggplot
Airport <- final[1,,]
Airport %>%
t() %>%
as.data.frame() %>%
mutate(`Arrival/Departure` = row.names(.)) %>%
ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
geom_col(width = 1) +
scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
coord_polar(theta = "y", start = 0) +
theme_void()
创建于 2022-09-18 with reprex v2.0.2
评论
0赞
ronzenith
9/18/2022
你救了我的命@Rui巴拉达斯!我想知道诀窍,即您的语法如何删除前四行标头?看来你所做的是,首先,抓取数字,其次,添加行和列名。我说得对吗?
0赞
ronzenith
9/18/2022
也许我的问题应该是这样的:如何在不完全提取列名的情况下抓取表格?@ruibarradas
0赞
Rui Barradas
9/19/2022
@ronzenith 区别在于要提取哪些元素,对于数字和标题,嗯,它是和,然后是(行名)。class='hRight'
headers='Arrival'
Departure
headers = 'Control_Point
2赞
QHarr
9/19/2022
#2
另一种方法是选择 tbody 行,按属性过滤掉隐藏的项目,然后稍后添加标题。
library(rvest)
library(tidyverse)
rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
df <- map_dfr(
rows,
function(x) {
x %>%
html_elements("td[headers]") %>%
set_names(headers) %>%
html_text()
}
) %>%
mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))
或者有点浓缩,
library(rvest)
library(tidyverse)
rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))
评论
0赞
ronzenith
10/2/2022
谢谢你的帮助。当我尝试仔细阅读您的代码时,我不太了解编写额外函数的用法 function(x) { x %>% html_elements(“td[headers]”) %>% set_names(headers) %>% html_text() } (1) x 在这里是什么意思?(2)为什么函数可以自行运行,而不需要调用函数并告诉R什么是x?@qharr
0赞
QHarr
10/3/2022
我想逐行应用,因此需要逐行调用函数,在本例中为 map_dfr。x 是从左边(行列表)传递到右边(应用于每行的函数)的内容。因此,x 是单行。该函数由 map_dfr 应用。
评论