如何使用 Rvest 抓取带有嵌套列的 HTML 表格?

How to scrape HTML table with nested column with Rvest?

提问人:ronzenith 提问时间:9/18/2022 最后编辑:ronzenith 更新时间:9/19/2022 访问量:218

问:

我在抓取带有嵌套列的 HTML 表格时遇到了一个大问题。

该表来自香港入境事务处

屏幕截图如下所示:

enter image description here

我试图用 rvest 来做,但结果很混乱。

library(rvest)
library(tidyverse)
library(stringr)
library(dplyr)

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

url_data %>%
read_html()
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>% 
read_html() %>% html_element(css = css_selector) %>% html_table()
immiTable

enter image description here

我的目标是提取第一行(即机场)并将其绘制到饼图上,并生成整个表的数据帧并将其保存到 excel 中。

我意识到取消嵌套表和报废嵌套表的教材相当稀缺。因此,我需要你的指导。非常感谢您的帮助。

HTML R 网页抓取 rvest 嵌套表

评论


答:

3赞 Rui Barradas 9/18/2022 #1

这是一种方法。标头格式使事情复杂化,但下面的代码有效。它提取整个表,而不仅仅是第一行。

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
  library(ggplot2)
})

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

page <- url_data %>% read_html()

page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  sub(",", "", .) %>%
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final

final
#>                                Arrival Hong Kong Residents
#> Airport                                               4258
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          333
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                          3404
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 7995
#>                                Arrival Mainland Visitors Arrival Other Visitors
#> Airport                                             1488                    422
#> Express Rail Link West Kowloon                         0                      0
#> Hung Hom                                               0                      0
#> Lo Wu                                                  0                      0
#> Lok Ma Chau Spur Line                                  0                      0
#> Heung Yuen Wai                                         0                      0
#> Hong Kong-Zhuhai-Macao Bridge                         28                     39
#> Lok Ma Chau                                            0                      0
#> Man Kam To                                             0                      0
#> Sha Tau Kok                                            0                      0
#> Shenzhen Bay                                         348                     37
#> China Ferry Terminal                                   0                      0
#> Harbour Control                                        0                      0
#> Kai Tak Cruise Terminal                                0                      0
#> Macau Ferry Terminal                                   0                      0
#> Total                                               1864                    498
#>                                Arrival Total Departure Hong Kong Residents
#> Airport                                 6168                          3775
#> Express Rail Link West Kowloon             0                             0
#> Hung Hom                                   0                             0
#> Lo Wu                                      0                             0
#> Lok Ma Chau Spur Line                      0                             0
#> Heung Yuen Wai                             0                             0
#> Hong Kong-Zhuhai-Macao Bridge            400                           243
#> Lok Ma Chau                                0                             0
#> Man Kam To                                 0                             0
#> Sha Tau Kok                                0                             0
#> Shenzhen Bay                            3789                          1301
#> China Ferry Terminal                       0                             0
#> Harbour Control                            0                             0
#> Kai Tak Cruise Terminal                    0                             0
#> Macau Ferry Terminal                       0                             0
#> Total                                  10357                          5319
#>                                Departure Mainland Visitors
#> Airport                                               1154
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          194
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                           524
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 1872
#>                                Departure Other Visitors Departure Total
#> Airport                                             315            5244
#> Express Rail Link West Kowloon                        0               0
#> Hung Hom                                              0               0
#> Lo Wu                                                 0               0
#> Lok Ma Chau Spur Line                                 0               0
#> Heung Yuen Wai                                        0               0
#> Hong Kong-Zhuhai-Macao Bridge                        15             452
#> Lok Ma Chau                                           0               0
#> Man Kam To                                            0               0
#> Sha Tau Kok                                           0               0
#> Shenzhen Bay                                         28            1853
#> China Ferry Terminal                                  0               0
#> Harbour Control                                       0               0
#> Kai Tak Cruise Terminal                               0               0
#> Macau Ferry Terminal                                  0               0
#> Total                                               358            7549

创建于 2022-09-18 with reprex v2.0.2


要绘制饼图,请绘制条形图,然后更改为极坐标。ggplot

Airport <- final[1,,]
Airport %>%
  t() %>%
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
  coord_polar(theta = "y", start = 0) +
  theme_void()

创建于 2022-09-18 with reprex v2.0.2

评论

0赞 ronzenith 9/18/2022
你救了我的命@Rui巴拉达斯!我想知道诀窍,即您的语法如何删除前四行标头?看来你所做的是,首先,抓取数字,其次,添加行和列名。我说得对吗?
0赞 ronzenith 9/18/2022
也许我的问题应该是这样的:如何在不完全提取列名的情况下抓取表格?@ruibarradas
0赞 Rui Barradas 9/19/2022
@ronzenith 区别在于要提取哪些元素,对于数字和标题,嗯,它是和,然后是(行名)。class='hRight'headers='Arrival'Departureheaders = 'Control_Point
2赞 QHarr 9/19/2022 #2

另一种方法是选择 tbody 行,按属性过滤掉隐藏的项目,然后稍后添加标题。

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

df <- map_dfr(
  rows,
  function(x) {
    x %>%
      html_elements("td[headers]") %>%
      set_names(headers) %>%
      html_text()
  }
) %>%
  mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))

或者有点浓缩,

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
  mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))

评论

0赞 ronzenith 10/2/2022
谢谢你的帮助。当我尝试仔细阅读您的代码时,我不太了解编写额外函数的用法 function(x) { x %>% html_elements(“td[headers]”) %>% set_names(headers) %>% html_text() } (1) x 在这里是什么意思?(2)为什么函数可以自行运行,而不需要调用函数并告诉R什么是x?@qharr
0赞 QHarr 10/3/2022
我想逐行应用,因此需要逐行调用函数,在本例中为 map_dfr。x 是从左边(行列表)传递到右边(应用于每行的函数)的内容。因此,x 是单行。该函数由 map_dfr 应用。