How to apply a str_extract function to a range of table cells?

Question:

My ultimate goal is to plot a graph using a Wikipedia table as the data source.

I am using R with Selenium.

I have successfully modified one of the cells:

library(RSelenium)
library(wdman)
library(netstat)
library(XML)      # htmlParse(), getNodeSet(), readHTMLTable()
library(stringr)  # str_extract()

selenium()

selenium_object <- selenium(retcommand = TRUE, check = F)
selenium_object

binman::list_versions("chromedriver")

#start the server
rs_driver_object <- rsDriver(browser = "chrome",
                          chromever = "117.0.5938.92",
                          verbose = F,
                          port = free_port())

#create a client object
remDr <- rs_driver_object$client

#open a browser
remDr$open()

remDr$navigate("https://zh.wikipedia.org/zh-hk/%E6%B2%99%E7%94%B0_(%E9%A6%99%E6%B8%AF)")

webpage_text <- remDr$getPageSource()[[1]]

tables <- getNodeSet(htmlParse(webpage_text),"//table")

df <- readHTMLTable(tables[[3]], Encoding("UTF-8"))

str_extract(df[4,2], "\\d+([.,]\\d+)?+(?<!\\()")

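The question is: how can I apply the regex function to a range of table cells? And how can I get the results in the same df?

My attempt was this, but it failed: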

celsius <- str_extract(df[4,2], "\\d+([.,]\\d+)?+(?<!\\()")

apply(df[c(2:3),c(5:8)], c(1,2), celsius)

The difficulty is how to get around the string attribute in celsius.

Thanks.

Tags: R, regex, Selenium-WebDriver, apply, string

Answer:

library(rvest)
library(dplyr)
url <- "https://zh.wikipedia.org/zh-hk/%E6%B2%99%E7%94%B0_(%E9%A6%99%E6%B8%AF)"

df <- read_html(url) |>
    html_elements(".wikitable") |>
    html_table(header = FALSE) |>
    purrr::pluck(1) |>
    filter(grepl("°C（°F）", X1)) |> # keep only the temperature rows
    # set month names to English for my benefit
    setNames(c("month", month.abb, "annual"))

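Then, to extract the temperatures, you can use tidyr::separate_wider_regex(). We can use the pattern c(c = ".+", f = "\\(.+\\)"), which means that everything before the brackets is Celsius ("c") and everything in the brackets is Fahrenheit ("f"). As you are only interested in Celsius, you can then just select() the c columns.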

df |>
    tidyr::separate_wider_regex(
        cols = Jan:annual, c(c = ".+", f = "\\(.+\\)"),
        names_sep = "_"
    ) |>
    select(month, ends_with("_c")) |>
    # remove _c from the end of the name
    rename_with(\(x) gsub("_c$", "", x)) |>
    # make temperature numeric
    mutate(across(Jan:annual, as.numeric))

# # A tibble: 5 × 14
#   month                 Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec annual
#   <chr>               <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1 歷史最高溫 °C（°F）  27.6  28.6  31.8  33    36.6  36.4  37.5  38.1  36.5  35.1  31.8  28.9   38.1
# 2 平均高溫 °C（°F）    19.3  19.8  22.1  25.7  29    30.8  31.9  31.9  31.1  28.5  24.9  20.9   26.3
# 3 日均氣溫 °C（°F）    15.8  16.5  19.1  22.6  26.1  28.1  28.9  28.6  27.8  25.2  21.5  17.2   23.1
# 4 平均低溫 °C（°F）    12.9  13.9  16.7  20.2  23.7  25.8  26.3  26    25.2  22.6  18.7  14.1   20.5
# 5 歷史最低溫 °C（°F）   2.9   4     4.4  10.2  15.3  19.9  21.3  22.1  19.9  14.4   6.3   4.8    2.9
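To try the same pattern without fetching the page, here is a minimal sketch on a made-up two-row table. The `toy` data frame and its values are assumptions for illustration only; the real table has more columns and rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for two rows of the scraped table
toy <- tibble(
  month = c("avg high", "avg low"),
  Jan   = c("19.3(66.7)", "12.9(55.2)"),
  Feb   = c("19.8(67.6)", "13.9(57.0)")
)

celsius_only <- toy |>
  # split each cell into a Celsius part and a bracketed Fahrenheit part
  separate_wider_regex(
    cols = Jan:Feb,
    patterns = c(c = ".+", f = "\\(.+\\)"),
    names_sep = "_"
  ) |>
  # keep the label column and the Celsius columns only
  select(month, ends_with("_c")) |>
  # drop the _c suffix and convert the text to numbers
  rename_with(\(x) sub("_c$", "", x)) |>
  mutate(across(Jan:Feb, as.numeric))

print(celsius_only)
```

The greedy ".+" first swallows the whole cell, then backtracks until the bracketed part can match, which is why the Celsius value ends just before the opening bracket.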

You may also want the data in long format. If so, just pipe the final result to tidyr::pivot_longer(-month).
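As a rough illustration of the long format, the sketch below reshapes a small made-up table. The `wide` data frame and the `period` and `celsius` column names are assumptions chosen for this example; without `names_to` and `values_to`, pivot_longer() would use its default names `name` and `value`.

```r
library(tidyr)

# Hypothetical wide table: one row per measure, one column per month
wide <- tibble::tibble(
  month = c("avg high", "avg low"),
  Jan   = c(19.3, 12.9),
  Feb   = c(19.8, 13.9)
)

# Every column except `month` becomes a (period, celsius) pair
long <- pivot_longer(wide, -month, names_to = "period", values_to = "celsius")

print(long)
```

Long format of this kind is usually the easier shape to hand to a plotting function, since each row carries one value together with its labels.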

If I carry on with Selenium, I tried the following code but it failed:

colnames(df) <- df[2,]
rownames(df) <- df[,1]
df <- df[c(-1, -2, -10), -1]
df1 <- df
for (i in 1:nrow(df1)) {
  str_extract(df1[i,], "\\d+([.,]\\d+)?+(?<!\\()")
}

It does run on a single row, str_extract(df1[1,], "\\d+([.,]\\d+)?+(?<!\\()"), and returns a list. May I know which part is wrong?

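1 upvote, SamR, 10/8/2023

I don't think Selenium is the way to go. It introduces additional complexity that I don't think you need here. I'm afraid I don't know what the specific error you're facing is. My advice would be not to use it unless you need to.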
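For readers who want to stay with base R and stringr, here is a hedged sketch of applying an extraction to a whole block of cells. Two things in the code above are likely to matter: apply() expects a function, whereas `celsius` in the question holds the result of one call, and a for loop neither prints nor stores the value of each iteration unless it is assigned. The `cells` data frame, the `extract_celsius` helper and the simplified pattern are assumptions for illustration, not the asker's actual data.

```r
library(stringr)

# Hypothetical stand-in for a block of scraped cells
cells <- data.frame(
  Jan = c("19.3(66.7)", "12.9(55.2)"),
  Feb = c("19.8(67.6)", "13.9(57.0)"),
  stringsAsFactors = FALSE
)

# Wrap the extraction in a function so it can be passed to lapply()/apply().
# The pattern is a simplified variant that takes the first number in each cell.
extract_celsius <- function(x) {
  as.numeric(str_extract(x, "\\d+(\\.\\d+)?"))
}

# str_extract() is vectorised, so each column is processed in one call;
# assigning into cells[] keeps the data frame shape and column names
cells[] <- lapply(cells, extract_celsius)

print(cells)
```

To limit this to a range of cells, the same assignment can be done on a subset, for example `cells[1:2, 1:2] <- lapply(cells[1:2, 1:2], extract_celsius)`.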
