What alternative is faster to create new rows based on an external vector in R?

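Question:

A typical use case in which we want to replicate rows to accommodate an external vector of length > 1 is when we want to introduce new dates, or allow different individuals to appear for each date.

Imagine we want to create a measurement for each month in the iris dataset. One option is to do this: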

group_and_tidyr_expand <- function(df){
  df %>% 
    group_by(pick(everything())) %>% 
    tidyr::expand(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month")) %>% 
    ungroup()
}

However, dplyr now has a function that allows multiple values (i.e. output rows) per group, namely reframe(). The equivalent of the code above is:

group_and_reframe <- function(df){
  df %>% 
    group_by(pick(everything())) %>% 
    reframe(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month"))
}

Which of these two options is faster?

Note: reframe() does not need an explicit ungroup(), because its output is already ungrouped by default.
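
The note about ungrouping can be checked on a small example. The sketch below assumes dplyr 1.1.0 or later (where reframe() is available); the `toy` data frame and the three-month `months` vector are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data: two distinct rows, so two groups.
toy <- tibble(id = c("a", "b"), value = c(1, 2))

# A shorter date vector than in the question, to keep the output small.
months <- seq.Date(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "1 month")

reframed <- toy %>%
  group_by(id, value) %>%
  reframe(date = months)

# reframe() drops the grouping, so no trailing ungroup() is required.
is_grouped_df(reframed)

# One output row per group and date: 2 groups x 3 dates.
nrow(reframed)
```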

r dplyr tidyr

Answer:

microbenchmark::microbenchmark(
    "reframe" = group_and_reframe(iris), 
    "tidyr_expand" = group_and_tidyr_expand(iris)
)

#> Unit: milliseconds
#>          expr      min        lq      mean   median       uq      max neval
#>       reframe  59.8355  63.90195  74.08853  70.2310  79.2985 219.7671   100
#>  tidyr_expand 207.6898 227.03515 261.62132 243.3526 269.4305 517.4461   100
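
Before comparing timings, it may be worth confirming that the two approaches produce the same rows. The following is a minimal sketch that inlines both pipelines from the question; the names `dates`, `via_reframe` and `via_expand` are illustrative only, and the comparison converts both results to plain data frames on the assumption that any class difference is not of interest here.

```r
library(dplyr)
library(tidyr)

# Same monthly sequence as in the question.
dates <- seq.Date(as.Date("2022-01-01"), as.Date("2023-01-01"), by = "1 month")

via_reframe <- iris %>%
  group_by(pick(everything())) %>%
  reframe(date = dates)

via_expand <- iris %>%
  group_by(pick(everything())) %>%
  tidyr::expand(date = dates) %>%
  ungroup()

# Both results should have one row per distinct iris row and date.
c(reframe = nrow(via_reframe), expand = nrow(via_expand))

# Compare contents as plain data frames so class differences do not matter.
all.equal(as.data.frame(via_reframe), as.data.frame(via_expand))
```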

Answer:

library(tidyverse)
library(nycflights13)
library(bench)

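# Enlarge iris by repeating every row 10,000 times
big_iris <- iris %>% uncount(1E4)

# Deduplicate first, then cross the distinct rows with the dates
distinct_expand <- function(df){
  out <- tidyr::expand_grid(
    dplyr::distinct(df),
    dplyr::tibble(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month"))
  )
  # Restore the class and attributes of the input
  dplyr::dplyr_reconstruct(out, df)
}

# Same reframe() approach as in the question, using .by instead of group_by()
group_and_reframe <- function(df){
  df %>% 
    reframe(date = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2023-01-01"), by = "1 month"),
            .by = everything())
}

mark(distinct_expand(big_iris), group_and_reframe(big_iris))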
#> # A tibble: 2 × 6
#>   expression                       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 distinct_expand(big_iris)      106ms    106ms      9.46    22.5MB    28.4 
#> 2 group_and_reframe(big_iris)    130ms    147ms      6.82    34.2MB     6.82

Created on 2023-11-14 with reprex v2.0.2
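
To see why deduplicating first helps on data with many repeated rows, here is a small sketch of the distinct-then-cross idea. The `toy` data frame and `dates` tibble are hypothetical and much smaller than the benchmark data; the point is only that duplicated input rows collapse to one before the dates are attached.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the first two rows are exact duplicates.
toy <- tibble(id = c("a", "a", "b"), value = c(1, 1, 2))

# Three monthly dates, wrapped in a tibble so expand_grid() can cross it.
dates <- tibble(
  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "1 month")
)

# distinct() keeps one copy of each row; expand_grid() then pairs every
# remaining row with every date.
crossed <- expand_grid(distinct(toy), dates)

# 2 distinct rows x 3 dates.
nrow(crossed)
```

Because the crossing runs on the distinct rows only, its cost depends on the number of unique rows rather than on the full size of the input.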
