R: summarize a large data frame by multiple columns, but don't collapse rows in which all of the grouping columns are NA

Asked by: Maya Eldar  Asked: 3/21/2023  Updated: 3/21/2023  Views: 42

Q:

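I want to summarize a large data frame by multiple columns. However, some rows contain only NA in the columns I want to summarize by, even though they are actually distinct observations. I want to keep every row in which all of those columns are NA.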

Example data frame:

df <- data.frame(a = c(1,2,3,4,5,6),
                 b = c(2,NA,2,1,NA,4),
                 c = c(1,2,1,NA,NA,1),
                 d = c(2,3,2,NA,NA,2),
                 e = c(3,2,NA,NA,NA,NA),
                 f = c(4,1,3,NA,NA,3))

I want to summarize by columns c, d, e, and f. The desired output is:

df <- data.frame(a = c(1,2,3,4,5),
                 b = c(2,NA,2,1,NA),
                 c = c(1,2,1,NA,NA),
                 d = c(2,3,2,NA,NA),
                 e = c(3,2,NA,NA,NA),
                 f = c(4,1,3,NA,NA))
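
For context, a plain group_by() on c, d, e, and f treats NA as an ordinary key value, so the two all-NA rows (a = 4 and a = 5) land in the same group and are collapsed into one. A minimal sketch of that behavior, assuming dplyr is installed; the name `naive` is only illustrative:

```r
library(dplyr)

# Same example data as above
df <- data.frame(a = c(1,2,3,4,5,6),
                 b = c(2,NA,2,1,NA,4),
                 c = c(1,2,1,NA,NA,1),
                 d = c(2,3,2,NA,NA,2),
                 e = c(3,2,NA,NA,NA,NA),
                 f = c(4,1,3,NA,NA,3))

# Group only by c:f; NA counts as a regular grouping value here
naive <- df %>%
  group_by(c, d, e, f) %>%
  summarize(across(a:b, first), .groups = "drop")

nrow(naive)  # 4 rows rather than the desired 5: rows a = 4 and a = 5 were merged
```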

TNX!

r dplyr na summarize


A:

2 votes  zephryl  3/21/2023  #1

With dplyr, use a cumulative count of the rows in which all of the columns are NA as an additional grouping variable:

library(dplyr)

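df %>%
  mutate(
    # TRUE for rows where c, d, e, and f are all NA
    all_na = if_all(c:f, is.na),
    # give each all-NA row its own id; every other row gets 0
    all_na_grp = ifelse(all_na, cumsum(all_na), 0)
  ) %>%
  # the extra key keeps the all-NA rows in separate groups
  group_by(c, d, e, f, all_na_grp) %>%
  summarize(across(a:b, first), .groups = "drop") %>%
  # put a and b first and drop the helper column
  select(a, b, !all_na_grp)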
# A tibble: 5 × 6
      a     b     c     d     e     f
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     1     2     3     4
2     3     2     1     2    NA     3
3     2    NA     2     3     2     1
4     4     1    NA    NA    NA    NA
5     5    NA    NA    NA    NA    NA
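
To see why the extra grouping variable works, it can help to look at the helper columns before grouping. The sketch below assumes dplyr 1.0.4 or later (for if_all()). The names `helper` and `result` are only illustrative, and the final arrange(a) is an addition that is not part of the answer above; it restores the row order shown in the question's desired output.

```r
library(dplyr)

# Example data from the question
df <- data.frame(a = c(1,2,3,4,5,6),
                 b = c(2,NA,2,1,NA,4),
                 c = c(1,2,1,NA,NA,1),
                 d = c(2,3,2,NA,NA,2),
                 e = c(3,2,NA,NA,NA,NA),
                 f = c(4,1,3,NA,NA,3))

# Step 1: flag the all-NA rows and number them
helper <- df %>%
  mutate(
    all_na = if_all(c:f, is.na),
    all_na_grp = ifelse(all_na, cumsum(all_na), 0)
  )

helper$all_na      # FALSE FALSE FALSE  TRUE  TRUE FALSE
helper$all_na_grp  # 0 0 0 1 2 0

# Step 2: group with the helper id, summarize, then sort by a
result <- helper %>%
  group_by(c, d, e, f, all_na_grp) %>%
  summarize(across(a:b, first), .groups = "drop") %>%
  select(a, b, c:f) %>%
  arrange(a)
```

Because each all-NA row gets a distinct all_na_grp value (1, 2, ...), those rows can never share a group, while every other row has all_na_grp = 0 and is grouped by c, d, e, and f alone.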