按组计算唯一值

Count unique values by group

提问人:bvowe 提问时间:8/15/2022 最后编辑:akrunbvowe 更新时间:8/15/2022 访问量:467

问:

    DATA = data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
                      "STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
    
    WANT = data.frame("TRIMESTER" = c(1,2,3),
                      "NEW_ENROLL" = c(7,3,5),
                      "TOTAL_ENROLL" = c(7,10,15))

我有“数据”,想做“想要”,它有三列,对于每个“三个月”,你计算新的“学生”的数量,然后对于“TOTAL_ENROLL”,你只计算每个三个月唯一“学生”的总数。

我的尝试只计算每个三个月的数字。

library(dplyr)
DATA %>%
group_by(TRIMESTER) %>%
count()
r dplyr

评论


答:

4赞 akrun 8/15/2022 #1

我们将 'STUDENT' 中的元素进行分组,得到非 NA 元素,最后做累积和 (replaceduplicatedNATRIMESTERsumcumsum)

library(dplyr)
DATA %>%
   mutate(STUDENT = replace(STUDENT, duplicated(STUDENT), NA)) %>% 
  group_by(TRIMESTER) %>% 
  summarise(NEW_ENROLL = sum(!is.na(STUDENT)), .groups= 'drop') %>%
  mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))

-输出

# A tibble: 3 × 3
  TRIMESTER NEW_ENROLL TOTAL_ENROLL
      <dbl>      <int>        <int>
1         1          7            7
2         2          3           10
3         3          5           15

或者用distinct

distinct(DATA, STUDENT, .keep_all = TRUE) %>%
 group_by(TRIMESTER) %>% 
 summarise(NEW_ENROLL = n(), .groups = 'drop') %>% 
 mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
# A tibble: 3 × 3
  TRIMESTER NEW_ENROLL TOTAL_ENROLL
      <dbl>      <int>        <int>
1         1          7            7
2         2          3           10
3         3          5           15
9赞 Rui Barradas 8/15/2022 #2

这是一种方法。

suppressPackageStartupMessages(library(dplyr))

DATA <- data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
                  "STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
DATA %>% 
  mutate(NEW_ENROLL = !duplicated(STUDENT)) %>%
  group_by(TRIMESTER) %>%
  summarise(NEW_ENROLL = sum(NEW_ENROLL)) %>%
  ungroup() %>%
  mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
#> # A tibble: 3 × 3
#>   TRIMESTER NEW_ENROLL TOTAL_ENROLL
#>       <dbl>      <int>        <int>
#> 1         1          7            7
#> 2         2          3           10
#> 3         3          5           15

创建于 2022-08-14 由 reprex 软件包 (v2.0.1)

9赞 Mohamed Desouky 8/15/2022 #3
  • 对于品种,我们可以使用Base Raggregatetransform
transform(aggregate(. ~ TRIMESTER , DATA[!duplicated(DATA$STUDENT),] , length),
         TOTAL_ENROLL = cumsum(STUDENT))
  • 输出

  TRIMESTER STUDENT TOTAL_ENROLL
1         1       7            7
2         2       3           10
3         3       5           15