在数据框中按组计算唯一/非重复值

Counting unique / distinct values by group in a data frame

提问人:Mehper C. Palavuzlar 提问时间:10/11/2012 最后编辑:JaapMehper C. Palavuzlar 更新时间:11/14/2023 访问量:213492

问:

假设我有以下数据框:

> myvec
    name order_no
1    Amy       12
2   Jack       14
3   Jack       16
4   Dave       11
5    Amy       12
6   Jack       16
7    Tom       19
8  Larry       22
9    Tom       19
10  Dave       11
11  Jack       17
12   Tom       20
13   Amy       23
14  Jack       16

我想计算每个 .它应产生以下结果:order_noname

name    number_of_distinct_orders
Amy     2
Jack    3
Dave    1
Tom     2
Larry   1

我该怎么做?

数据帧 非重复值 R-FAQ

评论


答:

31赞 mpiktas 10/11/2012 #1

这应该可以解决问题:

ddply(myvec,~name,summarise,number_of_distinct_orders=length(unique(order_no)))

这需要软件包 plyr。

评论

16赞 smci 7/7/2018
这已经过时了,不应该是公认的答案:plyr 自 2014 年以来一直被封存。dplyr 或 data.table 是这些天使用的软件包。
2赞 Aren Cambre 7/24/2018
@smci同意了。这是许多与 R 相关的答案的问题,尤其是当它们适用于我们今天使用 Tidyverse 做的事情时。
3赞 Tyler Rinker 10/11/2012 #2

这也可以,但不如 plyr 解决方案雄辩:

x <- sapply(split(myvec, myvec$name),  function(x) length(unique(x[, 2]))) 
data.frame(names=names(x), number_of_distinct_orders=x, row.names = NULL)
54赞 Sven Hohenstein 10/11/2012 #3

这是一个具有以下功能的简单解决方案:aggregate

aggregate(order_no ~ name, myvec, function(x) length(unique(x)))
9赞 Jeffrey Evans 10/12/2012 #4

只需将内置的 R 函数与tapplylength

tapply(myvec$order_no, myvec$name, FUN = function(x) length(unique(x)))
90赞 mnel 10/16/2012 #5

一种方法data.table

library(data.table)
DT <- data.table(myvec)

DT[, .(number_of_distinct_orders = length(unique(order_no))), by = name]

data.tablev >= 1.9.5 现在有一个内置函数uniqueN

DT[, .(number_of_distinct_orders = uniqueN(order_no)), by = name]
82赞 Henrik 1/7/2017 #6

在您可以使用“计算唯一值的数量”:dplyrn_distinct

library(dplyr)
myvec %>%
  group_by(name) %>%
  summarise(n_distinct(order_no))

评论

3赞 jormaga 7/25/2020
我不推荐这个。如果您想使用 dplyr,请使用 length(unique(order_no)) 而不是 n_distinct(order_no),因为n_distinct真的很慢。
3赞 Henrik 7/25/2020
@jormaga 这是一个已知问题,请参阅汇总许多组的性能下降,其中还显示了您的解决方法;“我们知道这一点,我们有一个计划,但它不会在 1.0.0 中发生。"
9赞 jogo 5/4/2017 #7

这是一个解决方案sqldf

library("sqldf")

myvec <- read.table(header=TRUE, text=
"   name order_no
1    Amy       12
2   Jack       14
3   Jack       16
4   Dave       11
5    Amy       12
6   Jack       16
7    Tom       19
8  Larry       22
9    Tom       19
10  Dave       11
11  Jack       17
12   Tom       20
13   Amy       23
14  Jack       16")
sqldf("SELECT name,COUNT(distinct(order_no)) as number_of_distinct_orders FROM myvec GROUP BY name")
# > sqldf("SELECT name,COUNT(distinct(order_no)) as number_of_distinct_orders FROM myvec GROUP BY name")
#    name number_of_distinct_orders
# 1   Amy                         2
# 2  Dave                         1
# 3  Jack                         3
# 4 Larry                         1
# 5   Tom                         2
0赞 moodymudskipper 7/4/2017 #8

用:table

library(magrittr)
myvec %>% unique %>% '['(1) %>% table %>% as.data.frame %>%
  setNames(c("name","number_of_distinct_orders"))

#    name number_of_distinct_orders
# 1   Amy                         2
# 2  Dave                         1
# 3  Jack                         3
# 4 Larry                         1
# 5   Tom                         2
2赞 mattbawn 7/18/2017 #9
my.1 <- table(myvec)

my.1[my.1 != 0] <- 1

rowSums(my.1)

评论

0赞 Jaap 7/18/2017
也许换行:来取回数据帧。rowSums(my.1)stackstack(rowSums(my.1))[2:1]
0赞 PlasmaBinturong 3/5/2020
你可以把它做成一个单行字:rowSums( table(myvec) != 0 )
21赞 Aurèle 7/28/2017 #10

以下是 @David Arenburg 解决方案的基准,以及此处发布的一些解决方案的回顾(@mnel@Sven Hohenstein@Henrik):

library(dplyr)
library(data.table)
library(microbenchmark)
library(tidyr)
library(ggplot2)

df <- mtcars
DT <- as.data.table(df)
DT_32k <- rbindlist(replicate(1e3, mtcars, simplify = FALSE))
df_32k <- as.data.frame(DT_32k)
DT_32M <- rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(DT_32M)
bench <- microbenchmark(
  base_32 = aggregate(hp ~ cyl, df, function(x) length(unique(x))),
  base_32k = aggregate(hp ~ cyl, df_32k, function(x) length(unique(x))),
  base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
  dplyr_32 = summarise(group_by(df, cyl), count = n_distinct(hp)),
  dplyr_32k = summarise(group_by(df_32k, cyl), count = n_distinct(hp)),
  dplyr_32M = summarise(group_by(df_32M, cyl), count = n_distinct(hp)),
  data.table_32 = DT[, .(count = uniqueN(hp)), by = cyl],
  data.table_32k = DT_32k[, .(count = uniqueN(hp)), by = cyl],
  data.table_32M = DT_32M[, .(count = uniqueN(hp)), by = cyl],
  times = 10
)

结果:

print(bench)

# Unit: microseconds
#            expr          min           lq         mean       median           uq          max neval  cld
#         base_32      816.153     1064.817 1.231248e+03 1.134542e+03     1263.152     2430.191    10 a   
#        base_32k    38045.080    38618.383 3.976884e+04 3.962228e+04    40399.740    42825.633    10 a   
#        base_32M 35065417.492 35143502.958 3.565601e+07 3.534793e+07 35802258.435 37015121.086    10    d
#        dplyr_32     2211.131     2292.499 1.211404e+04 2.370046e+03     2656.419    99510.280    10 a   
#       dplyr_32k     3796.442     4033.207 4.434725e+03 4.159054e+03     4857.402     5514.646    10 a   
#       dplyr_32M  1536183.034  1541187.073 1.580769e+06 1.565711e+06  1600732.034  1733709.195    10  b  
#   data.table_32      403.163      413.253 5.156662e+02 5.197515e+02      619.093      628.430    10 a   
#  data.table_32k     2208.477     2374.454 2.494886e+03 2.448170e+03     2557.604     3085.508    10 a   
#  data.table_32M  2011155.330  2033037.689 2.074020e+06 2.052079e+06  2078231.776  2189809.835    10   c 

情节:

as_tibble(bench) %>% 
  group_by(expr) %>% 
  summarise(time = median(time)) %>% 
  separate(expr, c("framework", "nrow"), "_", remove = FALSE) %>% 
  mutate(nrow = recode(nrow, "32" = 32, "32k" = 32e3, "32M" = 32e6),
         time = time / 1e3) %>% 
  ggplot(aes(nrow, time, col = framework)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() + ylab("microseconds")

aggregate-VS-dplyr-VS-datatable

会议信息:

sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Linux Mint 18
# 
# Matrix products: default
# BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
# LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
# 
# locale:
# [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
# [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
# [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
# [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] ggplot2_2.2.1          tidyr_0.6.3            bindrcpp_0.2           stringr_1.2.0         
# [5] microbenchmark_1.4-2.1 data.table_1.10.4      dplyr_0.7.1           
# 
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.11     compiler_3.4.1   plyr_1.8.4       bindr_0.1        tools_3.4.1      digest_0.6.12   
# [7] tibble_1.3.3     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.1      Matrix_1.2-10   
# [13] mvtnorm_1.0-6    grid_3.4.1       glue_1.1.1       R6_2.2.2         survival_2.41-3  multcomp_1.4-6  
# [19] TH.data_1.0-8    magrittr_1.5     scales_0.4.1     codetools_0.2-15 splines_3.4.1    MASS_7.3-47     
# [25] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     sandwich_2.3-4   stringi_1.1.5    lazyeval_0.2.0  
# [31] munsell_0.4.3    zoo_1.8-0 
1赞 Marcus Paget 9/11/2020 #11

几年了..虽然有类似的要求,但最终写了我自己的解决方案。在这里申请:

 x<-data.frame(
 
 "Name"=c("Amy","Jack","Jack","Dave","Amy","Jack","Tom","Larry","Tom","Dave","Jack","Tom","Amy","Jack"),
 "OrderNo"=c(12,14,16,11,12,16,19,22,19,11,17,20,23,16)
)

table(sub("~.*","",unique(paste(x$Name,x$OrderNo,sep="~",collapse=NULL))))

  Amy  Dave  Jack Larry   Tom
    2     1     3     1     2
2赞 akrun 3/17/2021 #12

使用 fromfNdistinctcollapse

library(collapse)
library(magrittr)
myvec %>%
     fgroup_by(name) %>%
     fsummarise(number_of_distinct_orders = fNdistinct(order_no))
#   name number_of_distinct_orders
#1   Amy                         2
#2  Dave                         1
#3  Jack                         3
#4 Larry                         1
#5   Tom                         2

数据

myvec <- structure(list(name = c("Amy", "Jack", "Jack", "Dave", "Amy", 
"Jack", "Tom", "Larry", "Tom", "Dave", "Jack", "Tom", "Amy", 
"Jack"), order_no = c(12L, 14L, 16L, 11L, 12L, 16L, 19L, 22L, 
19L, 11L, 17L, 20L, 23L, 16L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14"))
0赞 Yun 11/14/2023 #13

data.table 或 dplyr 的另一个选项是先做唯一,然后计数。

df <- mtcars
DT_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
df_32M <- as.data.frame(DT_32M)
DT2_32M <- data.table::rbindlist(replicate(1e6, mtcars, simplify = FALSE))
bench::mark(
  base_32M = aggregate(hp ~ cyl, df_32M, function(x) length(unique(x))),
  dplyr_length_unique = dplyr::summarise(df_32M,
    count = length(unique(hp)),
    .by = cyl
  ),
  dplyr_n_distinct = dplyr::summarise(df_32M,
    count = dplyr::n_distinct(hp),
    .by = cyl
  ),
  dplyr_unique_then_count = dplyr::summarise(
    dplyr::distinct(df_32M, hp, cyl),
    count = dplyr::n(),
    .by = cyl
  ),
  dt_length_unique = DT_32M[, .(count = length(unique(hp))), keyby = cyl],
  dt_uniqueN = DT_32M[, .(count = data.table::uniqueN(hp)), keyby = cyl],
  dt_unique_then_count = {
    dt <- unique(DT2_32M, by = c("hp", "cyl"), cols = character())
    dt[, list(count = .N), keyby = "cyl"]
  },
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 7 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base_32M                   13.2s    13.2s    0.0757    8.98GB    0.227
#> 2 dplyr_length_unique      713.7ms  713.7ms    1.40      1.28GB    0
#> 3 dplyr_n_distinct         776.5ms  776.5ms    1.29      1.28GB    0
#> 4 dplyr_unique_then_count  412.3ms  419.7ms    2.38    378.36MB    0
#> 5 dt_length_unique         566.8ms  566.8ms    1.76    777.89MB    1.76
#> 6 dt_uniqueN                 572ms    572ms    1.75    457.83MB    0
#> 7 dt_unique_then_count     489.2ms    494ms    2.02     122.2MB    0