Asked by: MikeTP · Asked: 3/15/2012 · Last edited by: zephryl, MikeTP · Updated: 2/24/2023 · Views: 228353
Aggregate / summarize multiple variables per group (e.g. sum, mean)
Q:
From a data frame, is there an easy way to aggregate (sum, mean, max, etc.) multiple variables simultaneously?
Below is some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
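I would like to aggregate the x1 and x2 variables from the df1 data frame by year and month at the same time. The following code aggregates the x1 variable, but is it possible to aggregate the x2 variable in the same call?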
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
A:
Yes, in your formula you can cbind the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate, in particular the formula argument and the examples.
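As a small check of the cbind approach, here is a minimal sketch on a hand-made data frame. The name toy and its values are assumptions for illustration and are not part of the question's df1; the point is only that both columns are aggregated in one call and that FUN can be swapped freely.

```r
# Hypothetical toy data (not the question's df1), small enough to verify by hand
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001, 2001),
  month = c(1, 1, 2, 1, 1),
  x1    = c(1, 2, 3, 4, 5),
  x2    = c(10, 20, 30, 40, 50)
)

# cbind() on the left-hand side aggregates x1 and x2 together
sums  <- aggregate(cbind(x1, x2) ~ year + month, data = toy, FUN = sum)
means <- aggregate(cbind(x1, x2) ~ year + month, data = toy, FUN = mean)

print(sums)
print(means)
```

Each result has one row per year/month combination present in the data and one column per aggregated variable.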
Comments
Using . on the left-hand side as a wildcard for the remaining columns: aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
Where does the year() function come from?
You can also use the reshape2 package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Comments
The recast function in reshape2 combines melt and dcast in a single step:
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
Using the fast data.table package (useful for larger datasets):
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package:
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (the column headings are messy in my example, though):
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Comments
The columns can also be summarised explicitly, without .SDcols:
dt[, .(x1.sum = sum(x1), x2.sum = sum(x2)), by = .(year, month)]
With the dplyr package, you can use across() to aggregate multiple variables using tidyselect syntax. For the example dataset, you can do this as follows:
library(dplyr)
set.seed(13)
# summarising all non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(everything(), n_distinct))
# summarising a specific set of non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(x1:x2, sum))
df1 %>% group_by(year, month) %>% summarise(across(c(x1, x2), sum))
df1 %>% group_by(year, month) %>% summarise(across(-date, sum))
# summarising a specific set of non-grouping variables using selection helpers:
df1 %>% group_by(year, month) %>% summarise(across(starts_with('x'), sum))
df1 %>% group_by(year, month) %>% summarise(across(matches('.*[0-9]'), sum))
# summarising a specific set of non-grouping variables based on condition (class)
df1 %>% group_by(year, month) %>% summarise(across(where(is.numeric), sum))
All of these except the first give the following result:
# A tibble: 24 × 4
# Groups: year [2]
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 131. 27.4
2 2000 2 44.8 155.
3 2000 3 60.7 207.
4 2000 4 -11.5 379.
5 2000 5 64.0 441.
6 2000 6 -16.5 517.
7 2000 7 210. 530.
8 2000 8 112. 573.
9 2000 9 -129. 347.
10 2000 10 -165. 444.
# … with 14 more rows
You can also apply multiple functions to the selected columns:
df1 %>%
  group_by(year, month) %>%
  summarise(across(x1:x2, list(sum = sum, avg = mean)))
# A tibble: 24 × 6
# Groups: year [2]
year month x1_sum x1_avg x2_sum x2_avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 1 131. 4.24 27.4 0.884
2 2000 2 44.8 1.54 155. 5.34
3 2000 3 60.7 1.96 207. 6.69
4 2000 4 -11.5 -0.385 379. 12.6
5 2000 5 64.0 2.06 441. 14.2
6 2000 6 -16.5 -0.550 517. 17.2
7 2000 7 210. 6.76 530. 17.1
8 2000 8 112. 3.60 573. 18.5
9 2000 9 -129. -4.30 347. 11.6
10 2000 10 -165. -5.33 444. 14.3
# … with 14 more rows
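A few final notes:
- By default, summarise() drops the last grouping level, so all of the examples above will still be grouped by year. To remove all grouping, you can add an ungroup() call, or set .groups = "drop" in the summarise() call.
- As of dplyr 1.1.0, you can also use the .by argument to specify the grouping for a summarise operation, e.g., df1 %>% summarise(across(c(x1, x2), sum), .by = c(year, month))
- across() also works with other dplyr verbs, such as mutate() and reframe().
- Before across() was introduced in dplyr 1.0.0, these kinds of operations were done with summarise_all(), summarise_at(), summarise_if(), and (even earlier) summarise_each(). These have now been superseded or deprecated in favour of across().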
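The first note about leftover grouping can be checked directly. The following is a minimal sketch, assuming dplyr >= 1.0.0 is installed; the data frame toy and its values are made up for illustration and are not the question's df1.

```r
library(dplyr)

# Hypothetical toy data, small enough to inspect by hand
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# Default behaviour: only the last grouping level (month) is dropped
still_grouped <- toy %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), sum))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped result
ungrouped <- toy %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), sum), .groups = "drop")
group_vars(ungrouped)
```

The two results contain the same rows and values; they differ only in whether later verbs would still operate per year.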
Late to the party, but I recently found another way to get summary statistics.
library(psych)
describe(data)
This outputs, for each variable: mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range.
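Comments
To apply describe by group, describe.by(column, group = grouped_column) will group the values (the function is named describeBy() in more recent versions of psych).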
Interestingly, the base R aggregate data.frame method is not shown here; the formula interface is used above, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
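One practical difference between the two interfaces is how missing values are handled. The sketch below uses a made-up data frame (toy_na is an assumption for illustration): the formula method's default na.action = na.omit drops every row that has an NA in any variable before FUN is called, whereas the data.frame method keeps the rows and lets na.rm = TRUE act per column.

```r
# Hypothetical data with a single missing value in x1
toy_na <- data.frame(
  g  = c("a", "a", "b", "b"),
  x1 = c(1, NA, 3, 4),
  x2 = c(10, 20, 30, 40)
)

# Formula method: row 2 is removed entirely, so its x2 value (20) is lost too
by_formula <- aggregate(cbind(x1, x2) ~ g, data = toy_na, FUN = sum, na.rm = TRUE)

# data.frame method: all rows are kept, NA is skipped only within x1
by_df <- aggregate(toy_na[c("x1", "x2")], by = toy_na["g"], FUN = sum, na.rm = TRUE)

# Formula method with na.action = na.pass behaves like the data.frame method
by_formula_pass <- aggregate(cbind(x1, x2) ~ g, data = toy_na, FUN = sum,
                             na.rm = TRUE, na.action = na.pass)

print(by_formula)
print(by_df)
print(by_formula_pass)
```

Which behaviour is preferable depends on whether incomplete rows should count towards the other columns.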
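A more general use of aggregate's data.frame method:
Since we provide
- a data.frame as x and
- a list (a data.frame is also a list) as by,
this is very useful when the aggregation needs to be dynamic; for example, switching to other columns to aggregate, or to aggregate by, is very simple
- this also works with custom-made aggregation functions
For example, like so: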
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
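When FUN returns more than one value, as dummyaggfun does, aggregate stores the results for each input column as a matrix inside a single data frame column. The following minimal sketch, on made-up data (toy and sum_and_mean are hypothetical names for illustration), shows this and one common way to flatten the result with do.call(data.frame, ...).

```r
# Hypothetical toy data
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  x1   = c(1, 3, 5, 7)
)

# Aggregation function returning two named values per group
sum_and_mean <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

agg <- aggregate(toy["x1"], by = toy["year"], FUN = sum_and_mean)

# agg has only two columns: year, and x1 holding a two-column matrix
dim(agg)
is.matrix(agg$x1)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
print(flat)
```

After flattening, the columns are named by combining the original column name with the names returned by the function.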
With dplyr version >= 1.0.0, we can also use summarise with across:
library(dplyr)
df1 %>%
group_by(year, month) %>%
summarise(across(starts_with('x'), sum))
# A tibble: 24 x 4
# Groups: year [2]
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
For a more flexible and faster approach to data aggregation, check out the collap function in the collapse R package, available on CRAN:
library(collapse)
# Simple aggregation with one function
head(collap(df1, x1 + x2 ~ year + month, fmean))
year month x1 x2
1 2000 1 -1.217984 4.008534
2 2000 2 -1.117777 11.460301
3 2000 3 5.552706 8.621904
4 2000 4 4.238889 22.382953
5 2000 5 3.124566 39.982799
6 2000 6 -1.415203 48.252283
# Customized: Aggregate columns with different functions
head(collap(df1, x1 + x2 ~ year + month,
custom = list(fmean = c("x1", "x2"), fmedian = "x2")))
year month fmean.x1 fmean.x2 fmedian.x2
1 2000 1 -1.217984 4.008534 3.266968
2 2000 2 -1.117777 11.460301 11.563387
3 2000 3 5.552706 8.621904 8.506329
4 2000 4 4.238889 22.382953 20.796205
5 2000 5 3.124566 39.982799 39.919145
6 2000 6 -1.415203 48.252283 48.653926
# You can also apply multiple functions to all columns
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))
year month fmean.x1 fmin.x1 fmax.x1 fmean.x2 fmin.x2 fmax.x2
1 2000 1 -1.217984 -4.2460775 1.245649 4.008534 -1.720181 10.47825
2 2000 2 -1.117777 -5.0081858 3.330872 11.460301 9.111287 13.86184
3 2000 3 5.552706 0.1193369 9.464760 8.621904 6.807443 11.54485
4 2000 4 4.238889 0.8723805 8.627637 22.382953 11.515753 31.66365
5 2000 5 3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
6 2000 6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309
# When you do that, you can also return the data in a long format
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))
Function year month x1 x2
1 fmean 2000 1 -1.217984 4.008534
2 fmean 2000 2 -1.117777 11.460301
3 fmean 2000 3 5.552706 8.621904
4 fmean 2000 4 4.238889 22.382953
5 fmean 2000 5 3.124566 39.982799
6 fmean 2000 6 -1.415203 48.252283
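Note: You can use base functions such as mean, max, etc. with collap, but fmean, fmax, etc. are C++ based grouped functions provided by the collapse package, and they are significantly faster (i.e. performance on large data aggregations is on par with data.table while offering greater flexibility, and these fast grouped functions can also be used without collap).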
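Note 2: collap also supports flexible multi-type data aggregation. You can of course do this with the custom argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way: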
# wlddev is a data set of World Bank Indicators provided in the collapse package
head(wlddev)
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
# This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Aruba ABW 1961-01-01 1962.5 1960 Latin America & Caribbean High income FALSE NA 66.58583 NA NA
2 Aruba ABW 1967-01-01 1970.0 1970 Latin America & Caribbean High income FALSE NA 69.14178 NA NA
3 Aruba ABW 1976-01-01 1980.0 1980 Latin America & Caribbean High income FALSE NA 72.17600 NA 33630000
4 Aruba ABW 1987-01-01 1990.0 1990 Latin America & Caribbean High income FALSE 23677.09 73.45356 NA 41563333
5 Aruba ABW 1996-01-01 2000.0 2000 Latin America & Caribbean High income FALSE 26766.93 73.85773 NA 19857000
6 Aruba ABW 2007-01-01 2010.0 2010 Latin America & Caribbean High income FALSE 25238.80 75.01078 NA NA
# Note that by default (argument keep.col.order = TRUE) the column order is also preserved
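Here is another way to summarise multiple columns, especially useful when the function needs additional arguments. You can select all columns with everything(), or a subset of columns, as with any_of(c("a", "b")).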
library(dplyr)
# toy data
df <- tibble(a = sample(c(NA, 5:7), 30, replace = TRUE),
b = sample(c(NA, 1:5), 30, replace = TRUE),
c = sample(1:5, 30, replace = TRUE),
grp = sample(1:3, 30, replace = TRUE))
df
#> # A tibble: 30 × 4
#> a b c grp
#> <int> <int> <int> <int>
#> 1 7 1 3 1
#> 2 7 4 4 2
#> 3 5 1 3 3
#> 4 7 NA 3 2
#> 5 7 2 5 2
#> 6 7 4 4 2
#> 7 7 NA 3 3
#> 8 NA 5 4 1
#> 9 5 1 1 2
#> 10 NA 3 1 2
#> # … with 20 more rows
df %>%
group_by(grp) %>%
summarise(across(everything(),
list(mean = ~mean(., na.rm = TRUE),
q75 = ~quantile(., probs = .75, na.rm = TRUE))))
#> # A tibble: 3 × 7
#> grp a_mean a_q75 b_mean b_q75 c_mean c_q75
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6.6 7 2.88 4.25 3 4
#> 2 2 6.33 7 2.62 3.25 2.9 4
#> 3 3 5.78 6 3.33 4 3.09 4
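Updated solution: since dplyr 1.1.0, you can use .by in summarise for inline, ad hoc grouping (the result is automatically ungrouped after the computation).
Using across (available from dplyr 1.0.0) allows the same function to be applied to multiple columns at once.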
library(dplyr)
df1 %>%
summarise(across(starts_with('x'), sum), .by = c(year, month))
# A tibble: 24 x 4
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
Comments
The development version of dplyr can be installed with devtools::install_github('tidyverse/dplyr')