如何按日期聚合变量

How to aggregate a variable by date

提问人:Álvaro Méndez Civieta 提问时间:11/13/2023 更新时间:11/13/2023 访问量:49

问:

我有一个包含列的 R 数据集:

  • date
  • country_code
  • version
  • money

该变量是数值的。变量 和 是分类的。和 的每个组合都定义了一个唯一的观测值,其值是可测量的。moneyversioncountry_codedatecountry_codeversionmoney

我想对变量进行分组,并在数据集中计算 3 个新变量,它们是:versioncountry_code

  • money_week:对于每个观测值,此变量存储前 7 天的聚合。money
  • money_15:对于每个观测值,此变量存储过去 15 天的聚合money
  • money_total:对于每个观测值,此变量存储截至当前日期的聚合。money

和的不同组合可能具有不同的开始和结束日期,也可能缺少日期。country_codeversion

data = tibble(
date = seq(as.Date('2022-01-01'), as.Date('2022-01-13'), by=1),
       money = 1:13)
data = data[-c(2,3,4,10,11),]

# [Some computations for adding the expected new variables]

# A tibble: 10 × 2
  date       money    money_week
2022-01-01     1          1
2022-01-05     5          6
2022-01-06     6          12
2022-01-07     7          19
2022-01-08     8          26
2022-01-09     9          35        
2022-01-12    12          42
2022-01-13    13          49

请注意,每个观测值不仅仅是前 7 个观测值的总和,而是前 7 天的总和。例如,日期观测值是日期 、 、 、 的观测值的总和,它将包括观测值 、 ,但缺少最后一个观测值。2022-01-082022-01-082022-01-072022-01-062022-01-052022-01-042022-01-032022-01-02

r 日期 缺失数据 滚动计算

评论

0赞 jay.sf 11/13/2023
计算年周的累计总和呢?transform(data, money_week=ave(money, strftime(date, '%V'), FUN=cumsum))

答:

1赞 dufei 11/13/2023 #1

您可以使用重叠联接将日期中的值合并到日期间隔(分别为 7 天或 15 天):

library(tidyverse)

# define sample data
data = tibble(
  date = seq(as.Date('2022-01-01'), as.Date('2022-01-13'), by=1),
  money = 1:13)

data = data[-c(2,3,4,10,11),]

# prepare intervals for overlap joins
df <- data |> 
  mutate(
    date = ymd(date),
    date_7 = date - days(6),
    date_15 = date - days(14)
  )

# merge by 7-day interval
sum_7 <- df |> 
  select(date, date_7) |> 
  inner_join(df, join_by(between(y$date, x$date_7, x$date))) |> 
  summarize(money = sum(money), .by = date.x) |> 
  rename(date = date.x, money_7 = money)

# merge by 15-day interval
sum_15 <- df |> 
  select(date, date_15) |> 
  inner_join(df, join_by(between(y$date, x$date_15, x$date))) |> 
  summarize(money = sum(money), .by = date.x) |> 
  rename(date = date.x, money_15 = money)

# bind cols
data |> 
  inner_join(sum_7, join_by(date)) |> 
  inner_join(sum_15, join_by(date))
#> # A tibble: 8 × 4
#>   date       money money_7 money_15
#>   <date>     <int>   <int>    <int>
#> 1 2022-01-01     1       1        1
#> 2 2022-01-05     5       6        6
#> 3 2022-01-06     6      12       12
#> 4 2022-01-07     7      19       19
#> 5 2022-01-08     8      26       27
#> 6 2022-01-09     9      35       36
#> 7 2022-01-12    12      42       48
#> 8 2022-01-13    13      49       61

创建于 2023-11-12 with reprex v2.0.2

1赞 Seth 11/13/2023 #2

您可以将变量用作滑动窗口函数的索引。date

library(slider)
library(tidyverse)

data %>%
  mutate(money_week = slide_index_dbl(money, date, sum, .before = days(6)))
#> # A tibble: 8 × 3
#>   date       money money_week
#>   <date>     <int>      <dbl>
#> 1 2022-01-01     1          1
#> 2 2022-01-05     5          6
#> 3 2022-01-06     6         12
#> 4 2022-01-07     7         19
#> 5 2022-01-08     8         26
#> 6 2022-01-09     9         35
#> 7 2022-01-12    12         42
#> 8 2022-01-13    13         49

更多测试数据可供显示money_15

set.seed(123)
newdata = tibble(
  date = seq(as.Date('2022-01-01'), as.Date('2022-03-31'), by=1),
  money = sample(1:20,90, replace = TRUE)) %>%
  slice_sample(n = 60) %>%
  arrange(date)


newdata %>%
  mutate(money_week = slide_index_dbl(money, date, sum, .before = days(6)),
         money_15 = slide_index_dbl(money, date, sum, .before = days(14)),
         money_total = cumsum(money))
#> # A tibble: 60 × 5
#>    date       money money_week money_15 money_total
#>    <date>     <int>      <dbl>    <dbl>       <int>
#>  1 2022-01-01    15         15       15          15
#>  2 2022-01-02    19         34       34          34
#>  3 2022-01-03    14         48       48          48
#>  4 2022-01-05    10         58       58          58
#>  5 2022-01-06    18         76       76          76
#>  6 2022-01-07    11         87       87          87
#>  7 2022-01-08     5         77       92          92
#>  8 2022-01-09    20         78      112         112
#>  9 2022-01-11     5         69      117         117
#> 10 2022-01-13     9         50      126         126
#> # ℹ 50 more rows