Pair each combination of two columns and calculate sum for a third column in data.table

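Question:

I have two very large data frames: df1 and df2. df1 contains the columns "from", "to" and "count". The values in "from" and "to" denote commuting points and can occur multiple times: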

from1	to1	count
10020	10020	20
10020	10020	10
10020	22001	NA
30030	20020	2
45001	32001	100
45001	32001	NA
45001	45001	1
90080	45002	NA

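In df2, I want to create every possible combination of "from" and "to". Then I want to fill a new column, "count_total", with the sum of commuters for each pair. If a combination does not appear in df1, I want to fill in 0. For NA, I also want to fill in 0. My desired output: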

from2	to2	count_total
10020	10020	30
10020	22001	0
10020	20020	0
10020	32001	0
10020	45001	0
10020	45002	0
30030	10020	0
30030	22001	0
30030	20020	2

...

I tried the following, but it does not sum the values for "count_total" correctly.

    df2 <- CJ(from2 = unique(df1$from1), 
                to2 = unique(df1$to1))


    df2[, count_total := sum(df1$count[
             df1$from1 == from2 &
               df1$to1 == to2
                ]), by = .(from2, to2)]

What am I doing wrong? Thank you!

merge data.table unique key-value cross-join dplyr

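Load the required packages.
Wrap dt with lazy_dt() so that we can use dplyr functions on it.
Summarise, combining the rows that share the same from1 and to1.
Complete the data, which creates a row for every combination of from1 and to1, with a default value of 0.
Since the result is still lazy at this point, call as.data.table() to make it actually do the work.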

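# dplyr supplies summarise() and tidyr supplies complete(); dtplyr translates them to data.table
pacman::p_load(data.table, dtplyr, dplyr, tidyr)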

dt <- dt |> lazy_dt()

dt |> 
  summarise(count = sum(count, na.rm = TRUE), .by = c(from1, to1)) |>
  complete(from1, to1, fill = list(count = 0)) |> 
  as.data.table()

Output:

    from1   to1 count
 1: 10020 10020    30
 2: 10020 20020     0
 3: 10020 22001     0
 4: 10020 32001     0
 5: 10020 45001     0
 6: 10020 45002     0
 7: 30030 10020     0
 8: 30030 20020     2
 9: 30030 22001     0
10: 30030 32001     0
11: 30030 45001     0
12: 30030 45002     0
13: 45001 10020     0
14: 45001 20020     0
15: 45001 22001     0
16: 45001 32001   100
17: 45001 45001     1
18: 45001 45002     0
19: 90080 10020     0
20: 90080 20020     0
21: 90080 22001     0
22: 90080 32001     0
23: 90080 45001     0
24: 90080 45002     0
    from1   to1 count
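
The pipeline above assumes `dt` already holds the question's data. As a minimal sketch, assuming the sample rows from the question stored as integer columns, `dt` could be built as follows. The two helper values (`expected_rows` and `expected_total`, both names made up here for illustration) give a quick sanity check for any completed table:

```r
library(data.table)

# Sample rows copied from the question; NA marks a missing count.
dt <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20L, 10L, NA, 2L, 100L, NA, 1L, NA)
)

# Rows the completed table should have: every from1 paired with every to1.
expected_rows <- uniqueN(dt$from1) * uniqueN(dt$to1)

# Total the completed counts should add up to once NA is treated as 0.
expected_total <- sum(dt$count, na.rm = TRUE)
```

If the completed result has a different number of rows, or its counts do not add up to `expected_total`, some pairs were dropped or duplicated along the way.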

0 votes r2evans 9/7/2023 #2

We can do a merge and then summarise:

library(data.table)
setDT(df1)
# A regular join keeps every matching row of df1. An update join with := would
# keep only the last match per pair, so duplicated pairs would not be summed.
df1[CJ(from2 = unique(df1$from1), to2 = unique(df1$to1)),
    on = .(from1 == from2, to1 == to2)
  ][, .(count2 = sum(c(0, count), na.rm = TRUE)), by = .(from2 = from1, to2 = to1)]
#     from2   to2 count2
#     <int> <int>  <num>
#  1: 10020 10020     30
#  2: 10020 20020      0
#  3: 10020 22001      0
#  4: 10020 32001      0
#  5: 10020 45001      0
#  6: 10020 45002      0
#  7: 30030 10020      0
#  8: 30030 20020      2
#  9: 30030 22001      0
# 10: 30030 32001      0
# ---
# 15: 45001 22001      0
# 16: 45001 32001    100
# 17: 45001 45001      1
# 18: 45001 45002      0
# 19: 90080 10020      0
# 20: 90080 20020      0
# 21: 90080 22001      0
# 22: 90080 32001      0
# 23: 90080 45001      0
# 24: 90080 45002      0
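
The merge and the summary can also be folded into a single join. The following is a sketch of one possible variant, not the answer's own code, assuming the question's sample data: it joins `df1` onto the cross join and aggregates once per pair with `by = .EACHI`. The names `pairs` and `df2` are chosen here for illustration.

```r
library(data.table)

# Sample rows copied from the question; NA marks a missing count.
df1 <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20L, 10L, NA, 2L, 100L, NA, 1L, NA)
)

# Every from/to pair; CJ() returns them sorted.
pairs <- CJ(from1 = unique(df1$from1), to1 = unique(df1$to1))

# Join df1 onto the pairs and evaluate j once per row of `pairs` (by = .EACHI).
# Pairs without a match see count = NA, which na.rm = TRUE turns into a sum of 0.
df2 <- df1[pairs, on = .(from1, to1),
           .(count_total = sum(count, na.rm = TRUE)), by = .EACHI]

# Rename the key columns to match the output the question asks for.
setnames(df2, c("from1", "to1"), c("from2", "to2"))
```

Because the aggregation happens inside the join, duplicated pairs in `df1` are summed rather than overwritten, and pairs that never occur come out as 0 without a separate fill step.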

0 votes s_baldur 9/8/2023 #3

Using set():

library(data.table)

df2 <- df1[, CJ(from1, to1, unique = TRUE)][, count := 0L]

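# CJ has already created a key on (from1, to1), so the lookup below is a keyed join
for (i in seq_len(nrow(df1))) {
  if (is.na(df1$count[i])) next                # treat NA counts as 0
  row <- df2[df1[i], which = TRUE]             # row of df2 holding this from/to pair
  set(df2, row, "count", value = df2[row, count] + df1$count[i])
}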
