Find all unique combinations of removing a duplicate in groups from a data set

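Q:

I am trying to work out how to write R code that finds every combination of ways to remove duplicates across the different groups of a data set, and that returns a list of all the resulting data sets.

Example test data: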

groups <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)
id <- 1:9
dat <- data.frame(
id = id,
groups = groups,
value = value
)

Example of the desired output:

list <- dataset1, dataset2, dataset3

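dataset1, one combination of removing the duplicates from the groups:

groups	value
A	1
A	2
B	duplicate1
B	4
C	duplicate2
C	5
C	6

Another combination, dataset2:

groups	value
A	1
A	2
A	duplicate1
B	4
C	duplicate2
C	5
C	6

How do I find all the combinations of ways to remove duplicate1 and duplicate2 across groups A, B and C? I want to return all the combinations of data sets as a list().

I have tried (nested) loops, combn() and expand.grid(), but I'm not smart enough to come up with a solution. I have also looked for similar solutions, but the ones I found do not remove a row and its duplicates from the data set.

Thanks a lot in advance for your help.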

r nested-loops

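A:

library(dplyr)

# Sort by value so that equal values sit next to each other
dat %>% 
  arrange(value) %>% 
  # Flag a row as a duplicate when its value equals the previous row's value
  mutate(Duplicate = ifelse(row_number() == 1, FALSE, value == lag(value))) %>% 
  # Restore the original order, then drop the flagged rows and the helper column
  arrange(id) %>% 
  filter(!Duplicate) %>% 
  select(-Duplicate)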
  id groups      value
1  1      A          1
2  2      A          2
3  3      A duplicate1
4  5      B          4
5  6      B duplicate2
6  8      C          5
7  9      C          6

This matches your (alternative) expected output.
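
For this single-result case, base R's duplicated() should select the same rows without any sorting, because it flags every occurrence of a value after the first. A minimal sketch, assuming the dat from the question (dedup_first is a name introduced here purely for illustration):

```r
# Test data from the question
dat <- data.frame(
  id = 1:9,
  groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  value = c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)
)

# duplicated() is TRUE for the second and later occurrences of a value,
# so negating it keeps only the first occurrence of each value
dedup_first <- dat[!duplicated(dat$value), ]

dedup_first$id  # 1 2 3 5 6 8 9
```

This only yields one of the possible data sets, so it does not replace the enumeration of all combinations.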

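library(tidyr)  # for pivot_longer()

# datCopy is not defined in the text above; presumably it is dat with an
# explicit row number column, which the rest of the code refers to as Row
datCopy <- dat %>% mutate(Row = row_number())

combos <- datCopy %>% 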
  # Identify duplicate values
  group_by(value) %>% 
  filter(n() > 1) %>% 
  # Create a grid of all combinations of duplicated row ids
  group_map(
    function(.x, .y) {
      .x %>% pull(Row)
    }
  ) %>% 
  expand.grid() %>% 
  # Label the combinations
  mutate(ComboID = row_number()) %>% 
  # Convert from one column per duplicated value to one row
  # per duplicated value.  There is no need to identify which
  # row corresponds to which duplicated value, so drop name
  rowwise() %>% 
  pivot_longer(
    starts_with("Var"),
    values_to = "Row",
  ) %>% 
  select(-name)

combos
# A tibble: 8 × 2
  ComboID   Row
    <int> <int>
1       1     3
2       1     6
3       2     4
4       2     6
5       3     3
6       3     7
7       4     4
8       4     7

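At this point we have done the hard work. We can see the possible combinations of row exclusions (2 options for duplicate1 and another 2 for duplicate2, for a total of 2 x 2 = 4). Within each distinct value of ComboID, the values of Row identify the rows to exclude.

So now iterate over combos, processing (a copy of) dat to exclude the rows as required.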
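
The 2 x 2 = 4 count can be checked directly from the data. A small sketch in base R, assuming the value vector from the question (counts, dup_counts and n_combos are names introduced here for illustration):

```r
value <- c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)

# How many times each value occurs
counts <- table(value)

# Keep only the values that occur more than once
dup_counts <- counts[counts > 1]

# One row is chosen per duplicated value, so the numbers of choices multiply
n_combos <- prod(dup_counts)

n_combos  # 4
```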

combos %>% 
  group_by(ComboID) %>% 
  group_map(
    function(.x, .y) {
      datCopy %>% 
        anti_join(.x, by = "Row") %>% 
        select(-Row)
    }
  )
[[1]]
  id groups      value
1  1      A          1
2  2      A          2
3  4      B duplicate1
4  5      B          4
5  7      C duplicate2
6  8      C          5
7  9      C          6

[[2]]
  id groups      value
1  1      A          1
2  2      A          2
3  3      A duplicate1
4  5      B          4
5  7      C duplicate2
6  8      C          5
7  9      C          6

[[3]]
  id groups      value
1  1      A          1
2  2      A          2
3  4      B duplicate1
4  5      B          4
5  6      B duplicate2
6  8      C          5
7  9      C          6

[[4]]
  id groups      value
1  1      A          1
2  2      A          2
3  3      A duplicate1
4  5      B          4
5  6      B duplicate2
6  8      C          5
7  9      C          6

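I believe this is what you want.

The algorithm should be robust with respect to the number of duplicated values and the number of times each value is duplicated.

I have broken the algorithm down into smaller steps. If you wish, you could combine them all into a single pipe.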
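
As one possible way to package the steps, here is a base R sketch rather than the dplyr pipe described above. The function name all_dedup_combos and its col argument are hypothetical, and the sketch assumes the column contains no NA values. It chooses one row to keep per duplicated value instead of one row to exclude. With exactly two occurrences per value the two approaches give the same set of results, but when a value occurs three or more times, excluding a single row would still leave duplicates behind, so the keep-based variant is used here.

```r
# Hypothetical helper, not part of the original answer: returns one data frame
# for every way of keeping a single row per duplicated value.
# Assumes dat[[col]] contains no NA values.
all_dedup_combos <- function(dat, col = "value") {
  vals <- dat[[col]]
  dup_vals <- unique(vals[duplicated(vals)])
  if (length(dup_vals) == 0) return(list(dat))

  # For each duplicated value, the row numbers at which it occurs
  rows_by_value <- lapply(dup_vals, function(v) which(vals == v))

  # Every way of choosing one row to keep per duplicated value
  keep_grid <- expand.grid(rows_by_value)

  all_dup_rows <- unlist(rows_by_value)
  lapply(seq_len(nrow(keep_grid)), function(i) {
    keep <- unlist(keep_grid[i, , drop = FALSE])
    drop_rows <- setdiff(all_dup_rows, keep)
    # Logical indexing avoids the pitfall of negative indexing with an empty vector
    dat[!(seq_len(nrow(dat)) %in% drop_rows), , drop = FALSE]
  })
}

# Usage with the test data from the question
dat <- data.frame(
  id = 1:9,
  groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  value = c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)
)

res <- all_dedup_combos(dat)
length(res)  # 4
```

The order of the returned list follows expand.grid() and need not match the order of the output shown earlier.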
