Pairwise count of common values in rows of a data.frame

Question:

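I have a data frame with many rows (>9000) and columns (148). The first column holds a unique code for each experiment, and the other columns hold the names of the clones tested in that experiment. I want a matrix containing the number of clones that each pair of experiments has in common (pairwise).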

An example of my dataset:

Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18

What I need:

Exp1  Exp2  2
Exp1  Exp3  3
Exp1  Exp4  1    
Exp1  Exp5  4
Exp1  Exp6  1
Exp2  Exp3  1
Exp2  Exp4  2
...

And so on, for all pairs of rows. Any suggestions? Thank you in advance. I've been at this for hours and can't find a way to solve it.

Tags: r, dataframe, pairwise comparison

Since `dplyr 1.1.0`, an inequality join with `join_by()` does this directly:

library(dplyr)

df_long <- df %>%
  tidyr::pivot_longer(contains('Clone'), names_to = NULL)

df_long %>%
  inner_join(df_long, by = join_by(value, y$Exp_No > x$Exp_No)) %>%
  count(Exp_No.x, Exp_No.y)

# # A tibble: 15 × 3
#    Exp_No.x Exp_No.y     n
#    <chr>    <chr>    <int>
#  1 Exp1     Exp2         2
#  2 Exp1     Exp3         3
#  3 Exp1     Exp4         1
#  4 Exp1     Exp5         4
#  5 Exp1     Exp6         1
#  6 Exp2     Exp3         1
#  7 Exp2     Exp4         2
#  8 Exp2     Exp5         2
#  9 Exp2     Exp6         1
# 10 Exp3     Exp4         2
# 11 Exp3     Exp5         3
# 12 Exp3     Exp6         2
# 13 Exp4     Exp5         1
# 14 Exp4     Exp6         2
# 15 Exp5     Exp6         1

For `dplyr` versions before 1.1.0:

df_long %>%
  inner_join(df_long, by = "value") %>%
  filter(Exp_No.y > Exp_No.x) %>%
  count(Exp_No.x, Exp_No.y)
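
When rows hold different numbers of clones and are padded with `NA`, the long-format join above can pair one `NA` with another, because `dplyr` joins match `NA` to `NA` by default. A minimal sketch of one way around this, assuming the padding is real `NA` values: drop them while reshaping with `values_drop_na = TRUE`. The toy data frame `df_na` is hypothetical and is not the asker's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows have different numbers of clones, padded with NA
df_na <- data.frame(
  Exp_No = c("Exp1", "Exp2", "Exp3"),
  Clone1 = c("Egxn2", "Egxn4", "Egxn2"),
  Clone2 = c("Egxn6", "Egxn6", NA),
  Clone3 = c(NA_character_, NA_character_, NA_character_)
)

# Reshape to long format and drop the NA padding so it can never be matched
df_long_na <- df_na %>%
  pivot_longer(contains("Clone"), names_to = NULL, values_drop_na = TRUE)

# Self-join on the clone name, keep each unordered pair once, count matches
# (recent dplyr versions may warn about a many-to-many relationship here)
res <- df_long_na %>%
  inner_join(df_long_na, by = "value") %>%
  filter(Exp_No.y > Exp_No.x) %>%
  count(Exp_No.x, Exp_No.y)

res
```

Pairs with no clone in common (here `Exp2` and `Exp3`) do not appear in the result at all, rather than appearing with a count of zero.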

Data

df <- read.table(text = "
Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18", header = TRUE)

library(Matrix) # for sparse matrices
library(data.table) # final solution will be stored as a data.table

m <- as(
  triu( # get the upper triangle of the symmetric matrix
    tcrossprod( # tcrossprod to get the pairwise common clone counts
      # convert the data.frame to a sparse matrix with nrow(df) rows and one
      # column per clone number (the numeric suffix of the clone name, so the
      # column count is the largest suffix present in the dataset)
      sparseMatrix(
        rep(1:131, 165)[i <- which(!is.na(cl <- unlist(df[,-1], 0, 0)))],
        as.numeric(gsub("Egxn", "", cl[i])),
        x = 1L
      )
    ), k = 1 # don't keep the diagonal (comparing rows with themselves)
  ), "TsparseMatrix" # convert to triplet form so the i, j and x slots exist
)

# build the final answer
dtPairs <- setorder(
  data.table(
    # the Exp1 and Exp2 columns are row indices from df
    # sparse matrix indices are zero-based, so add one
    Exp1 = attr(m, "i") + 1L,
    Exp2 = attr(m, "j") + 1L,
    Common = attr(m, "x")
  ), Exp1, Exp2 # sort by Exp1 then by Exp2
)

dtPairs[1:10,]
#>     Exp1 Exp2 Common
#>  1:    1    2     18
#>  2:    1    3     26
#>  3:    1    4     11
#>  4:    1    5     13
#>  5:    1    6     25
#>  6:    1    7      6
#>  7:    1    8     10
#>  8:    1    9     17
#>  9:    1   10     23
#> 10:    1   11     13

nrow(dtPairs)
#> [1] 8515
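
The `tcrossprod()` idea can be checked on the six-row example from the question. The following is a rough base R sketch, not the answerer's code: it builds a 0/1 incidence matrix (experiments by clones) with `table()`, which drops `NA` by default, and multiplies it by its own transpose so that entry (i, j) is the number of clones shared by experiments i and j. The names `inc`, `common` and `pairs` are made up for illustration.

```r
df <- read.table(text = "
Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18", header = TRUE)

# unlist() walks the clone columns one after another, so the experiment
# labels are repeated once per clone column to line up with it
exp_id <- rep(df$Exp_No, times = ncol(df) - 1)
clone  <- unlist(df[-1], use.names = FALSE)

# 0/1 incidence matrix: rows are experiments, columns are clones
inc <- 1L * (unclass(table(exp_id, clone)) > 0)

# full symmetric matrix of pairwise common-clone counts
common <- tcrossprod(inc)

# optional: the upper triangle as a long table of pairs
idx <- which(upper.tri(common), arr.ind = TRUE)
pairs <- data.frame(
  Exp1   = rownames(common)[idx[, 1]],
  Exp2   = colnames(common)[idx[, 2]],
  Common = common[idx]
)
pairs <- pairs[order(pairs$Exp1, pairs$Exp2), ]
```

The diagonal of `common` holds the number of distinct clones in each experiment. A dense matrix like this is fine for a few hundred rows; for thousands of experiments the sparse version above is likely the better fit.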

Data:

df <- cbind(
  data.frame(Exp_No = paste0("Exp", 1:131)),
  matrix(
    replicate(131, c(sample(paste0("Egxn", 1:1e3), cl <- sample(100:165, 1)), rep(NA_character_, 165L - cl))),
    131, 165, 1, list(NULL, paste0("Clone", 1:165))
  )
)
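
The sparse-matrix code above hard-codes the dimensions (131 and 165) and relies on every clone name being `Egxn` plus a number. As a hedged sketch of how the same approach might be generalised, the version below takes the dimensions from the data frame and uses factor codes as column indices, so arbitrary clone names should work. The small `df_demo` data frame and the variable names are illustrative assumptions.

```r
library(Matrix)

# Hypothetical toy data with NA padding
df_demo <- data.frame(
  Exp_No = c("Exp1", "Exp2", "Exp3"),
  Clone1 = c("Egxn2", "Egxn4", "Egxn2"),
  Clone2 = c("Egxn6", "Egxn6", NA),
  stringsAsFactors = FALSE
)

cl   <- unlist(df_demo[-1], use.names = FALSE)   # clone names, column by column
keep <- !is.na(cl)                               # positions that hold a clone
rows <- rep(seq_len(nrow(df_demo)), times = ncol(df_demo) - 1L)
clone_id <- factor(cl[keep])                     # one integer code per clone

# experiments x clones incidence matrix, with the NA padding left out
sm <- sparseMatrix(
  i = rows[keep],
  j = as.integer(clone_id),
  x = 1,
  dims = c(nrow(df_demo), nlevels(clone_id))
)

# strict upper triangle of the pairwise counts, in triplet form
m <- as(triu(tcrossprod(sm), k = 1), "TsparseMatrix")

# slot indices are zero-based, so add one before looking up the labels
pairs_demo <- data.frame(
  Exp1   = df_demo$Exp_No[m@i + 1L],
  Exp2   = df_demo$Exp_No[m@j + 1L],
  Common = m@x
)
```

As with the original, only pairs sharing at least one clone are stored, so pairs with nothing in common are absent from `pairs_demo`.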

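Hi jblood94 and @Darren Tsai, I can't get the code to work on the full dataset. I've removed some records I can do without, and my dataset now consists of 131 experiments, each with a different number of clones planted (up to 165). Because the number of clones planted is not constant across experiments, some rows contain NAs. Is there a way for me to upload the csv so that you can run your scripts? I'm not sure whether the NAs are causing the script to miscount the common clones. Thanks again for your help, much appreciated!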

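0 votes · jblood94 · 2/6/2023

You just need to remove the `NA`s when building the sparse matrix. See the update.

0 votes · Ilaria · 2/15/2023

Hi jblood94, thanks again for your help! I will try running the script again. Ilaria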
