Asked by: threadofmotion Asked: 6/10/2023 Last edited by: threadofmotion Updated: 6/12/2023 Views: 54
Filling in matrix with unique matches between two dataframes in R
Q:
First, I have a dataframe of GO terms and associated genes (go.d5g):
ID Gene Term
1 GO:0001922 ABL1 B-1 B cell homeostasis
2 GO:0001922 HIF1A B-1 B cell homeostasis
3 GO:0001922 TNFAIP3 B-1 B cell homeostasis
4 GO:0001922 SH2B2 B-1 B cell homeostasis
5 GO:0002901 ADA mature B cell apoptotic process
6 GO:0001777 BAX T cell homeostatic proliferation
Then, I have a dataframe of differentially expressed genes from various experimental comparisons (deg):
L2FC Gene diffexp comp
1 -2.754236 SLC13A2 Downregulated NS.CB.A,S.ED.A
2 3.161623 SNAI2 Upregulated NS.CB.A,S.ED.A
3 -2.821350 STYK1 Downregulated NS.CB.A,S.ED.A
4 -1.798022 CD84 Downregulated NS.CB.A,S.ED.A
5 -1.293536 TLE6 Downregulated NS.CB.A,S.ED.A
6 -1.011016 P2RX1 Downregulated NS.CB.A,S.ED.A
I would like a matrix of 0/1 values marking matches between the unique values of deg$Gene and go.d5g$ID. Here is a made-up example:
GO:0001922 GO:0002901 GO:0001777 GO:0006924 GO:0033153 GO:0002204
SLC13A2 1 1 0 0 0 0
SNAI2 0 0 0 0 0 0
STYK1 0 1 1 0 1 0
CD84 0 0 0 0 0 0
TLE6 0 1 1 0 0 0
P2RX1 0 0 0 0 0 1
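So the rows of the matrix are the unique genes from the experimental set, and the columns are the unique IDs from the GO database.
How would I fill in the matching genes with 1? I currently have something very rough, like: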
g.u <- unique(deg$Gene)
goid.u <- unique(go.d5g$ID)
cmat <- matrix(0,nrow=length(g.u),ncol=length(goid.u))
rownames(cmat) <- g.u
colnames(cmat) <- goid.u
for (i in 1:length(g.u)) {
go.match <- unlist(lapply(g.u[i], function(x) which(go.d5g$Gene %in% x)))
go.match2 <- go.d5g$ID[go.match]
cmat[i,which(goid.u %in% go.match2)] <- 1
}
After fixing a bunch of issues, I think this is working in a crude way, but perhaps there is a better solution.
sum(cmat)
[1] 1457
cmat.o <- cmat[order(rowSums(cmat),decreasing=T),order(colSums(cmat),decreasing=T)]
cmat.o[1:10,1:5]
GO:0006355 GO:0043066 GO:0006468 GO:0043065 GO:0006338
TNF 0 0 0 1 0
SOX9 0 1 1 0 1
ABL1 1 0 1 1 0
IL10 0 1 0 0 0
KIT 0 0 0 0 0
IL1B 0 0 0 0 0
CCL3 0 0 0 0 0
THBS1 0 1 0 0 0
ROCK2 0 0 1 0 0
FLNA 0 1 0 0 0
Thanks!
A:
1 upvote
LMc
6/10/2023
#1
Update
Based on your comments:
library(dplyr)
library(tidyr)
full_join(go.d5g, deg, by = "Gene") |>
mutate(matched = as.numeric(!is.na(ID))) |>
pivot_wider(id_cols = Gene, names_from = ID, values_from = matched, values_fill = 0L) |>
filter(Gene %in% deg$Gene) |>
select(-any_of("NA"))
Here, you join the two dataframes to find matches and then pivot the data. Finally, you keep only the genes in deg$Gene.
Previous answer
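Because the posted sample data contains no overlapping genes, the effect of the join-and-pivot pipeline is easier to see on a toy input. The sketch below uses made-up dataframes `go_toy` and `deg_toy` (hypothetical stand-ins for go.d5g and deg, not the asker's data) and assumes each gene appears only once in the deg-like table:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, chosen so that some genes overlap
go_toy <- data.frame(
  ID   = c("GO:0001922", "GO:0001922", "GO:0002901", "GO:0001777"),
  Gene = c("ABL1", "CD84", "CD84", "BAX")
)
deg_toy <- data.frame(
  Gene = c("CD84", "ABL1", "TLE6"),
  L2FC = c(-1.8, 2.1, -1.3)
)

res <- full_join(go_toy, deg_toy, by = "Gene") |>
  # genes with no GO annotation get ID = NA after the join, so matched = 0
  mutate(matched = as.numeric(!is.na(ID))) |>
  pivot_wider(id_cols = Gene, names_from = ID,
              values_from = matched, values_fill = 0) |>
  # keep only genes from the experiment, then drop the column created by ID = NA
  filter(Gene %in% deg_toy$Gene) |>
  select(-any_of("NA"))

res
```

In this toy case CD84 is marked under GO:0001922 and GO:0002901, ABL1 only under GO:0001922, and TLE6 is kept as an all-zero row. GO:0001777 also remains as an all-zero column, because its only gene (BAX) is not in `deg_toy`.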
library(dplyr)
library(tidyr)
go.d5g |>
mutate(in_deg = as.numeric(Gene %in% deg$Gene)) |>
select(ID, in_deg, Gene) |>
pivot_wider(names_from = ID, values_from = in_deg, values_fill = 0L)
If you need Gene as row names instead of a column, just add the following to the pipe:
tibble::column_to_rownames("Gene")
Output
Gene `GO:0001922` `GO:0002901` `GO:0001777`
<chr> <dbl> <dbl> <dbl>
1 IEA25 0 0 0
2 IEA3091 0 0 0
3 ISS7128 0 0 0
4 IEA10603 0 0 0
5 IEA100 0 0 0
6 IEA581 0 0 0
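The question asks for a matrix, while the pipeline returns a tibble. A minimal sketch of the conversion, assuming a wide tibble shaped like the output above (the `wide` object and its values here are made up for illustration):

```r
library(tibble)

# Hypothetical wide result: one row per gene, one 0/1 column per GO ID
wide <- tibble(
  Gene = c("ABL1", "CD84", "TLE6"),
  `GO:0001922` = c(1, 1, 0),
  `GO:0002901` = c(0, 1, 0)
)

cmat <- wide |>
  column_to_rownames("Gene") |>  # move Gene into the row names
  as.matrix()                    # numeric matrix with gene rows and GO ID columns

cmat
```

After this step, functions such as rowSums() and colSums() can be applied as in the question.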
Data
go.d5g <- structure(list(ID = c("GO:0001922", "GO:0001922", "GO:0001922",
"GO:0001922", "GO:0002901", "GO:0001777"), Gene = c("IEA25",
"IEA3091", "ISS7128", "IEA10603", "IEA100", "IEA581"), Term = c("B-1 B cell homeostasis",
"B-1 B cell homeostasis", "B-1 B cell homeostasis", "B-1 B cell homeostasis",
"mature B cell apoptotic process", "T cell homeostatic proliferation"
)), class = "data.frame", row.names = c(NA, -6L))
deg <- structure(list(L2FC = c(-2.754236, 3.161623, -2.82135, -1.798022,
-1.293536, -1.011016), Gene = c("SLC13A2", "SNAI2", "STYK1",
"CD84", "TLE6", "P2RX1"), diffexp = c("Downregulated", "Upregulated",
"Downregulated", "Downregulated", "Downregulated", "Downregulated"
), comp = c("NS.CB.A,S.ED.A", "NS.CB.A,S.ED.A", "NS.CB.A,S.ED.A",
"NS.CB.A,S.ED.A", "NS.CB.A,S.ED.A", "NS.CB.A,S.ED.A")), class = "data.frame", row.names = c(NA,
-6L))
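For comparison, the loop from the question can also be written in base R without iteration, by indexing the matrix with a two-column (row, column) matrix. This is only a sketch on made-up data (`go_toy` and `deg_toy` are hypothetical stand-ins for go.d5g and deg), not part of the tidyverse approach above:

```r
# Hypothetical toy data, chosen so that some genes overlap
go_toy <- data.frame(
  ID   = c("GO:0001922", "GO:0001922", "GO:0002901", "GO:0001777"),
  Gene = c("ABL1", "CD84", "CD84", "BAX")
)
deg_toy <- data.frame(Gene = c("CD84", "ABL1", "TLE6", "CD84"))

genes  <- unique(deg_toy$Gene)
go_ids <- unique(go_toy$ID)

# Start from an all-zero matrix: unique genes as rows, unique GO IDs as columns
cmat <- matrix(0, nrow = length(genes), ncol = length(go_ids),
               dimnames = list(genes, go_ids))

# Keep only the annotation rows whose gene occurs in the experiment
hits <- go_toy[go_toy$Gene %in% genes, ]

# Each (row, column) pair in the index matrix addresses one cell to set to 1
cmat[cbind(match(hits$Gene, genes), match(hits$ID, go_ids))] <- 1

cmat
```

Here duplicated genes in `deg_toy` collapse into a single row via unique(), and genes without any GO annotation (TLE6) stay as all-zero rows.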
Comments
0 upvotes
threadofmotion
6/10/2023
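This format looks good, but for some reason the returned values are all 0. To answer your earlier question, the go.d5g set is a list of GO terms with their associated genes, with a lot of overlap (genes appear under multiple GO:# IDs). In total there are 41032 genes associated with 3712 GO terms. I see now that I made a mistake: the rows should be the unique values of deg$Gene, 626 rows in total. I suppose I can also drop the all-zero columns and rows afterwards.
0 upvotes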
LMc
6/10/2023
In the example data you provided, no gene in go.d5g is also in deg.
0 upvotes
threadofmotion
6/10/2023
There are matches in the full dataset; given its size, I just don't know how to include it here. I added another example to the original post.
0 upvotes
LMc
6/10/2023
It would help to add a few rows to the input dataframes so that there are matches, and then update the expected output.
0 upvotes
LMc
6/12/2023
@threadofmotion I updated my answer based on your comments.