Asked by: neural_axon  Asked: 6/13/2023  Last edited by: zephryl, neural_axon  Updated: 6/14/2023  Views: 53
How to replace an entire row between two rows based on a column
Q:
I am working with a very large mRNA splicing dataset. Here is a toy dataset to illustrate the problem:
test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)
> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>
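For each unique value in the gene_id column, I want to replace the entire row that sits between the "Upstream" and "Downstream" values in the exon_identity column, i.e. replace row 3 with row 5. What makes this hard for me is that some genes in the gene_id column have no row that needs replacing, e.g. "B" in gene_id.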
Based on these and other resources, I tried:
library(tidyverse)
test_replace <- test_df %>%
  group_by(gene_id) %>%
  mutate(
    start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
    end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
    exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
  )
Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
>
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>
1     2     8 A       NA
2     9    12 A       Upstream
3    NA    NA A       Event
4    19    24 A       Downstream
5    13    16 A       Event
6    20    24 B       NA
7    25    30 B       Upstream
8    35    38 B       Downstream
9    39    45 B       NA
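A likely reason for the NA values in row 3 and for the warning, sketched in base R: comparing a column that contains NA with == returns NA for those rows, and indexing with an NA logical returns an NA element. The replacement vector therefore has several elements, the first of which is NA. The vectors below only reproduce gene A from the toy data; the names values, event_value, between, and fixed_end are illustrative.

```r
# Gene A from the toy data
end <- c(8, 12, 18, 24, 16)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# NA == "Event" is NA, and an NA index yields an NA element,
# so this has three elements instead of one
values <- end[exon_identity == "Event"]
print(values)

# which() drops the NA comparisons and keeps only the true match
event_value <- end[which(exon_identity == "Event")]
print(event_value)

# Rows strictly between "Upstream" and "Downstream"
between <- seq_along(end) > which(exon_identity == "Upstream") &
  seq_along(end) < which(exon_identity == "Downstream")

# Replacing one row with one value produces no warning
fixed_end <- replace(end, between, event_value)
print(fixed_end)
```

This only accounts for the NA values. The original "Event" row would still be present afterwards and would have to be dropped separately.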
Desired output:
> desired_outcome
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>
A solution that uses tidyverse packages would be preferred and greatly appreciated.
Thanks!
A:
2 votes
Melissa Key
6/13/2023
#1
In the toy example, reordering the dataset gets you almost everything you need. Does this hold in the real dataset? For example:
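library(tidyverse)
test_df |>
  mutate(
    # flag a row whose previous row is "Upstream" and whose next row is "Downstream"
    sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
  ) |>
  replace_na(list(sandwich = FALSE)) |>
  group_by(gene_id) |>
  # arrange() ignores grouping unless .by_group = TRUE
  arrange(start, .by_group = TRUE) |>
  ungroup() |>
  filter(!sandwich) |>
  select(-sandwich)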
(In the toy example, group_by and ungroup are not needed. I added them in case they are needed or useful in the real dataset.)
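If the real data holds many genes, it may be safer to compute the flag within each gene so that lag() and lead() never look across a gene boundary. The following is a minimal sketch of that variant, not the answerer's code. It assumes that every flagged row has a matching "Event" row in the same gene and that sorting by start puts that row in the right place; a flagged row with no "Event" counterpart would simply be dropped.

```r
library(dplyr)
library(tidyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

result <- test_df |>
  group_by(gene_id) |>
  mutate(
    # TRUE only when the previous row of the same gene is "Upstream"
    # and the next row of the same gene is "Downstream"
    sandwich = replace_na(
      lag(exon_identity == "Upstream") & lead(exon_identity == "Downstream"),
      FALSE
    )
  ) |>
  ungroup() |>
  filter(!sandwich) |>       # drop the row to be replaced
  select(-sandwich) |>
  arrange(gene_id, start)    # the "Event" row sorts into the vacated position

print(result)
```

On the toy data this returns eight rows, with the "Event" row in third position for gene A and gene B unchanged.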
Comments
0 votes
zephryl
6/13/2023
Nice insight. You could simplify the pipeline (albeit with a more complex filter() condition) to:
test_df |> filter(!replace_na(lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream'), FALSE)) |> arrange(gene_id, start)
0 votes
neural_axon
6/13/2023
Thanks! This is a very elegant solution that fits my needs perfectly.
0 votes
zephryl
6/13/2023
#2
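If @MelissaKey is right about the structure of your actual data, their solution will work nicely. Otherwise, here is a function that does the job when applied with group_modify():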
library(dplyr)
library(tidyr)
replace_rows <- function(x, ...) {
  # rows sitting directly between an "Upstream" row and a "Downstream" row
  is_bad <- replace_na(
    lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
    FALSE
  )
  if (any(is_bad)) {
    is_event <- replace_na(x$exon_identity == "Event", FALSE)
    # drop the bad row and the "Event" row, then put the "Event" row
    # back at the position the bad row occupied
    x <- x %>%
      filter(!is_bad, !is_event) %>%
      add_row(
        filter(x, is_event),
        .before = which(is_bad)
      )
  }
  x
}
test_df %>%
  group_by(gene_id) %>%
  group_modify(replace_rows) %>%
  ungroup()
# A tibble: 8 × 4
  gene_id start   end exon_identity
  <chr>   <dbl> <dbl> <chr>
1 A           2     8 <NA>
2 A           9    12 Upstream
3 A          13    16 Event
4 A          19    24 Downstream
5 B          20    24 <NA>
6 B          25    30 Upstream
7 B          35    38 Downstream
8 B          39    45 <NA>
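One detail visible in the output above is that group_modify() returns the grouping column gene_id first, whereas the original data has it third. A minimal sketch of restoring the original column order with relocate() follows. An identity function stands in for replace_rows so that the block runs on its own; the names modified and restored are illustrative.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

# group_modify() prepends the grouping column to each group's result
modified <- test_df %>%
  group_by(gene_id) %>%
  group_modify(~ .x) %>%   # identity function, a stand-in for replace_rows
  ungroup()
print(names(modified))

# relocate() moves gene_id back to its place after end
restored <- modified %>%
  relocate(gene_id, .after = end)
print(names(restored))
```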