How to replace an entire row between two rows based on a column

Asked by: neural_axon Asked: 6/13/2023 Last edited by: zephryl, neural_axon Updated: 6/14/2023 Views: 53

Q:

I am working with a very large mRNA splicing dataset. Here is a toy dataset that illustrates the problem:

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>

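For each unique value in the gene_id column, I want to replace the entire row that sits between the "Upstream" and "Downstream" values in the exon_identity column, i.e. replace row 3 with row 5. What makes this difficult for me is that some genes in the gene_id column have no row that needs replacing, for example "B".

This question goes in the direction of questions asked previously here and here.

Based on these and other resources, I tried: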

library(tidyverse)

test_replace <- test_df %>% 
  group_by(gene_id) %>% 
  mutate(start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
         end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
         exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
         )


Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 
> 
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>        
1     2     8 A       NA           
2     9    12 A       Upstream     
3    NA    NA A       Event        
4    19    24 A       Downstream   
5    13    16 A       Event        
6    20    24 B       NA           
7    25    30 B       Upstream     
8    35    38 B       Downstream   
9    39    45 B       NA     
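
A note on where the warning and the NA values likely come from: comparing a column that contains NA with == returns NA for those elements, and subsetting with an NA index yields an NA element. So start[exon_identity == "Event"] has length 3 in group "A" rather than 1, and replace() uses its first element, which is NA. A minimal base R sketch using the values of group "A" (the names naive and safe are only for illustration):

```r
# Values of group "A" from the toy data
start <- c(2, 9, 13, 19, 13)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# == propagates NA, and an NA index produces an NA element,
# so this vector has three elements instead of one
naive <- start[exon_identity == "Event"]
print(naive)

# which() returns only the positions that are TRUE and drops NA
safe <- start[which(exon_identity == "Event")]
print(safe)
```

Wrapping the condition in which() gives a replacement value of length 1, which removes the length mismatch. It does not by itself drop the original "Event" row.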

Desired output:


> desired_outcome 
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>

A solution that uses tidyverse packages would be preferred and much appreciated.

Thanks!

r dplyr data-manipulation data-cleaning

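Comments

0 upvotes zephryl 6/13/2023
Will there ever be multiple rows between "Upstream" and "Downstream"? What about multiple "replacement" rows after "Downstream"?
0 upvotes neural_axon 6/13/2023
In cases where a row needs replacing, there is always exactly one row between "Upstream" and "Downstream". The "Event" row can be several rows below "Downstream"; it is not necessarily directly below it.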

A:

2 upvotes Melissa Key 6/13/2023 #1

In the toy example, reordering the dataset gets you almost everything you need. Does this work on the real dataset? For example:

library(tidyverse)
test_df |>
  mutate(
    sandwich = lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream')
  ) |>
  replace_na(list(sandwich = FALSE)) |>
  group_by(gene_id) |>
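  # .by_group = TRUE is required for arrange() to sort within gene_id;
  # without it, arrange() ignores the grouping
  arrange(start, .by_group = TRUE) |>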
  ungroup() |>
  filter(!sandwich) |>
  select(-sandwich)

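(In the toy example, group_by and ungroup are not needed. I added them in case they are needed or useful in the real dataset.)

Comments

0 upvotes zephryl 6/13/2023
Nice insight. You can simplify the pipeline (albeit with a more complex filter() condition) to: test_df |> filter(!replace_na(lag(exon_identity == 'Upstream') & lead(exon_identity == 'Downstream'), FALSE)) |> arrange(gene_id, start)
0 upvotes neural_axon 6/13/2023
Thanks! This is a very elegant solution and exactly what I needed.
0 upvotes zephryl 6/13/2023 #2

If @MelissaKey is right about the structure of the actual data, their solution will work nicely. Otherwise, here is a function that does the job with group_modify():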
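
To see which row the sandwich flag marks, here is a small sketch on the toy data. It differs from the code above in two ways that are assumptions of this sketch, not part of the answer: the flag is computed within each gene_id, so that lag() and lead() cannot look across gene boundaries, and coalesce() stands in for replace_na().

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

flagged <- test_df |>
  group_by(gene_id) |>
  mutate(
    # TRUE only when the previous row is "Upstream" and the next row is
    # "Downstream"; NA results (from NA neighbours) are turned into FALSE
    sandwich = coalesce(
      lag(exon_identity) == "Upstream" & lead(exon_identity) == "Downstream",
      FALSE
    )
  ) |>
  ungroup()

# Position(s) of the flagged row(s) in the original row order
flagged_rows <- which(flagged$sandwich)
print(flagged_rows)
```

On the toy data only row 3, the row to be replaced in gene "A", is flagged; gene "B" has no flagged row.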

library(dplyr)
library(tidyr)

replace_rows <- function(x, ...) {
  is_bad <- replace_na(
    lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
    FALSE
  )
  if (any(is_bad)) {
    is_event <- replace_na(x$exon_identity == "Event", FALSE)
    x <- x %>%
      filter(!is_bad, !is_event) %>%
      add_row(
        filter(x, is_event),
        .before = which(is_bad)
      )
  }
  x
}

test_df %>% 
  group_by(gene_id) %>% 
  group_modify(replace_rows) %>%
  ungroup()
# A tibble: 8 × 4
  gene_id start   end exon_identity
  <chr>   <dbl> <dbl> <chr>        
1 A           2     8 <NA>         
2 A           9    12 Upstream     
3 A          13    16 Event        
4 A          19    24 Downstream   
5 B          20    24 <NA>         
6 B          25    30 Upstream     
7 B          35    38 Downstream   
8 B          39    45 <NA>
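
As a quick sanity check, a result can be compared column by column with the desired output from the question. The sketch below does this for the filter-and-arrange one-liner from the comments on the first answer; the data frame desired is typed in by hand from the question, and the names desired and result are only for illustration.

```r
library(dplyr)
library(tidyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

# Desired output, copied by hand from the question
desired <- data.frame(
  start = c(2, 9, 13, 19, 20, 25, 35, 39),
  end = c(8, 12, 16, 24, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", "Event", "Downstream", NA, "Upstream", "Downstream", NA)
)

# Drop the row sandwiched between "Upstream" and "Downstream",
# then sort so the "Event" row moves into its place
result <- test_df |>
  filter(!replace_na(
    lag(exon_identity == "Upstream") & lead(exon_identity == "Downstream"),
    FALSE
  )) |>
  arrange(gene_id, start)

# Compare each column separately; stopifnot() errors on any mismatch
stopifnot(
  identical(result$start, desired$start),
  identical(result$end, desired$end),
  identical(result$gene_id, desired$gene_id),
  identical(result$exon_identity, desired$exon_identity)
)
```

The same comparison can be applied to the group_modify() result, keeping in mind that its column order differs (gene_id comes first).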